Re: Incr/Decr Counters in Cassandra

2010-03-13 Thread Ryan Daum
I also desperately need this, and row/column TTLs.  Let me know if there's
anything I can do to help get either released in the near term.

R

On Sat, Mar 13, 2010 at 2:21 AM, Vijay  wrote:

> Badly need it for my work; let me know if I can do something to speed it
> up :)
>
> Regards,
> 
>
>
>
> On Wed, Nov 4, 2009 at 1:32 PM, Chris Goffinet  wrote:
>
> > Hey,
> >
> > At Digg we've been thinking about counters in Cassandra. In a lot of our
> > use cases we need this type of support from a distributed storage system.
> > Anyone else out there with similar needs? ZooKeeper actually has such
> > support (a sketch of the usual approach follows this thread), and we
> > might use that if we can't get the support in Cassandra.
> >
> > ---
> > Chris Goffinet
> > goffi...@digg.com
>
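
For reference, the ZooKeeper approach Chris mentions is typically built on
conditional setData rather than a dedicated counter type: an optimistic
read-modify-write loop. A minimal sketch, assuming the counter znode already
exists and holds a decimal string; class and path names are illustrative:

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ZkCounter {
        private final ZooKeeper zk;
        private final String path; // e.g. "/counters/diggs" -- illustrative

        public ZkCounter(ZooKeeper zk, String path) {
            this.zk = zk;
            this.path = path;
        }

        // Atomically add delta, retrying on version conflicts (optimistic CAS).
        public long add(long delta) throws KeeperException, InterruptedException {
            while (true) {
                Stat stat = new Stat();
                byte[] data = zk.getData(path, false, stat);
                long next = Long.parseLong(new String(data)) + delta;
                try {
                    // Succeeds only if nobody has updated the znode since we
                    // read it at stat.getVersion().
                    zk.setData(path, Long.toString(next).getBytes(), stat.getVersion());
                    return next;
                } catch (KeeperException.BadVersionException e) {
                    // Lost the race; re-read and retry.
                }
            }
        }
    }

Under contention every increment retries against a single znode, which is
part of why native counter support in Cassandra would still be attractive.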


Re: Looking for work

2010-03-02 Thread Ryan Daum
Maybe the wiki needs a job board?

On Tue, Mar 2, 2010 at 10:15 PM, Joe Stump  wrote:

> Us too at SimpleGeo! We're Python, Cassandra, Erlang, and a smattering of
> Java and C++.
>
> We have offices in Boulder, CO and SF.
>
> --Joe
>
> --
> Typed with big fingers on a small keyboard.
>
>
> On Mar 2, 2010, at 19:01, Peter Halliday  wrote:
>
>> I'm looking for work.  My previous employer was a non-profit that lost
>> funding and my position was cut.  I would love to find a position that
>> utilizes Cassandra.  I have experience programming in Python, Perl,
>> PHP, and C/C++ (mostly Python and Perl).  I have experience with system
>> and network administration as well.  I would certainly be willing to send a
>> resume covering my experience in more detail.
>>
>>
>> Peter Halliday
>> Excelsior Systems
>> (Phone:) 607-936-2172
>> (Cell:) 607-329-6905
>> (Fax:) 607-398-7928
>>
>


Re: map/reduce on Cassandra

2010-01-25 Thread Ryan Daum
On Mon, Jan 25, 2010 at 2:18 PM, Brandon Williams  wrote:

> bin/sstablekeys will dump just the keys from an sstable without row
> deserialization overhead, but it can't introspect a commitlog.
> -Brandon

Yes, but won't it also return keys that are replicas from ranges
'belonging' to other nodes? I.e., running it on every box across a cluster
with RF > 1 would return duplicates wherever the data was replicated. It
needs a flag to return each key only once.
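
A minimal sketch of the dedup pass this implies, assuming each node's
sstablekeys output has been collected into one text file per node (file
names are hypothetical): with RF > 1 the union of the dumps contains each
key up to RF times, and a set collapses them:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    public class UniqueKeys {
        public static void main(String[] args) throws IOException {
            Set<String> keys = new HashSet<String>();
            for (String dump : args) { // e.g. node1.keys node2.keys ...
                BufferedReader in = new BufferedReader(new FileReader(dump));
                String line;
                while ((line = in.readLine()) != null) {
                    keys.add(line); // replicas of the same key collapse here
                }
                in.close();
            }
            for (String key : keys) {
                System.out.println(key);
            }
        }
    }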

Ryan


Re: map/reduce on Cassandra

2010-01-25 Thread Ryan Daum
I agree with what Jeff says here about RandomPartitioner support being key.

For my purposes with map/reduce I'd personally be fine with some
general all-keys dump utility that wrote the contents of one node to a
file; I'd then just write my own integration from that file into
Hadoop, etc.

I guess I'm thinking of something similar to sstable2json, except that
sstable2json unfortunately dumps replica data, not just the local
node's data. Getting the contents of the commitlog into the file would
be nice, too.
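
For the "local node's data only" part, a hedged sketch of the filter such a
dump utility would need under RandomPartitioner: keep a key only if its MD5
token falls in the node's primary range (prevToken, myToken]. Class and
method names are illustrative, not Cassandra API:

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class PrimaryRangeFilter {
        private final BigInteger prevToken; // token of the previous node on the ring
        private final BigInteger myToken;   // this node's token

        public PrimaryRangeFilter(BigInteger prevToken, BigInteger myToken) {
            this.prevToken = prevToken;
            this.myToken = myToken;
        }

        // RandomPartitioner places keys on the ring by the absolute value
        // of their MD5 hash.
        static BigInteger token(String key) throws NoSuchAlgorithmException {
            byte[] digest = MessageDigest.getInstance("MD5").digest(key.getBytes());
            return new BigInteger(digest).abs();
        }

        // True if this node is the primary (non-replica) owner of the key.
        boolean isPrimary(String key) throws NoSuchAlgorithmException {
            BigInteger t = token(key);
            if (prevToken.compareTo(myToken) < 0) {
                return t.compareTo(prevToken) > 0 && t.compareTo(myToken) <= 0;
            }
            // The range wraps past the top of the ring.
            return t.compareTo(prevToken) > 0 || t.compareTo(myToken) <= 0;
        }
    }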

R

On Mon, Jan 25, 2010 at 1:47 PM, Jeff Hodges  wrote:
> 1) Works with RandomPartitioner. This is huge and the only way almost
> everyone would be able to use it.
> 2) Ability to divide up the keys of a single node among more than one
> mapper. The prototype just slurped up everything on the node. This
> would probably be easiest not as a configurable thing but just as part
> of the InputSplit calculation (see the sketch after this message).
> 3) Progress information should be calculated and displayed.
>  --
> Jeff
>
> On Mon, Jan 25, 2010 at 5:43 AM, Phillip Michalak  wrote:
>> Multiple people have expressed an interest in 'hadoop integration' and
>> 'map/reduce functionality' within Cassandra. I'd like to get a feel for what
>> that means to different people.
>>
>> As a starting point for discussion, Jeff Hodges undertook a prototype effort
>> last summer which was the subject of this thread:
>> http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3cf5f3a6290907240123y22f065edp1649f7c5c1add...@mail.gmail.com%3e.
>>
>> Jeff explicitly mentions data locality as one of the things that was out of
>> scope for the prototype. What other features or characteristics would you
>> expect to see in an implementation?
>>
>> Thanks,
>> Phil
>>
>
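
A minimal sketch of the arithmetic behind Jeff's point 2: carving one
node's primary range (prev, mine] into M sub-ranges, one per mapper. Pure
token math with no Cassandra or Hadoop API; the wrap-around case is ignored
for brevity, and names are illustrative:

    import java.math.BigInteger;

    public class SplitRanges {
        // Returns M+1 boundaries; mapper i scans (bounds[i-1], bounds[i]].
        static BigInteger[] boundaries(BigInteger prev, BigInteger mine, int m) {
            BigInteger span = mine.subtract(prev);
            BigInteger[] bounds = new BigInteger[m + 1];
            bounds[0] = prev;
            for (int i = 1; i <= m; i++) {
                // i-th boundary: prev + i * span / m
                bounds[i] = prev.add(
                        span.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(m)));
            }
            return bounds;
        }
    }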


Re: How to lower the number of has reached its threshold?

2010-01-20 Thread Ryan Daum
> The second question is really direct: do you think it's insane to use EC2
> small instances to build a Cassandra cluster? The machines are virtualized
> with 2 GB of memory.

We are using 6 EC2 small instances + EBS to build a cluster. I can't say
it's seen significant stress -- perhaps 50 writes a second -- but write
performance has been acceptable for us. Reads are not as fast as I would
like -- 3600 row keys with 60,000 super columns retrieved in about 10
seconds -- but again, acceptable for our purposes.

I'd be interested to see how you net out and what configuration you decide
on.  Large instances are unfortunately quite a bit more expensive than
small, but at least they have the advantage of a 64-bit CPU and more memory.

Ryan


Re: Cassandra and TTL

2010-01-13 Thread Ryan Daum
On Wed, Jan 13, 2010 at 6:19 PM, Jonathan Ellis  wrote:

> If he needs column-level granularity then I don't see any other option.
>
> If he needs CF-level granularity then truncate will work fine. :)
>

Are you saying the proposed truncate functionality will support 'truncate
all keys with timestamp < X'?

R


Re: Tuning and upgrades

2010-01-13 Thread Ryan Daum
Sounds like you have a similar configuration to us.

We have 6 EC2 small instances, with EBS for storage.

Nothing scientific for benchmarks right now, but typically we can retrieve
60,000 columns scattered across 3600 row keys in about 7-10 seconds.

Writes haven't been a bottleneck at all.

I also have a key distribution issue similar to what you describe. So I will
be attempting the same recipe as you shortly.

I'm very interested in what your experiences are running Cassandra on EC2.

Ryan

On Wed, Jan 13, 2010 at 2:26 PM, Anthony Molinaro <antho...@alumni.caltech.edu> wrote:

> Hi,
>
>  So after several days of closer examination, I've discovered
> something.  EC2 I/O performance is pretty bad.  Well okay, we already
> all knew that, and I have no choice but to deal with it, as moving
> at this time is not an option.  But what I've really discovered is that
> my data is unevenly distributed, which I believe is a result of using
> random partitioning without specifying tokens.  So what I think I can
> do to solve this is upgrade to 0.5.0rc3, add more instances, and use
> the tools to modify token ranges.  Toward that end I had a few
> questions about different topics.
>
> Data gathering:
>
>  When I run cfstats I get something like this
>
>  Keyspace: 
>Read Count: 39287
>Read Latency: 14.588 ms.
>Write Count: 13930
>Write Latency: 0.062 ms.
>
>  on a heavily loaded node and
>
>  Keyspace: 
>Read Count: 8672
>Read Latency: 1.072 ms.
>Write Count: 2126
>Write Latency: 0.000 ms.
>
>  on a lightly loaded node, but my question is: what is the timeframe
>  of the counts?  Does a read count of 8K mean that 8K reads are currently
>  in progress, 8K since the last time I checked, or 8K over some interval?
>
> Data Striping:
>
>  One option I have is to add additional EBS volumes, then either turn
>  on RAID 0 across several EBS volumes or possibly just add additional
>  <DataFileDirectory> elements to my config?  If I were to add
>  <DataFileDirectory> entries, can I just move sstables between
>  directories?  If so, I assume I want the Index, Filter and Data files
>  to be in the same directory?  Or is this data movement something
>  Cassandra will do for me?  Also, is this likely to help?
>
> Upgrades:
>
>  I understand that to upgrade from 0.4.x to 0.5.x I need to do something
>  like
>
>  1. turn off all writes to a node
>  2. call 'nodeprobe flush' on that node
>  3. restart the node with version 0.5.x
>
>  Is this correct?
>
> Data Repartitioning:
>
>  So it seems that if I first upgrade my current nodes to 0.5.0, then
>  bring up some new nodes with AutoBootstrap on, they should take some
>  data from the most loaded machines?  But let's say I just want to first
>  even out the load on existing nodes; would the process be something like
>
>  1. calculate ideal key ranges (i.e., i * (2**127 / N) for i = 1..N;
>     a sketch follows this message) -- this seems like the ideal
>     candidate for a new tool included with cassandra
>  2. foreach node: 'nodeprobe move' to ideal range
>  3. foreach node: 'nodeprobe clean'
>
>  Alternatively, it looks like I might be able to use 'nodeprobe
>  loadbalance' for step 2, and not use step 1?
>
> Also, anyone else running in EC2 and have any sort of tuning tips?
>
> Thanks,
>
> -Anthony
>
> --
> 
> Anthony Molinaro   
>
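
Following up on step 1 above, a minimal sketch of the ideal-token
calculation Anthony gives (i * (2**127 / N) for i = 1..N on
RandomPartitioner's MD5 ring; class name is illustrative):

    import java.math.BigInteger;

    public class IdealTokens {
        public static void main(String[] args) {
            int n = Integer.parseInt(args[0]); // number of nodes, e.g. 6
            BigInteger ring = BigInteger.ONE.shiftLeft(127); // 2**127
            BigInteger slice = ring.divide(BigInteger.valueOf(n));
            for (int i = 1; i <= n; i++) {
                // Ideal token for node i, to be fed to 'nodeprobe move'.
                System.out.println("node " + i + ": " + slice.multiply(BigInteger.valueOf(i)));
            }
        }
    }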


Re: Data Model Index Text

2010-01-13 Thread Ryan Daum
The only tricky point I saw with the Lucene 3.0 switch was that the
TokenStream API changed completely, and the IndexWriter in your code
depended on the old API.
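
For anyone hitting the same wall, a minimal sketch of the attribute-based
consumption loop that replaces the old next(Token) pull API in Lucene
2.9/3.0 (field name and sample text are made up):

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.Version;

    public class NewTokenStreamApi {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
            TokenStream ts = analyzer.tokenStream("body",
                    new StringReader("some text to index"));
            // Attributes replace the old Token object; fetch them once up front.
            TermAttribute term = ts.addAttribute(TermAttribute.class);
            while (ts.incrementToken()) { // replaces next(Token)
                System.out.println(term.term());
            }
            ts.close();
        }
    }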

I've ruled out OrderPreservingPartitioner for other jobs of mine because
the distribution of keys is likely not ideal across my cluster. I'm curious
whether the keys truly distribute well with Lucandra.

R

On Wed, Jan 13, 2010 at 10:26 AM, Jake Luciani  wrote:

> It "should" work, but not a ton has changed in 2.9/3.0 AFAIK.  I'm going
> to work on updating Lucandra to work with the 0.5 branch; I can try to
> update this as well.  BTW, if you want to see Lucandra in action, check out
> http://flocking.me (example: http://flocking.me/tjake )
>
> You can use a random partitioner if you store the entire index under a
> supercolumn (how it was originally implemented), but then you need to
> accept that the entire index will be in memory for any operation on that
> index (bad for big indexes).
>
> -Jake
>
>
> On Wed, Jan 13, 2010 at 9:14 AM, Ryan Daum  wrote:
>
>> On the topic of Lucandra, apart from having it work with 0.5 of Cassandra,
>> has any work been done to get it up to date with Lucene 2.9/3.0?
>>
>> Also, I'm a bit concerned about its use of OrderPreservingPartitioner; is
>> there a storage architecture that could be considered that would work
>> with RandomPartitioner?
>>
>> Ryan
>>
>>
>> On Tue, Jan 12, 2010 at 12:20 PM, ML_Seda  wrote:
>>
>>>
>>> I do see the classes now, but all the way back in version .20.  Is there
>>> a newer version of Lucandra?  It would be nice for us to use the latest
>>> Cassandra (trunk).


Re: Cassandra and TTL

2010-01-13 Thread Ryan Daum
Just to speak up here: I think it's a more common use-case than you're
imagining, even if maybe there's no reasonable way of implementing it.

I for one have plenty of use for a TTL on a key, though in my case the TTL
would be in days/weeks.

Alternatively, I know it's considered "wrong", but having a way of getting
all unique keys + timestamps from a RandomPartitioner would allow me to do
manual scavenging of my own. sstable2json is perhaps not appropriate because
it includes replicated data.

On Tue, Jan 12, 2010 at 11:56 AM, Jonathan Ellis  wrote:

> I'm skeptical that this is a common use-case...  If truncating old
> sstables entirely
> (https://issues.apache.org/jira/browse/CASSANDRA-531) meets your
> needs, that is going to be less work and more performant.
>
> -Jonathan
>
> On Tue, Jan 12, 2010 at 10:45 AM, Sylvain Lebresne  wrote:
> > Hello,
> >
> > I have to deal with a lot of different data, and Cassandra seems to be
> > a good fit for my needs so far. However, some of this data is volatile
> > by nature, and for those I would need to set something akin to a TTL.
> > Those TTLs could be long, but keeping that data forever would be useless.
> >
> > I could deal with that by hand, writing some daemon that runs regularly
> > and removes what should be removed. However, this is neither particularly
> > efficient nor convenient, and I would find it really cool to be able to
> > provide a TTL when inserting something and not have to care more than
> > that.
> >
> > Which leads me to my question: why doesn't Cassandra allow setting a TTL
> > for data? Is it for technical reasons? For philosophical reasons? Or has
> > nobody needed it badly enough to write it?
> >
> > From what I understand of how Cassandra works, it seems to me that it
> > could be done pretty efficiently (even though I agree that it wouldn't
> > be a minor change). That is, it would require adding a ttl to columns
> > (and/or rows). When reading a column whose timestamp + ttl has expired,
> > Cassandra would ignore it (as for tombstoned columns). Then during
> > compaction, expired columns would be collected (a sketch of the check
> > follows this message).
> >
> > Are there any major difficulties/obstacles I don't see?
> > Or maybe there is some trick I don't know about that already allows
> > such a thing?
> >
> > And if not, would that be something that would interest the Cassandra
> > community? Or does nobody ever need such a thing? (I personally believe
> > it to be a desirable feature, but maybe I am the only one.)
> >
> > Thanks,
> > Sylvain
> >
>
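
A minimal sketch of the read-path check Sylvain proposes, assuming a
hypothetical ttl field stored alongside each column's timestamp; names are
illustrative, not part of Cassandra's actual API. The same test applied
during compaction is what would let expired columns be collected instead of
rewritten:

    public final class TtlCheck {
        // True if a column with this timestamp and ttl (both in
        // milliseconds) should be treated like a tombstone at read time.
        // A ttl of 0 means "never expires".
        static boolean isExpired(long timestamp, long ttl, long now) {
            return ttl > 0 && timestamp + ttl <= now;
        }

        public static void main(String[] args) {
            long now = System.currentTimeMillis();
            long wroteYesterday = now - 24L * 60 * 60 * 1000;
            System.out.println(isExpired(wroteYesterday, 3600000L, now)); // true: 1h ttl
            System.out.println(isExpired(wroteYesterday, 0L, now));       // false: no ttl
        }
    }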


Re: Data Model Index Text

2010-01-13 Thread Ryan Daum
On the topic of Lucandra, apart from having it work with 0.5 of Cassandra,
has any work been done to get it up to date with Lucene 2.9/3.0?

Also, I'm a bit concerned about its use of OrderPreservingPartitioner; is
there a storage architecture that could be considered that would work
with RandomPartitioner?

Ryan

On Tue, Jan 12, 2010 at 12:20 PM, ML_Seda  wrote:

>
> I do see the classes now, but all the way back in version .20.  Is there a
> newer version of Lucandra?  It would be nice for us to use the latest
> Cassandra (trunk).