Re: Is SuperColumn necessary?

2010-05-06 Thread David Boxenhorn
That would be a good time to get rid of the confusing column term, which incorrectly suggests a two-dimensional tabular structure. Suggestions: 1. A hypercube (or hypocube, if only two dimensions): replace key and column with 1st dimension, 2nd dimension, etc. 2. A file system: replace key and

Re: Cassandra Streaming Service

2010-05-06 Thread vineet daniel
For more details have a look here : http://wiki.apache.org/cassandra/Streaming ___ Vineet Daniel ___ Let your email find you On Wed, May 5, 2010 at 9:34 PM, Weijun Li weiju...@gmail.com wrote: Thank you Jonathan!

why is streaming done in 32 MB chunks ?

2010-05-06 Thread vineet daniel
Hi, just out of curiosity, I want to know why streaming is done in 32 MB chunks and not in 16 or 64 MB chunks. Are there specific reasons behind 32 MB, or is it just like that? ___ Vineet Daniel ___ Let your email find you

Re: Is SuperColumn necessary?

2010-05-06 Thread Torsten Curdt
+1 on all of that On Thu, May 6, 2010 at 09:04, David Boxenhorn da...@lookin2.com wrote: That would be a good time to get rid of the confusing column term, which incorrectly suggests a two-dimensional tabular structure. Suggestions: 1. A hypercube (or hypocube, if only two dimensions):

How to initialize the Cassandra

2010-05-06 Thread Dop Sun
Hi, I just discovered that the json file exported by sstable2json contains more than the data itself, like deletedAt values. I'm wondering whether there is a tool that can import some initial data? When we work with a typical RDBMS system, this is how we do it: 1) Define the

Re: performance tuning - where does the slowness come from?

2010-05-06 Thread Vick Khera
On Wed, May 5, 2010 at 8:08 PM, Kyusik Chung kyu...@discovereads.com wrote: if the data from the sstables hasnt already been loaded into memory by mmap, load it into memory; if you're out of memory on the box, swap some of the old mmapped data out of memory mmap() does not copy your data into

Re: How to initialize the Cassandra

2010-05-06 Thread Jonathan Ellis
The simplest way is to just use thrift batch_mutate. If Cassandra CPU is your bottleneck then using the binary load method from StorageProxy can help (see contrib/bmt_example). If Cassandra disk or network is your bottleneck then binary load won't really help. On Thu, May 6, 2010 at 7:51 AM,

Re: performance tuning - where does the slowness come from?

2010-05-06 Thread Jonathan Ellis
columns, not CFs. put another way, how wide are the rows in the slow CF? On Wed, May 5, 2010 at 11:30 PM, Ran Tavory ran...@gmail.com wrote: I have a few CFs but the one I'm seeing slowness in, which is the one with plenty of cache misses has only one column per key. Latency varies b/w 10m

pagination through slices with deleted keys

2010-05-06 Thread Ian Kallen
I read the DistributedDeletes and the range_ghosts FAQ entry on the wiki which do a good job describing how difficult deletion is in an eventually consistent system. But practical application strategies for dealing with it aren't there (that I saw). I'm wondering how folks implement pagination in

Re: pagination through slices with deleted keys

2010-05-06 Thread Mark Greene
Hey Ian, I actually just wrote a quick example of how to iterate over a CF that may have tombstones. This may help you out: http://markjgreene.wordpress.com/2010/05/05/iterate-over-entire-cassandra-column-family/ On Thu, May 6, 2010 at 12:17 PM, Ian Kallen spidaman.l...@gmail.com wrote: I read
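
A rough sketch of the pagination idea discussed in this thread: page through rows in key order, restart each page from the last key seen, and drop rows that come back with no columns (the "range ghosts" left by deletes). This is not Mark's linked example and not a real client API; `fetch_page` is a hypothetical stand-in for whatever range-slice call your client exposes, and it is assumed to return rows in a stable order with an inclusive start key.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Row = Tuple[str, Dict[str, str]]             # (key, columns); empty columns = ghost row
FetchPage = Callable[[str, int], List[Row]]  # hypothetical: rows with key >= start, in order


def iter_live_rows(fetch_page: FetchPage, page_size: int = 100) -> Iterator[Row]:
    """Yield only rows that still have columns, paging through the whole range."""
    if page_size < 2:
        # The start key is inclusive, so a page of 1 would never advance.
        raise ValueError("page_size must be at least 2")
    start = ""
    first = True
    while True:
        rows = fetch_page(start, page_size)
        # After the first page, the start key repeats the last row already seen.
        fresh = rows if first else [r for r in rows if r[0] != start]
        if not fresh:
            return
        for key, columns in fresh:
            if columns:  # skip tombstoned rows that come back empty
                yield key, columns
        start = fresh[-1][0]
        first = False
        if len(rows) < page_size:
            return  # a short page means the range is exhausted


def make_in_memory_fetch(data: Dict[str, Dict[str, str]]) -> FetchPage:
    """Toy backend for trying the iterator without a cluster (assumption)."""
    keys = sorted(data)

    def fetch(start: str, count: int) -> List[Row]:
        return [(k, data[k]) for k in keys if k >= start][:count]

    return fetch
```

Note that a page may contain fewer live rows than `page_size`, or none at all, so a UI that needs exactly N items per page has to keep fetching until it has collected N live rows.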

Re: replacing columns via remove and insert

2010-05-06 Thread Jonathan Shook
I found the issue. Timestamp ordering was broken because: I generated a timestamp for the group of operations. Then, I used hector's remove, which generates its own internal timestamp. I then re-used the timestamp, not wary of the missing timestamp field on the remove operation. The fix was to
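
To illustrate the ordering problem described above, here is a toy last-write-wins model. It is a sketch only, not Cassandra's or Hector's actual code: `ToyColumn`, its methods, and the tie-break rule (a tombstone wins on equal timestamps) are assumptions made for the illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    value: Optional[str]  # None marks a tombstone (deleted column)
    timestamp: int


class ToyColumn:
    """Toy last-write-wins column; hypothetical, for illustration only."""

    def __init__(self) -> None:
        self.cell: Optional[Cell] = None

    def _apply(self, new: Cell) -> None:
        # Keep whichever write carries the higher timestamp.
        # On a tie the tombstone wins (assumed tie-break for this toy).
        if (self.cell is None
                or new.timestamp > self.cell.timestamp
                or (new.timestamp == self.cell.timestamp and new.value is None)):
            self.cell = new

    def insert(self, value: str, timestamp: int) -> None:
        self._apply(Cell(value, timestamp))

    def remove(self, timestamp: int) -> None:
        self._apply(Cell(None, timestamp))

    def read(self) -> Optional[str]:
        return None if self.cell is None else self.cell.value


# Broken sequence: the remove gets its own, later timestamp (105),
# then the insert re-uses the timestamp generated earlier (100).
broken = ToyColumn()
broken.insert("old", 50)
broken.remove(105)
broken.insert("new", 100)  # older than the tombstone, so it is shadowed

# Fixed sequence: the insert is stamped after the remove.
fixed = ToyColumn()
fixed.insert("old", 50)
fixed.remove(100)
fixed.insert("new", 101)
```

In the broken sequence the column reads back as deleted even though the insert was issued last; in the fixed sequence the new value is visible.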

Re: pagination through slices with deleted keys

2010-05-06 Thread Ian Kallen
Thanks Mark, great illustration. I'm already splitting my time developing directly with hector and a vastly simplified jython wrapper around it; I guess I'll address it at some wrapping layer (patch hector or let the jython layer deal). My grumpy editorial about this stuff is that on the

No column index in Cassandra?

2010-05-06 Thread Weijun Li
Hello, it seems that sstable index file only contains key/position and each sstable doesn't have column index. So how does range slice query work? Does it iterate through every key in the range for column name/value comparison? -Weijun

Re: performance tuning - where does the slowness come from?

2010-05-06 Thread Vick Khera
On Thu, May 6, 2010 at 1:06 PM, Weijun Li weiju...@gmail.com wrote: In this case using mmap will cause Cassandra to use sometimes 100G virtual memory which is much more than the physical ram, since we are using random partitioner the OS will be busy doing swap. mmap uses the virtual address
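
For readers following the mmap discussion in this thread, a small sketch in Python (not Cassandra's Java code) of what mapping a file means: the whole file is mapped into the process's virtual address space, but pages are only read from disk when they are touched. The helper name `read_slice_via_mmap` is made up for this example.

```python
import mmap
import os
import tempfile


def read_slice_via_mmap(path: str, offset: int, length: int) -> bytes:
    """Map the whole file read-only and return one slice of it."""
    with open(path, "rb") as f:
        # Length 0 maps the entire file. This reserves address space
        # (what top reports as VIRT); it does not read the file into RAM.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Only the pages backing this slice need to be resident.
            return m[offset:offset + length]


def demo() -> bytes:
    """Write a 1 MiB file with a marker in the middle and read it back."""
    tmp = tempfile.NamedTemporaryFile(delete=False)
    try:
        tmp.write(b"\0" * (512 * 1024) + b"marker" + b"\0" * (512 * 1024 - 6))
        tmp.close()
        return read_slice_via_mmap(tmp.name, 512 * 1024, 6)
    finally:
        tmp.close()
        os.unlink(tmp.name)
```

This is why a large VIRT figure alone does not imply swapping: resident memory only grows as mapped pages are actually accessed, and the OS is free to evict clean pages again under memory pressure.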

Re: performance tuning - where does the slowness come from?

2010-05-06 Thread Ran Tavory
Jonathan, I think it's the case of large values in the columns. The problematic CF is a key-value store, so it has only one column per row, however the value of that column can be large. It's a java serialized object (uncompressed) which, may be 100s of bytes, maybe even a few megs. This CF also

Re: performance tuning - where does the slowness come from?

2010-05-06 Thread Weijun Li
I just used Linux top to see the amount of virtual memory used by the JVM. When you turn on mmap, this number is equal to the size of your live sstables. And if you turn off mmap the VIRT will be close to the xmx of your jvm. Anyway, for mmap, in order for you to access the data in the buffer or

Deletion batch mutate

2010-05-06 Thread Sonny Heer
The Deletion Class only has a setSuper_column method. Does this work with regular columns as well? if not, how do you add a mutation for column delete?

Re: Deletion batch mutate

2010-05-06 Thread Weijun Li
Mutation.ColumnOrSuperColumn takes either super column or regular column. On Thu, May 6, 2010 at 11:16 AM, Sonny Heer sonnyh...@gmail.com wrote: The Deletion Class only has a setSuper_column method. Does this work with regular columns as well? if not, how do you add a mutation for column

Re: No column index in Cassandra?

2010-05-06 Thread Rob Coli
On 5/6/10 10:35 AM, Weijun Li wrote: Hello, it seems that sstable index file only contains key/position and each sstable doesn't have column index. So how does range slice query work? Does it iterate through every key in the range for column name/value comparison? The column index is in the

Re: performance tuning - where does the slowness come from?

2010-05-06 Thread Kyusik Chung
I'd like to add one caveat to Weijun's statement. I agree with everything, except if your access pattern doesn't look like a random sampling of data across all your sstables. If it turns out that at any given time, you're doing many repeated hits to a smaller subset of keys, then using mmap

Re: Cassandra training on May 21 in Palo Alto

2010-05-06 Thread S Ahmed
Do you have rough ideas when you would be doing the next one? Maybe in 1 or 2 months or much later? On Tue, May 4, 2010 at 8:50 PM, Jonathan Ellis jbel...@gmail.com wrote: Yes, although when and where are TBD. On Tue, May 4, 2010 at 7:38 PM, Mark Greene green...@gmail.com wrote:

Re: Cassandra training on May 21 in Palo Alto

2010-05-06 Thread uncle mantis
Next time you are in Houston, TX, COUNT ME IN! Regards, Michael On Tue, May 4, 2010 at 4:07 PM, Jonathan Ellis jbel...@gmail.com wrote: I'll be running a day-long Cassandra training class on Friday, May 21. I'll cover - Installation and configuration - Application design - Basics of

Re: replacing columns via remove and insert

2010-05-06 Thread Jonathan Ellis
That's kind of an odd API wart for Hector. You should file an issue on http://github.com/rantav/hector On Thu, May 6, 2010 at 11:36 AM, Jonathan Shook jsh...@gmail.com wrote: I found the issue. Timestamp ordering was broken because: I generated a timestamp for the group of operations. Then, I

Re: performance tuning - where does the slowness come from?

2010-05-06 Thread Jonathan Ellis
Yes, that makes sense. If you never have a warm cache then it's probably disk seek time creating that latency, in which case there isn't a whole lot you can do about it short of adding more capacity (so at least it's cached at the OS level). iostat -x could substantiate this guess. On Thu, May

Re: performance tuning - where does the slowness come from?

2010-05-06 Thread B. Todd Burruss
i think you will see a slow down because of large values in your columns. make sure you take a look at MemtableThroughputInMB in your config. if you are writing 1MB of data per row, then you'll probably want to increase this quite a bit so you are not constantly creating sstables. can't

Re: Updating (as opposed to just setting) Cassandra data via Hadoop

2010-05-06 Thread Jonathan Ellis
It sounds reasonable to me, with the caveat that I have only limited Hadoop knowledge. Please write up a blog post when you get it working. :) On Wed, May 5, 2010 at 10:44 PM, Mark Schnitzius mark.schnitz...@cxense.com wrote: Apologies, Hadoop recently deprecated a whole bunch of classes and I

Re: pagination through slices with deleted keys

2010-05-06 Thread Mike Malone
Our solution at SimpleGeo has been to hack Cassandra to (optionally, at least) be sensible and drop Rows that don't have any Columns. The claim from the FAQ that Cassandra would have to check if there are any other columns in the row is inaccurate. The common case for us at least is that we're

Re: Updating (as opposed to just setting) Cassandra data via Hadoop

2010-05-06 Thread Ian Kallen
I have inputs that are text logs and I wrote a Cassandra OutputFormat, the reducers read the old values from their respective column families, increment the counts and write back the new values. Since all of the writes are done by the hadoop jobs and we're not running multiple jobs concurrently,

Re: Updating (as opposed to just setting) Cassandra data via Hadoop

2010-05-06 Thread Stu Hood
Ian: I think that as get_range_slice gets faster, the approach that Mark was heading toward may be considerably more efficient than reading the old value in the OutputFormat. Mark: Reading all of the data you want to update out of Cassandra using the InputFormat, merging it with (tagged) new

Re: Updating (as opposed to just setting) Cassandra data via Hadoop

2010-05-06 Thread Mark Schnitzius
Yes, I think this approach would be more efficient, but Ian's point about failed runs is well taken. It is still a problem with this approach. I may have to introduce a scheme where Hadoop's output is written to a new column family, and then some sort of pointer is updated to point to this

Re: Updating (as opposed to just setting) Cassandra data via Hadoop

2010-05-06 Thread Stu Hood
Agreed. For the massaging, may I recommend using Pig? It is fantastic for unioning and reformatting datasets like these. -Original Message- From: Mark Schnitzius mark.schnitz...@cxense.com Sent: Thursday, May 6, 2010 6:36pm To: user@cassandra.apache.org Subject: Re: Updating (as opposed

Re: Updating (as opposed to just setting) Cassandra data via Hadoop

2010-05-06 Thread Brandon Williams
+1 for pig -Brandon On May 6, 2010 6:49 PM, Stu Hood stu.h...@rackspace.com wrote: Agreed. For the massaging, may I recommend using Pig? It is fantastic for unioning and reformatting datasets like these. -Original Message- From: Mark Schnitzius mark.schnitz...@cxense.com Sent:

Re: Is SuperColumn necessary?

2010-05-06 Thread philip andrew
Please create a new term if the existing terms are misleading; if it's not a file system then it's not good to call it a file system. On Thu, May 6, 2010 at 3:50 PM, Torsten Curdt tcu...@vafer.org wrote: +1 on all of that On Thu, May 6, 2010 at 09:04, David Boxenhorn da...@lookin2.com

Re: Is SuperColumn necessary?

2010-05-06 Thread Vijay
I would rather be interested in Tree type structure where supercolumns have supercolumns in it. you dont need to compare all the columns to find a set of columns and will also reduce the bytes transfered for separator, at least string concatenation (Or something like that) for read and write

Virtualization vs. Cassandra and Hadoop

2010-05-06 Thread Dennis
Please check out this PNG image from attachment or from Google docs: http://docs.google.com/drawings/pub?id=1P3jdSddseG1oSYrtjREWcajizxmxoRIhUHCEw4sDi3kw=771h=624 So, what I want to do is something like a private cloud storage solution. I believe the http servers and application servers should be

Re: Cassandra training on May 21 in Palo Alto

2010-05-06 Thread Jonathan Shook
Dallas On Thu, May 6, 2010 at 4:28 PM, Jonathan Ellis jbel...@gmail.com wrote: We're planning that now.  Where would you like to see one? On Thu, May 6, 2010 at 2:40 PM, S Ahmed sahmed1...@gmail.com wrote: Do you have rough ideas when you would be doing the next one?  Maybe in 1 or 2 months

Re: pagination through slices with deleted keys

2010-05-06 Thread Mike Malone
On Thu, May 6, 2010 at 3:27 PM, Ian Kallen spidaman.l...@gmail.com wrote: Cool, is this a patch you've applied on the server side? Are you running 0.6.x? I'm wondering if this kind of thing can make it into future versions of Cassandra. Yea, server side. It's basically doing the same thing