That would be a good time to get rid of the confusing column term, which
incorrectly suggests a two-dimensional tabular structure.
Suggestions:
1. A hypercube (or hypocube, if only two dimensions): replace key and
column with 1st dimension, 2nd dimension, etc.
2. A file system: replace key and
For more details have a look here :
http://wiki.apache.org/cassandra/Streaming
___
Vineet Daniel
___
Let your email find you
On Wed, May 5, 2010 at 9:34 PM, Weijun Li weiju...@gmail.com wrote:
Thank you Jonathan!
Hi
Just out of curiosity, I want to know why streaming is done with 32 MB chunks
and not with 16 or 64 MB chunks. Are there any specific reasons behind 32 MB, or
is it just arbitrary?
Vineet Daniel
+1 on all of that
On Thu, May 6, 2010 at 09:04, David Boxenhorn da...@lookin2.com wrote:
That would be a good time to get rid of the confusing column term, which
incorrectly suggests a two-dimensional tabular structure.
Suggestions:
1. A hypercube (or hypocube, if only two dimensions):
Hi,
I just discovered that the json file exported by sstable2json contains more
than the data itself, like deletedAt values.
I'm wondering whether there is a tool that can import some initial data.
When we build a typical RDBMS system, this is how we do it:
1) Define the
On Wed, May 5, 2010 at 8:08 PM, Kyusik Chung kyu...@discovereads.com wrote:
if the data from the sstables hasn't already been loaded into memory by mmap,
load it into memory; if you're out of memory on the box, swap some of the
old mmapped data out of memory
mmap() does not copy your data into
The simplest way is to just use thrift batch_mutate.
If Cassandra CPU is your bottleneck then using the binary load method
from StorageProxy can help (see contrib/bmt_example).
If Cassandra disk or network is your bottleneck then binary load
won't really help.
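For reference, the nested shape that batch_mutate takes (a mutation list per column family, grouped under each row key) can be sketched with plain dicts standing in for the generated Thrift objects. Names like `make_insert` and `build_batch` are illustrative helpers, not part of any real API:

```python
# Sketch of the nested structure batch_mutate expects, using plain dicts
# in place of Thrift objects: {row_key: {column_family: [mutation, ...]}}.

def make_insert(column, value, timestamp):
    """Stand-in for a Thrift Mutation wrapping a Column insert."""
    return {"column": column, "value": value, "timestamp": timestamp}

def build_batch(rows, column_family, timestamp):
    """Group one insert per (key, column, value) under its key and CF."""
    batch = {}
    for key, column, value in rows:
        batch.setdefault(key, {}).setdefault(column_family, []).append(
            make_insert(column, value, timestamp)
        )
    return batch

batch = build_batch(
    [("user1", "name", "alice"), ("user1", "email", "a@x.com"),
     ("user2", "name", "bob")],
    "Users", 1273100000,
)
print(len(batch))                    # two row keys
print(len(batch["user1"]["Users"]))  # two mutations for user1
```

The point of the grouping is that one round trip carries all mutations for many keys at once, which is why batch_mutate is usually fast enough before reaching for StorageProxy.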
On Thu, May 6, 2010 at 7:51 AM,
columns, not CFs.
Put another way, how wide are the rows in the slow CF?
On Wed, May 5, 2010 at 11:30 PM, Ran Tavory ran...@gmail.com wrote:
I have a few CFs but the one I'm seeing slowness in, which is the one with
plenty of cache misses has only one column per key.
Latency varies b/w 10m
I read the DistributedDeletes and the range_ghosts FAQ entry on the wiki
which do a good job describing how difficult deletion is in an eventually
consistent system. But practical application strategies for dealing with it
aren't there (that I saw). I'm wondering how folks implement pagination in
Hey Ian,
I actually just wrote a quick example of how to iterate over a CF that may
have tombstones. This may help you out:
http://markjgreene.wordpress.com/2010/05/05/iterate-over-entire-cassandra-column-family/
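The iteration idea from the post above can be sketched as follows: page through the key range and skip rows that come back with no columns (tombstones). Here `fetch_range` is a hypothetical in-memory stand-in for get_range_slices, not a real client call:

```python
# Minimal sketch of paging over a column family whose range scans can
# return tombstoned (empty) rows. fetch_range stands in for
# get_range_slices: it returns up to `count` (key, columns) pairs
# starting at start_key, in key order.

ROWS = [("a", {"c1": 1}), ("b", {}), ("c", {"c1": 3}),  # "b" is a tombstone
        ("d", {}), ("e", {"c1": 5})]

def fetch_range(start_key, count):
    started = [r for r in ROWS if r[0] >= start_key]
    return started[:count]

def iterate_live_rows(page_size=2):
    start, seen = "", set()
    while True:
        page = fetch_range(start, page_size)
        if not page:
            break
        for key, cols in page:
            if key in seen:
                continue
            seen.add(key)
            if cols:               # skip rows with no columns (tombstones)
                yield key, cols
        if len(page) < page_size:
            break
        start = page[-1][0]        # next page starts at the last key seen

print([k for k, _ in iterate_live_rows()])  # ['a', 'c', 'e']
```

The `seen` set handles the overlap caused by restarting each page at the previous page's last key.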
On Thu, May 6, 2010 at 12:17 PM, Ian Kallen spidaman.l...@gmail.com wrote:
I read
I found the issue. Timestamp ordering was broken because:
I generated a timestamp for the group of operations. Then, I used
hector's remove, which generates its own internal timestamp.
I then re-used the timestamp, not wary of the missing timestamp field
on the remove operation.
The fix was to
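The failure mode described above can be simulated with a toy last-write-wins cell: if the library's remove stamps its own, newer timestamp, a later insert that reuses the older batch timestamp loses the comparison and silently disappears. Everything here is an illustrative model, not Cassandra internals:

```python
# Toy last-write-wins cell illustrating the bug: a remove carrying a
# newer (library-generated) timestamp shadows a subsequent insert that
# reuses an older timestamp.

TOMBSTONE = object()

class Cell:
    def __init__(self):
        self.value, self.ts = None, -1
    def apply(self, value, ts):
        if ts > self.ts:            # highest timestamp wins
            self.value, self.ts = value, ts
    def get(self):
        return None if self.value is TOMBSTONE else self.value

cell = Cell()
cell.apply("v1", ts=100)
cell.apply(TOMBSTONE, ts=205)   # remove: library stamped its own, newer time
cell.apply("v2", ts=200)        # insert reusing the older batch timestamp
print(cell.get())               # None: the delete wins, "v2" is lost
```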
Thanks Mark, great illustration. I'm already splitting my time developing
directly with hector and a vastly simplified jython wrapper around it; I
guess I'll address it at some wrapping layer (patch hector or let the jython
layer deal).
My grumpy editorial about this stuff is that on the
Hello, it seems that sstable index file only contains key/position and each
sstable doesn't have column index. So how does range slice query work? Does
it iterate through every key in the range for column name/value comparison?
-Weijun
On Thu, May 6, 2010 at 1:06 PM, Weijun Li weiju...@gmail.com wrote:
In this case using mmap will sometimes cause Cassandra to use 100 GB of virtual
memory, which is much more than the physical RAM; since we are using the random
partitioner, the OS will be busy swapping.
mmap uses the virtual address
Jonathan, I think it's the case of large values in the columns. The
problematic CF is a key-value store, so it has only one column per row,
however the value of that column can be large. It's a java serialized object
(uncompressed) which, may be 100s of bytes, maybe even a few megs. This CF
also
I just used Linux top to see the amount of virtual memory used by the JVM.
When you turn on mmap, this number is equal to the size of your live
sstables, and if you turn off mmap the VIRT will be close to the -Xmx of your
JVM.
Anyway, for mmap, in order for you to access the data in the buffer or
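The VIRT-versus-resident distinction can be demonstrated with the standard mmap module: mapping a file reserves address space immediately (which is what top's VIRT reports), but pages are only faulted in when touched. A minimal, self-contained sketch:

```python
import mmap
import os
import tempfile

# Create a 64 MB file and map it: the mapping counts toward the
# process's virtual size (VIRT) as soon as it exists, but data is only
# read from disk when a page is actually accessed.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.truncate(64 * 1024 * 1024)   # 64 MB of zeros, nothing written
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    size = len(mm)      # the full 64 MB is addressable immediately
    first = mm[0]       # touching a byte faults that page in
    mm.close()

os.unlink(path)
print(size, first)      # 67108864 0
```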
The Deletion Class only has a setSuper_column method. Does this work
with regular columns as well? If not, how do you add a mutation for
column delete?
Mutation.ColumnOrSuperColumn takes either a super column or a regular column.
On Thu, May 6, 2010 at 11:16 AM, Sonny Heer sonnyh...@gmail.com wrote:
The Deletion Class only has a setSuper_column method. Does this work
with regular columns as well? If not, how do you add a mutation for
column
On 5/6/10 10:35 AM, Weijun Li wrote:
Hello, it seems that sstable index file only contains key/position and
each sstable doesn't have column index. So how does range slice query
work? Does it iterate through every key in the range for column
name/value comparison?
The column index is in the
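The mechanics being described can be sketched with a sorted key index: a range slice binary-searches the index for the first key in the range, seeks to that position, and walks keys in order, so it never iterates from the start of the sstable. The index layout here is simplified for illustration:

```python
# Sketch of a range slice over a per-sstable key index: the index maps
# row key -> file position, sorted by key, so the scan can seek to the
# first key in the range instead of scanning every key.

import bisect

# (key, position) pairs, sorted by key, standing in for the index file.
KEY_INDEX = [("apple", 0), ("banana", 4096), ("cherry", 8192),
             ("grape", 12288), ("peach", 16384)]
KEYS = [k for k, _ in KEY_INDEX]

def range_slice(start_key, end_key):
    """Yield (key, position) for keys in [start_key, end_key]."""
    i = bisect.bisect_left(KEYS, start_key)   # seek, don't scan from 0
    while i < len(KEY_INDEX) and KEY_INDEX[i][0] <= end_key:
        yield KEY_INDEX[i]
        i += 1

print(list(range_slice("banana", "grape")))
# [('banana', 4096), ('cherry', 8192), ('grape', 12288)]
```

Column filtering within each row is a separate step applied after seeking to the row's position.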
I'd like to add one caveat to Weijun's statement. I agree with everything,
except if your access pattern doesn't look like a random sampling of data across
all your sstables. If it turns out that at any given time you're doing many
repeated hits to a smaller subset of keys, then using mmap
Do you have rough ideas when you would be doing the next one? Maybe in 1 or
2 months or much later?
On Tue, May 4, 2010 at 8:50 PM, Jonathan Ellis jbel...@gmail.com wrote:
Yes, although when and where are TBD.
On Tue, May 4, 2010 at 7:38 PM, Mark Greene green...@gmail.com wrote:
Next time you are in Houston, TX, COUNT ME IN!
Regards,
Michael
On Tue, May 4, 2010 at 4:07 PM, Jonathan Ellis jbel...@gmail.com wrote:
I'll be running a day-long Cassandra training class on Friday, May 21.
I'll cover
- Installation and configuration
- Application design
- Basics of
That's kind of an odd API wart for Hector. You should file an issue
on http://github.com/rantav/hector
On Thu, May 6, 2010 at 11:36 AM, Jonathan Shook jsh...@gmail.com wrote:
I found the issue. Timestamp ordering was broken because:
I generated a timestamp for the group of operations. Then, I
Yes, that makes sense. If you never have a warm cache then it's
probably disk seek time creating that latency, in which case there
isn't a whole lot you can do about it short of adding more capacity
(so at least it's cached at the OS level).
iostat -x could substantiate this guess.
On Thu, May
I think you will see a slowdown because of large values in your
columns. Make sure you take a look at MemtableThroughputInMB in your
config. If you are writing 1 MB of data per row, then you'll probably
want to increase this quite a bit so you are not constantly creating
sstables. Can't
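For illustration, the 0.6-era storage-conf.xml setting being discussed looks like this; the value 256 is only an example, and should be tuned to your row sizes and heap:

```xml
<!-- Flush a memtable to an sstable once it holds this much data. -->
<MemtableThroughputInMB>256</MemtableThroughputInMB>
```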
It sounds reasonable to me, with the caveat that I have only limited
Hadoop knowledge.
Please write up a blog post when you get it working. :)
On Wed, May 5, 2010 at 10:44 PM, Mark Schnitzius
mark.schnitz...@cxense.com wrote:
Apologies, Hadoop recently deprecated a whole bunch of classes and I
Our solution at SimpleGeo has been to hack Cassandra to (optionally, at
least) be sensible and drop Rows that don't have any Columns. The claim from
the FAQ that Cassandra would have to check if there are any other columns
in the row is inaccurate. The common case for us at least is that we're
I have inputs that are text logs and I wrote a Cassandra OutputFormat, the
reducers read the old values from their respective column families,
increment the counts and write back the new values. Since all of the writes
are done by the hadoop jobs and we're not running multiple jobs
concurrently,
Ian: I think that as get_range_slice gets faster, the approach that Mark was
heading toward may be considerably more efficient than reading the old value in
the OutputFormat.
Mark: Reading all of the data you want to update out of Cassandra using the
InputFormat, merging it with (tagged) new
Yes, I think this approach would be more efficient, but Ian's point about
failed runs is well taken. It is still a problem with this approach. I may
have to introduce a scheme where Hadoop's output is written to a new column
family, and then some sort of pointer is updated to point to this
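The pointer-swap scheme sketched above amounts to double-buffering: each run writes into a fresh column family, and only a successful run flips the live pointer. A toy model with dicts standing in for column families (names hypothetical):

```python
# Sketch of the pointer-update scheme: each Hadoop run writes to a
# fresh "column family" (a dict here), and the live pointer is flipped
# only after the run completes, so failed runs never corrupt live data.

store = {"counts_v1": {"url1": 10}, "_live": "counts_v1"}

def read_live(key):
    return store[store["_live"]].get(key)

def publish_run(new_cf_name, data, succeeded):
    store[new_cf_name] = data
    if succeeded:                  # flip the pointer only on success
        store["_live"] = new_cf_name

publish_run("counts_v2", {"url1": 25}, succeeded=False)
print(read_live("url1"))   # 10: the failed run left the pointer alone
publish_run("counts_v3", {"url1": 25}, succeeded=True)
print(read_live("url1"))   # 25
```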
Agreed.
For the massaging, may I recommend using Pig? It is fantastic for unioning and
reformatting datasets like these.
-Original Message-
From: Mark Schnitzius mark.schnitz...@cxense.com
Sent: Thursday, May 6, 2010 6:36pm
To: user@cassandra.apache.org
Subject: Re: Updating (as opposed
+1 for pig
-Brandon
On May 6, 2010 6:49 PM, Stu Hood stu.h...@rackspace.com wrote:
Agreed.
For the massaging, may I recommend using Pig? It is fantastic for unioning
and reformatting datasets like these.
-Original Message-
From: Mark Schnitzius mark.schnitz...@cxense.com
Sent:
Please create a new term if the existing terms are misleading; if it's
not a file system then it's not good to call it a file system.
On Thu, May 6, 2010 at 3:50 PM, Torsten Curdt tcu...@vafer.org wrote:
+1 on all of that
On Thu, May 6, 2010 at 09:04, David Boxenhorn da...@lookin2.com
I would rather be interested in a tree-type structure where supercolumns can
contain supercolumns. You don't need to compare all the columns to find a
set of columns, and it will also reduce the bytes transferred for separators, or
at least string concatenation (or something like that), for reads and writes.
Please check out this PNG image from the attachment or from Google Docs:
http://docs.google.com/drawings/pub?id=1P3jdSddseG1oSYrtjREWcajizxmxoRIhUHCEw4sDi3kw=771h=624
So, what I want to do is something like a private cloud storage solution. I
believe the http servers and application servers should be
Dallas
On Thu, May 6, 2010 at 4:28 PM, Jonathan Ellis jbel...@gmail.com wrote:
We're planning that now. Where would you like to see one?
On Thu, May 6, 2010 at 2:40 PM, S Ahmed sahmed1...@gmail.com wrote:
Do you have rough ideas when you would be doing the next one? Maybe in 1 or
2 months
On Thu, May 6, 2010 at 3:27 PM, Ian Kallen spidaman.l...@gmail.com wrote:
Cool, is this a patch you've applied on the server side? Are you running
0.6.x? I'm wondering if this kind of thing can make it into future versions
of Cassandra.
Yea, server side. It's basically doing the same thing