strange get_range_slices behaviour v0.6.1

2010-04-25 Thread aaron

I've been looking at the get_range_slices feature and have found some odd
behaviour I do not understand. Basically the keys returned in a range query
do not match what I would expect to see. I think it may have something to
do with the ordering of keys that I don't know about, but I'm just
guessing. 

On Cassandra v 0.6.1, single node local install; RandomPartitioner. Using
Python and my own thin wrapper around the Thrift Python API. 

Step 1. 

Insert 3 keys into the "Standard1" column family, called "object1",
"object2" and "object3", each with a single column called 'name' whose
value matches the key (e.g. 'object1').

Step 2. 

Do a get_range_slices call in the "Standard1" CF, for column names
["name"], with start_key "object1" and end_key "object3". I expect to see
three results, but I only see results for object1 and object3. Below are
the Thrift types I'm passing into the Cassandra.Client object...

- ColumnParent(column_family='Standard1', super_column=None)
- SlicePredicate(column_names=['name'], slice_range=None)
- KeyRange(end_key='object3', start_key='object1', count=4000,
end_token=None, start_token=None)

and the output 

[KeySlice(columns=[ColumnOrSuperColumn(column=Column(timestamp=1272250258810439,
name='name', value='object1'), super_column=None)], key='object1'),
KeySlice(columns=[ColumnOrSuperColumn(column=Column(timestamp=1272250271620362,
name='name', value='object3'), super_column=None)], key='object3')]

Step 3. 

Modify the get_range_slices call so the start_key is "object2". In this case
I expect to see 2 rows returned, but I get 3. The Thrift args and return are
below...

- ColumnParent(column_family='Standard1', super_column=None)
- SlicePredicate(column_names=['name'], slice_range=None)
- KeyRange(end_key='object3', start_key='object2', count=4000,
end_token=None, start_token=None)

and the output 

[KeySlice(columns=[ColumnOrSuperColumn(column=Column(timestamp=1272250265190715,
name='name', value='object2'), super_column=None)], key='object2'),
KeySlice(columns=[ColumnOrSuperColumn(column=Column(timestamp=1272250258810439,
name='name', value='object1'), super_column=None)], key='object1'),
KeySlice(columns=[ColumnOrSuperColumn(column=Column(timestamp=1272250271620362,
name='name', value='object3'), super_column=None)], key='object3')]



Can anyone explain these odd results? As I said, I've got my own Python
wrapper around the client, so I may be doing something wrong. But I've
pulled out the Thrift objects exactly as they go into and come out of the
Thrift Cassandra.Client, so I think I'm OK (I have not noticed any
systematic problem with my wrapper).

On a more general note, is there information on the sort order of keys when
using key ranges? I'm guessing the hashes of the keys are compared, and I'm
wondering whether the hashes of the keys maintain the order of the original
values. Also, I assume the order is byte order rather than ASCII or UTF-8.
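
To illustrate what I mean, here's a quick Python sketch (an assumption on my
part: that RandomPartitioner derives a row's token from the MD5 hash of its
key):

import hashlib

# Rows sort by token (here assumed to be the MD5 of the key), not by the
# key itself, so the iteration order below need not match lexical order.
keys = ['object1', 'object2', 'object3']
tokens = dict((k, int(hashlib.md5(k).hexdigest(), 16)) for k in keys)
for k in sorted(keys, key=tokens.get):
    print k, tokens[k]

If the token order differs from the lexical order of the keys, then a
start_key/end_key range selects a contiguous span of tokens, which would
explain both of the results above.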

I was experimenting with the difference between column slicing and key
slicing. In my case I could also write the keys in as column names (they
are in buckets) and slice there first, then use the results to make a
multi-key get; see the sketch below. I'm trying to support features like
"get me all the data where the key starts with 'foo.bar'".
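
A rough Python sketch of that bucket idea (hedged: the 'KeyIndex' column
family, the bucket row key, and the module path of the generated Thrift
bindings are all assumptions):

from cassandra.ttypes import (ColumnParent, SlicePredicate, SliceRange,
                              ConsistencyLevel)

def keys_with_prefix(client, prefix, count=1000):
    # Slice the bucket row's column names (which are the real row keys),
    # from the prefix up to prefix + '\xff' to approximate a prefix match.
    parent = ColumnParent(column_family='KeyIndex')
    predicate = SlicePredicate(slice_range=SliceRange(
        start=prefix, finish=prefix + '\xff', reversed=False, count=count))
    cols = client.get_slice('Keyspace1', 'bucket1', parent, predicate,
                            ConsistencyLevel.ONE)
    return [c.column.name for c in cols]

# keys_with_prefix(client, 'foo.bar') would then feed a multiget for the rows.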

Thanks for the fun project. 

Aaron



Re: Re: how to store file in the cassandra?

2010-04-25 Thread Bingbing Liu
Thanks,


2010-04-26 



Bingbing Liu 



From: Jonathan Ellis 
Sent: 2010-04-26 09:29:28 
To: user 
Cc: 
Subject: Re: how to store file in the cassandra? 
 
Cassandra stores byte arrays.  You can certainly store file data in
it, although if it is larger than a few MB you should chunk it into
multiple columns.
On Sun, Apr 25, 2010 at 8:21 PM, Shuge Lee  wrote:
> Yes.
>
> Cassandra does save raw string data only, not a file, and shouldn't save a
> file.
>
> 2010/4/26 刘兵兵 
>>
>> sorry i'm not very familiar with python, are you meaning that the files
>> are stored in the file system of the os?
>>
>> then , the cassandra just stores the path to access the files?
>>
>>
>> On Mon, Apr 26, 2010 at 8:57 AM, Shuge Lee  wrote:
>>>
>>> In Python:
>>>
>>> keyspace.columnfamily[key][column] = value
>>>
>>> files.video[uuid.uuid4()]['name'] = 'foo.flv'
>>> files.video[uuid.uuid4()]['path'] = '/var/files/foo.flv'
>>>
>>> create a mapping
>>> files.video = {
>>>     uuid.uuid4() : {
>>>         'name' : 'foo.flv',
>>>         'path' : '/var/files/foo.flv',
>>>     }
>>> }
>>>
>>> if most of sizes >= 0.5MB, use sys-fs/reiser4progs, else use ext4.
>>>
>>>
>>> 2010/4/26 Bingbing Liu 

 any suggestion?

 2010-04-26
 
 Bingbing Liu
>>>
>>>
>>> --
>>> Shuge Lee | Lee Li | 李蠡
>>
>>
>>
>> --
>> Bingbing Liu
>>
>> Web and Mobile Data Management lab
>>
>> Renmin University  of  China
>
>
>
> --
> Shuge Lee | Lee Li | 李蠡
>

Re: how to store file in the cassandra?

2010-04-25 Thread Jonathan Ellis
Cassandra stores byte arrays.  You can certainly store file data in
it, although if it is larger than a few MB you should chunk it into
multiple columns.
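
For example, a minimal Python sketch of that chunking approach (hedged:
insert() below stands in for whatever single-column write your client
exposes, and the 1 MB chunk size is just one reading of "a few MB"):

CHUNK_SIZE = 1024 * 1024  # 1 MB per column

def store_file(insert, row_key, path):
    # Split the file across columns chunk-000000, chunk-000001, ...
    # zero-padded so the column names sort back into file order.
    f = open(path, 'rb')
    try:
        i = 0
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            insert(row_key, 'chunk-%06d' % i, data)
            i += 1
    finally:
        f.close()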

On Sun, Apr 25, 2010 at 8:21 PM, Shuge Lee  wrote:
> Yes.
>
> Cassandra does save raw string data only, not a file, and shouldn't save a
> file.
>
> 2010/4/26 刘兵兵 
>>
>> sorry i'm not very familiar with python, are you meaning that the files
>> are stored in the file system of the os?
>>
>> then , the cassandra just stores the path to access the files?
>>
>>
>> On Mon, Apr 26, 2010 at 8:57 AM, Shuge Lee  wrote:
>>>
>>> In Python:
>>>
>>> keyspace.columnfamily[key][column] = value
>>>
>>> files.video[uuid.uuid4()]['name'] = 'foo.flv'
>>> files.video[uuid.uuid4()]['path'] = '/var/files/foo.flv'
>>>
>>> create a mapping
>>> files.video = {
>>>     uuid.uuid4() : {
>>>         'name' : 'foo.flv',
>>>         'path' : '/var/files/foo.flv',
>>>     }
>>> }
>>>
>>> if most of sizes >= 0.5MB, use sys-fs/reiser4progs, else use ext4.
>>>
>>>
>>> 2010/4/26 Bingbing Liu 

 any suggestion?

 2010-04-26
 
 Bingbing Liu
>>>
>>>
>>> --
>>> Shuge Lee | Lee Li | 李蠡
>>
>>
>>
>> --
>> Bingbing Liu
>>
>> Web and Mobile Data Management lab
>>
>> Renmin University  of  China
>
>
>
> --
> Shuge Lee | Lee Li | 李蠡
>


Re: how to store file in the cassandra?

2010-04-25 Thread Shuge Lee
Yes.

Cassandra saves only raw string data, not files, and shouldn't be used to
save a file.

2010/4/26 刘兵兵 

> sorry i'm not very familiar with python, are you meaning that the files are
> stored in the file system of the os?
>
> then , the cassandra just stores the path to access the files?
>
>
>
> On Mon, Apr 26, 2010 at 8:57 AM, Shuge Lee  wrote:
>
>> In Python:
>>
>> keyspace.columnfamily[key][column] = value
>>
>> files.video[uuid.uuid4()]['name'] = 'foo.flv'
>> files.video[uuid.uuid4()]['path'] = '/var/files/foo.flv'
>>
>> create a mapping
>> files.video = {
>>     uuid.uuid4() : {
>>         'name' : 'foo.flv',
>>         'path' : '/var/files/foo.flv',
>>     }
>> }
>>
>> if most of sizes >= 0.5MB, use sys-fs/reiser4progs, else use ext4.
>>
>>
>> 2010/4/26 Bingbing Liu 
>>
>>  any suggestion?
>>>
>>> 2010-04-26
>>> --
>>> Bingbing Liu
>>>
>>
>>
>>
>> --
>> Shuge Lee | Lee Li | 李蠡
>>
>
>
>
> --
> Bingbing Liu
>
> Web and Mobile Data Management lab
>
> Renmin University  of  China
>



-- 
Shuge Lee | Lee Li | 李蠡


when i use the OrderPreservingPartition, the load is very imbalance

2010-04-25 Thread 刘兵兵
I did some inserts. Because I will do some scan operations, I used the
OrderPreservingPartitioner.

The state of the cluster is shown below.

As I predicted, the load is very imbalanced, and some of the nodes went
down (on some nodes the Cassandra process died, and on others the process
is alive but the node still shows as down).

So I have two questions:

1) How do I rebalance the cluster after the inserts end?

2) Why did the nodes die, and how do I bring them up again (in the case
where the process is alive but the node's state is down)?

thx

10.37.17.241  Up  47.65 GB   0p6ovvUXMJ4cdd1L  |<--|
10.37.17.234  Up  67.41 GB   5OxiS2DKBZLeISPg  |   ^
10.37.17.235  Up  67.54 GB   7UDcS0SToePuQACe  v   |
10.37.17.246  Up  555 bytes  OCvC3nqKLeKA5n0I  |   ^
10.37.17.233  Up  830 bytes  SJp6cQRNox52av2Y  v   |
10.37.17.249  Up  830 bytes  SxVmCVcruOpoS48B  |   ^
10.37.17.247  Up  555 bytes  TGctCMvfNuRo7RjS  v   |
10.37.17.245  Up  555 bytes  j2smY0OOtQ0SeeHY  |   ^
10.37.17.250  Up  830 bytes  jNwBPchW58i5tGxp  v   |
10.37.17.248  Up  830 bytes  jYWaJC93OyMdWDaN  |   ^
10.37.17.237  Up  830 bytes  mPwhLOsKlbPart6j  v   |
10.37.17.236  Up  830 bytes  noh0t8HJgw4hmz7I  |   ^
10.37.17.244  Up  555 bytes  q8c8SPYEkWEzmFcR  v   |
10.37.17.238  Up  555 bytes  rIuuq3AR4DVK989X  |   ^
10.37.17.242  Up  555 bytes  smebTmIvQBMG56Zf  v   |
10.37.17.243  Up  555 bytes  tWTYyiqAKQVw7197  |   ^
10.37.17.232  Up  830 bytes  uVdBQkR9Dszm5deK  v   |
10.37.17.239  Up  555 bytes  xXQkDQn1vvg8e1xS  |   ^
10.37.17.240  Up  555 bytes  yQRrq9RG2dUsHUyR  |-->|


-- 
Bingbing Liu

Web and Mobile Data Management lab

Renmin University  of  China


Re: how to store file in the cassandra?

2010-04-25 Thread 刘兵兵
Sorry, I'm not very familiar with Python. Do you mean that the files are
stored in the file system of the OS?

Then Cassandra just stores the path used to access the files?


On Mon, Apr 26, 2010 at 8:57 AM, Shuge Lee  wrote:

> In Python:
>
> keyspace.columnfamily[key][column] = value
>
> files.video[uuid.uuid4()]['name'] = 'foo.flv'
> files.video[uuid.uuid4()]['path'] = '/var/files/foo.flv'
>
> create a mapping
> files.video = {
>     uuid.uuid4() : {
>         'name' : 'foo.flv',
>         'path' : '/var/files/foo.flv',
>     }
> }
>
> if most of sizes >= 0.5MB, use sys-fs/reiser4progs, else use ext4.
>
>
> 2010/4/26 Bingbing Liu 
>
>  any suggestion?
>>
>> 2010-04-26
>> --
>> Bingbing Liu
>>
>
>
>
> --
> Shuge Lee | Lee Li | 李蠡
>



-- 
Bingbing Liu

Web and Mobile Data Management lab

Renmin University  of  China


Re: how to store file in the cassandra?

2010-04-25 Thread Shuge Lee
In Python:

keyspace.columnfamily[key][column] = value

files.video[uuid.uuid4()]['name'] = 'foo.flv'
files.video[uuid.uuid4()]['path'] = '/var/files/foo.flv'

create a mapping
files.video = {
    uuid.uuid4() : {
        'name' : 'foo.flv',
        'path' : '/var/files/foo.flv',
    }
}

If most file sizes are >= 0.5 MB, use sys-fs/reiser4progs; otherwise use ext4.
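
Concretely, with the raw Thrift API the pseudocode above would look roughly
like this (a hedged sketch: the 'files'/'video' keyspace and column family
names and the module path of the generated bindings are illustrative):

import time
import uuid
from cassandra.ttypes import ColumnPath, ConsistencyLevel

def insert(client, key, column, value):
    # One column write: files.video[key][column] = value
    path = ColumnPath(column_family='video', super_column=None, column=column)
    client.insert('files', key, path, value,
                  long(time.time() * 1000000), ConsistencyLevel.ONE)

# assuming `client` is an open Cassandra.Client
key = str(uuid.uuid4())
insert(client, key, 'name', 'foo.flv')
insert(client, key, 'path', '/var/files/foo.flv')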


2010/4/26 Bingbing Liu 

>  any suggestion?
>
> 2010-04-26
> --
> Bingbing Liu
>



-- 
Shuge Lee | Lee Li | 李蠡


how to store file in the cassandra?

2010-04-25 Thread Bingbing Liu
any suggestion?

2010-04-26 



Bingbing Liu 


Re: Question about TimeUUIDType

2010-04-25 Thread Jonathan Ellis
On Sun, Apr 25, 2010 at 5:40 PM, Tatu Saloranta  wrote:
>> Now with TimeUUIDType, if two UUIDs have the same timestamp, they are ordered
>> by byte order.
>
> Naively, for the whole UUID? That would not be good, given that the
> timestamp within a UUID is not stored in the expected lexical order, but
> in a sort of little-endian mess (the first bytes are the least-significant
> bytes of the timestamp).

I think the code here is clearer than explaining in English. :)

comparing timeuuids o1 and o2:

long t1 = LexicalUUIDType.getUUID(o1).timestamp();
long t2 = LexicalUUIDType.getUUID(o2).timestamp();
return t1 < t2 ? -1 : (t1 > t2 ? 1 : FBUtilities.compareByteArrays(o1, o2));
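
For readers following along in Python, a rough equivalent (a sketch only;
uuid.UUID(...).time reassembles the 60-bit timestamp, so the byte layout
only matters for breaking ties):

import uuid

def compare_timeuuid(b1, b2):
    # b1, b2: raw 16-byte UUID column names
    t1 = uuid.UUID(bytes=b1).time
    t2 = uuid.UUID(bytes=b2).time
    if t1 != t2:
        return -1 if t1 < t2 else 1
    return cmp(b1, b2)  # byte order is only a tie-breaker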

-Jonathan


Re: Question about TimeUUIDType

2010-04-25 Thread Tatu Saloranta
On Sat, Apr 24, 2010 at 2:08 AM, Sylvain Lebresne  wrote:
> On Sat, Apr 24, 2010 at 12:53 AM, Jesse McConnell
>  wrote:
>> try LexicalUUIDType, that will distinguish the secs correctly
>>
>> imo based on the existing impl (last I checked at least) TimeUUIDType
>> was equivalent to LongType
>
> It used to be true that the TimeUUIDType comparator compared only the
> timestamps of the UUIDs. But this was more a bug than anything else, since
> it made different UUIDs collide, and it was fixed for 0.6
> (https://issues.apache.org/jira/browse/CASSANDRA-907).
>
> Now with TimeUUIDType, if two UUIDs have the same timestamp, they are ordered
> by byte order.

Naively, for the whole UUID? That would not be good, given that the
timestamp within a UUID is not stored in the expected lexical order, but
in a sort of little-endian mess (the first bytes are the least-significant
bytes of the timestamp).

-+ Tatu +-


Re: Lucandra - Lucene/Solr on Cassandra: April 26, NYC

2010-04-25 Thread Utku Can Topçu
Can you please post the talk somewhere after it has been given?

Best Regards,
Utku

On Thu, Apr 22, 2010 at 6:51 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Hello folks,
>
> Those of you in or near NYC and using Lucene or Solr should come to
> "Lucandra - a Cassandra-based backend for Lucene and Solr" on April 26th:
>
> http://www.meetup.com/NYC-Search-and-Discovery/calendar/12979971/
>
> The presenter will be Lucandra's author, Jake Luciani.
>
> Please spread the word.
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>


range get over subcolumns on supercolumn family

2010-04-25 Thread Rafael Ribeiro
Hi all!

 I am trying to do a paginated query on the subcolumns of a supercolumn
family, but honestly I am a little bit confused.
 I have already been able to do a range query, but only over the keys of a
regular column family.
 For the keys case, I've been able to do so using the code below:

KeyRange keyRange = new KeyRange(count);
keyRange.setStart_key(startKey);
keyRange.setEnd_key(""); // empty end_key means "to the end of the ring"

SliceRange range = new SliceRange();
range.setStart(new byte[] {});  // empty start/finish selects all columns
range.setFinish(new byte[] {});

SlicePredicate predicate = new SlicePredicate();
predicate.setSlice_range(range);

ColumnParent cp = new ColumnParent("ColumnFamily");

List<KeySlice> keySlices = client.get_range_slices("Keyspace",
        cp, predicate, keyRange, ConsistencyLevel.ALL);

 Is there any way I can take a similar approach to do a range query on the
subcolumns? Would I need to do some trick with ColumnParent? I tried setting
the super_column attribute, but with no success (honestly, I suspected it
wouldn't work, but it was worth trying). Just to clarify a little bit: I am
still exercising what is possible to do with Cassandra, and I was planning
to store a key in a supercolumn family with UUID subcolumns under it, so I
could scan it using an ordering scheme without loading the whole data under
the top-level key. A Python sketch of the kind of paging I mean is below.
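
To be concrete (a hedged sketch only: ColumnParent pointed at the
supercolumn plus a SliceRange over subcolumn names; the 'Keyspace1'/'Super1'
names and the module path of the generated Thrift bindings are illustrative):

from cassandra.ttypes import (ColumnParent, SlicePredicate, SliceRange,
                              ConsistencyLevel)

def page_subcolumns(client, key, super_column, page_size=100):
    # Page through the subcolumns of one supercolumn, page_size at a time.
    # Note: `start` is inclusive, so each page after the first repeats the
    # previous page's last column; a real implementation should skip it.
    parent = ColumnParent(column_family='Super1', super_column=super_column)
    start = ''
    while True:
        predicate = SlicePredicate(slice_range=SliceRange(
            start=start, finish='', reversed=False, count=page_size))
        cols = client.get_slice('Keyspace1', key, parent, predicate,
                                ConsistencyLevel.ONE)
        for cosc in cols:
            print cosc.column.name
        if len(cols) < page_size:
            break
        start = cols[-1].column.name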

best regards,
Rafael Ribeiro


Re: value size, is there a suggested limit?

2010-04-25 Thread Mark Greene
http://wiki.apache.org/cassandra/CassandraLimitations

On Sun, Apr 25, 2010 at 4:19 PM, S Ahmed  wrote:

> Is there a suggested maximum size for the value of a given key?
>
> e.g. could I convert a document to bytes and store it as the value of a
> key? If yes (which I presume so), what if the file is 10 MB? Or 100 MB?
>


value size, is there a suggested limit?

2010-04-25 Thread S Ahmed
Is there a suggested maximum size for the value of a given key?

e.g. could I convert a document to bytes and store it as the value of a key?
If yes (which I presume so), what if the file is 10 MB? Or 100 MB?


RE: newbie question on how columns names are i ndexed/lucene limitations?

2010-04-25 Thread Stu Hood
The indexes within rows are _not_ implemented with Lucene: there is a custom 
index structure that allows for random access within a row. But, you should 
probably read http://wiki.apache.org/cassandra/CassandraLimitations to 
understand the current limitations of the file format, some of which are 
scheduled to be fixed soon.

-Original Message-
From: "TuX RaceR" 
Sent: Sunday, April 25, 2010 11:54am
To: user@cassandra.apache.org
Subject: newbie question on how columns names are indexed/lucene limitations?

Hello Cassandra Users,

When using the RandomPartitioner and a simple ColumnFamily/Columns (i.e.
no SuperColumns), my understanding is that one single row can store
millions of columns.

If I look at http://wiki.apache.org/cassandra/API, I understand that
I can get a subset of the millions of columns defined above using:
SlicePredicate->ColumnNames or SlicePredicate->SliceRange

My question is about the implementation of this column 'selection'.
I vaguely remember reading somewhere (but I cannot find the link again)
that this was implemented using a Lucene index over the column names for
each row.
Is that true? Is there a small Lucene index per row?

Also, we know that Lucene has some limitations
(http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations): you
cannot index more than 2.1 billion documents, as a document ID is mapped
to a 32-bit int.

As I plan to store the IDs of my Cassandra documents in column names (the
global number of documents can go well beyond 2.1 billion), will I be hit
by the Lucene limitations? I.e., can I store Cassandra document IDs (i.e.
keys) in column names if each individual row holds no more than a few
million of those IDs? I guess the answer is "yes I can", because Lucandra
uses a similar schema, but it is not clear to me why. Is that because the
Lucene index is made per row, so what really matters is the number of
columns in one single row and not the number of distinct column names
(globally over all the rows)?


Thanks in advance
TuX




Re: Cassandra - Thread leak when high concurrent load

2010-04-25 Thread Brandon Williams
On Sun, Apr 25, 2010 at 12:09 PM, JKnight JKnight wrote:

> Thanks Robson,
>
> The number of threads gradually increases to 7000, and then the server
> hangs. I know a thread pool is used to prevent creating a large number of
> threads.
>
> So why does Cassandra create a large number of threads under high
> concurrent load?


The only thing I can surmise from this is that you are creating nearly 7000
client connections to Cassandra.  It is thread-per-connection, and a normal
number of threads is in the low hundreds.

-Brandon


How do you construct an index and use it, especially in Ruby

2010-04-25 Thread Bob Hutchison

Hi,

I'm new to Cassandra and trying to work out how to do something that I've 
implemented any number of times before (e.g. with TokyoCabinet, Perst, even the 
filesystem using grep :-). I've managed to get some of this working in Cassandra 
but not all.

So here's the core of the situation.

I have this opaque chunk of data that I want to store in Cassandra and then 
find it again.

I can generate a key when the data is created very easily, and I've stored the 
data in a straightforward manner: under that key, in a column whose value is the 
data. And I can retrieve it when I know the key. No difficulties here at all; it 
works fine.

Now I want to index this data taking what I imagine to be a pretty typical 
approach.

Lets say there's two many-to-one indexes: 'colour', and 'size'. Each colour 
value will have more than one chunk of data, same for size.

What I thought I'd do is make a super column and index the chunk of data kind 
of like: { 'colour' => { 'blue' => 1 }, 'size' => { 'large' => 1}} with the key 
equal to the key of the chunk of data. And Cassandra stores it without error 
like that. So using the Ruby gem, it'd be something along the lines of:

  cassandra.insert(:Indexes, key-of-the-chunk-of-data, { 'colour' => { 'blue' 
=> 1 }, 'size' => { 'large' => 1 } })

Q1: Is this a reasonable approach? It *seems* to match what I've read about what 
is supposed to be done. The 1 is meaningless. Anyway, it executes without error 
in Ruby.

Q2: what is the syntax of the (Ruby) query to find the keys of all 'blue' 
chunks of data? I'm assuming get_range is the correct method, but what are the 
parameters? The docs say: get_range(column_family, options={}) but that seems 
to be missing a bit of detail, in particular the super column name.

Q2a: So I know there's a :start and :finish key supported in the options hash, 
inclusive and exclusive respectively. How do you define a range for equality with 
a UTF8 key? Surely not 'blue'.succ?? Or by some kind of suffix??

Q2b: How do you specify the super column name 'colour'? Looking at the (Ruby) 
source of the get_range method, I'm unconvinced that this is implemented (there 
seems to be a constant '' used where the super column name would make sense).

Anyway, I ended up hacking at the Ruby gem's source to use the column name where 
the '' was in the original, and didn't really get anywhere useful (I can find 
nothing, or everything, and nothing in between).

Q3: If I am correct about what is supposed to be done, does the Ruby gem 
support it?

Q4: Does anyone know of some Ruby code that does an indexed lookup that they 
could point me at? (Lots of code that indexes, but nothing that searches by the 
index.)

I'll try to take a look at some of the other Cassandra client implementations 
and see if I can get this model to work. Maybe it's just a Ruby problem? With any 
luck, it'll be me messing up.

If it'd help I can post the source of what I have, but it'll need some cleanup. 
Let me know.

Thanks for taking the time to read this far :-)

Bob


Bob Hutchison
Recursive Design Inc.
http://www.recursive.ca/
weblog: http://xampl.com/so







Re: Cassandra - Thread leak when high concurrent load

2010-04-25 Thread JKnight JKnight
Thanks Robson,

The number of threads gradually increases to 7000, and then the server hangs.
I know a thread pool is used to prevent creating a large number of threads.

So why does Cassandra create a large number of threads under high concurrent
load?

On Sun, Apr 25, 2010 at 5:38 PM, Mark Robson  wrote:

>
>
> On 25 April 2010 10:48, JKnight JKnight  wrote:
>
>> Dear all,
>>
>> My Cassandra server had a thread leak under high concurrent load. I used
>> jconsole and saw many, many threads appear.
>>
>
> Just because there are a lot of threads, need not imply a thread leak.
> Cassandra uses a lot of threads.
>
> Do you see the number of threads gradually increase during a soak test on
> your test cluster? Can you dump the JVM info (I believe sending a signal
> makes it dump this)?
>
> The JMX information viewable via JConsole also gives you a list of threads.
> Assuming you see this gradually increasing during the soak test, can you
> tell which thread pool is increasing?
>
> Mark
>



-- 
Best regards,
JKnight


newbie question on how columns names are indexed/lucene limitations?

2010-04-25 Thread TuX RaceR

Hello Cassandra Users,

When using the RandomPartitioner and a simple ColumnFamily/Columns (i.e.
no SuperColumns), my understanding is that one single row can store
millions of columns.


If I look at http://wiki.apache.org/cassandra/API, I understand that
I can get a subset of the millions of columns defined above using:

SlicePredicate->ColumnNames or SlicePredicate->SliceRange

My question is about the implementation of this column 'selection'.
I vaguely remember reading somewhere (but I cannot find the link again)
that this was implemented using a Lucene index over the column names for
each row.

Is that true? Is there a small Lucene index per row?

Also, we know that Lucene has some limitations
(http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations): you
cannot index more than 2.1 billion documents, as a document ID is mapped
to a 32-bit int.


As I plan to store the IDs of my Cassandra documents in column names (the
global number of documents can go well beyond 2.1 billion), will I be hit
by the Lucene limitations? I.e., can I store Cassandra document IDs (i.e.
keys) in column names if each individual row holds no more than a few
million of those IDs? I guess the answer is "yes I can", because Lucandra
uses a similar schema, but it is not clear to me why. Is that because the
Lucene index is made per row, so what really matters is the number of
columns in one single row and not the number of distinct column names
(globally over all the rows)?



Thanks in advance
TuX


Re: The Difference Between Cassandra and HBase

2010-04-25 Thread Joseph Stein
It is kind of the classic distinction between OLTP & OLAP.

Cassandra is to OLTP as HBase is to OLAP (for those SAT nutz).

Both are useful and valuable in their own right, agreed.

On Sun, Apr 25, 2010 at 12:20 PM, Jeff Hodges  wrote:
> HBase is awesome when you need high throughput and don't care so much
> about latency. Cassandra is generally the opposite. They are
> wonderfully complementary.
> --
> Jeff
>
> On Sun, Apr 25, 2010 at 8:19 AM, Lenin Gali  wrote:
>> I second Joe.
>>
>> Lenin
>> Sent from my BlackBerry® wireless handheld
>>
>> -Original Message-
>> From: Joe Stump 
>> Date: Sun, 25 Apr 2010 13:04:50
>> To: 
>> Subject: Re: The Difference Between Cassandra and HBase
>>
>>
>> On Apr 25, 2010, at 11:40 AM, Mark Robson wrote:
>>
>>> For me an important difference is that Cassandra is operationally much more 
>>> straightforward - there is only one type of node, and it is fully redundant 
>>> (depending what consistency level you're using).
>>>
>>> This seems to be an advantage in Cassandra vs most other distributed 
>>> storage systems, which almost all seem to require some "master" nodes which 
>>> have different operational requirements (e.g. cannot fail, need to be 
>>> failed over manually or have another HA solution installed for them)
>>
>> These two remain the #1 and #2 reasons I recommend Cassandra over HBase. At 
>> the end of the day, Cassandra is an *absolute* dream to manage across 
>> multiple data centers. I could go on and on about the voodoo that is 
>> expanding, contracting, and rebalancing a Cassandra cluster. It's pretty 
>> awesome.
>>
>> That being said, we're getting ready to spin up an HBase cluster. If you're 
>> wanting increment/decrement, more complex range scans, etc. then HBase is a 
>> great candidate. Especially if you don't need it to span multiple data 
>> centers. We're using Cassandra for our main things, and then HBase+Hive for 
>> analytics.
>>
>> There's room for both. Especially if you're using Hadoop with Cassandra.
>>
>> --Joe
>>
>>
>



-- 
/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/


Re: The Difference Between Cassandra and HBase

2010-04-25 Thread Joe Stump

On Apr 25, 2010, at 5:18 PM, Eric Hauser wrote:

> Out of curiosity, are you planning on copying the data you store in 
> HBase/Hive into separate Hadoop cluster in a different data center or backing 
> up HDFS in some other manner?  Redundancy isn't an issue within the cluster; 
> it's more a concern of storing all your HDFS data in one physical location.

We'll eventually move to this. For the near term, we'll be routing HBase traffic 
to a single data center.

--Joe



Re: The Difference Between Cassandra and HBase

2010-04-25 Thread Jeff Hodges
HBase is awesome when you need high throughput and don't care so much
about latency. Cassandra is generally the opposite. They are
wonderfully complementary.
--
Jeff

On Sun, Apr 25, 2010 at 8:19 AM, Lenin Gali  wrote:
> I second Joe.
>
> Lenin
> Sent from my BlackBerry® wireless handheld
>
> -Original Message-
> From: Joe Stump 
> Date: Sun, 25 Apr 2010 13:04:50
> To: 
> Subject: Re: The Difference Between Cassandra and HBase
>
>
> On Apr 25, 2010, at 11:40 AM, Mark Robson wrote:
>
>> For me an important difference is that Cassandra is operationally much more 
>> straightforward - there is only one type of node, and it is fully redundant 
>> (depending what consistency level you're using).
>>
>> This seems to be an advantage in Cassandra vs most other distributed storage 
>> systems, which almost all seem to require some "master" nodes which have 
>> different operational requirements (e.g. cannot fail, need to be failed over 
>> manually or have another HA solution installed for them)
>
> These two remain the #1 and #2 reasons I recommend Cassandra over HBase. At 
> the end of the day, Cassandra is an *absolute* dream to manage across 
> multiple data centers. I could go on and on about the voodoo that is 
> expanding, contracting, and rebalancing a Cassandra cluster. It's pretty 
> awesome.
>
> That being said, we're getting ready to spin up an HBase cluster. If you're 
> wanting increment/decrement, more complex range scans, etc. then HBase is a 
> great candidate. Especially if you don't need it to span multiple data 
> centers. We're using Cassandra for our main things, and then HBase+Hive for 
> analytics.
>
> There's room for both. Especially if you're using Hadoop with Cassandra.
>
> --Joe
>
>


Re: The Difference Between Cassandra and HBase

2010-04-25 Thread Eric Hauser
Out of curiosity, are you planning on copying the data you store in
HBase/Hive into a separate Hadoop cluster in a different data center, or
backing up HDFS in some other manner?  Redundancy isn't an issue within the
cluster; it's more a concern of storing all your HDFS data in one physical
location.


On Sun, Apr 25, 2010 at 8:04 AM, Joe Stump  wrote:

>
> On Apr 25, 2010, at 11:40 AM, Mark Robson wrote:
>
> > For me an important difference is that Cassandra is operationally much
> more straightforward - there is only one type of node, and it is fully
> redundant (depending what consistency level you're using).
> >
> > This seems to be an advantage in Cassandra vs most other distributed
> storage systems, which almost all seem to require some "master" nodes which
> have different operational requirements (e.g. cannot fail, need to be failed
> over manually or have another HA solution installed for them)
>
> These two remain the #1 and #2 reasons I recommend Cassandra over HBase. At
> the end of the day, Cassandra is an *absolute* dream to manage across
> multiple data centers. I could go on and on about the voodoo that is
> expanding, contracting, and rebalancing a Cassandra cluster. It's pretty
> awesome.
>
> That being said, we're getting ready to spin up an HBase cluster. If you're
> wanting increment/decrement, more complex range scans, etc. then HBase is a
> great candidate. Especially if you don't need it to span multiple data
> centers. We're using Cassandra for our main things, and then HBase+Hive for
> analytics.
>
> There's room for both. Especially if you're using Hadoop with Cassandra.
>
> --Joe
>
>


Re: The Difference Between Cassandra and HBase

2010-04-25 Thread Lenin Gali
I second Joe.

Lenin
Sent from my BlackBerry® wireless handheld

-Original Message-
From: Joe Stump 
Date: Sun, 25 Apr 2010 13:04:50 
To: 
Subject: Re: The Difference Between Cassandra and HBase


On Apr 25, 2010, at 11:40 AM, Mark Robson wrote:

> For me an important difference is that Cassandra is operationally much more 
> straightforward - there is only one type of node, and it is fully redundant 
> (depending what consistency level you're using).
> 
> This seems to be an advantage in Cassandra vs most other distributed storage 
> systems, which almost all seem to require some "master" nodes which have 
> different operational requirements (e.g. cannot fail, need to be failed over 
> manually or have another HA solution installed for them)

These two remain the #1 and #2 reasons I recommend Cassandra over HBase. At the 
end of the day, Cassandra is an *absolute* dream to manage across multiple data 
centers. I could go on and on about the voodoo that is expanding, contracting, 
and rebalancing a Cassandra cluster. It's pretty awesome.

That being said, we're getting ready to spin up an HBase cluster. If you're 
wanting increment/decrement, more complex range scans, etc. then HBase is a 
great candidate. Especially if you don't need it to span multiple data centers. 
We're using Cassandra for our main things, and then HBase+Hive for analytics. 

There's room for both. Especially if you're using Hadoop with Cassandra. 

--Joe



Re: Cassandra-cli tutorials

2010-04-25 Thread Roger Schildmeijer

On 25 Apr 2010, at 15:15, S Ahmed wrote:

> Ok excited I got it up and running on windows 7, yah!
> 
> Curious, are there any tutorials or examples of using the cassandra-cli?

http://wiki.apache.org/cassandra/CassandraCli
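
For instance (0.6-era syntax, going from memory of that wiki page, so treat
this as a sketch; the keyspace/column family/key names are illustrative):

cassandra> connect localhost/9160
cassandra> set Keyspace1.Standard1['jsmith']['first'] = 'John'
cassandra> get Keyspace1.Standard1['jsmith']['first']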

> 
> BTW, the cassandra-cli is pretty cool, even comes with tab-complete. Is that 
> an OS thing, or did someone code that feature up? I'm going to dig into the 
> code for this... thanks!



Cassandra-cli tutorials

2010-04-25 Thread S Ahmed
OK, excited: I got it up and running on Windows 7, yay!

Curious: are there any tutorials or examples of using the cassandra-cli?

BTW, the cassandra-cli is pretty cool; it even comes with tab-completion. Is
that an OS thing, or did someone code that feature up? I'm going to dig into
the code for this... thanks!


Re: getting cassandra setup on windows 7

2010-04-25 Thread S Ahmed
great that worked thanks!

On Fri, Apr 23, 2010 at 2:28 PM, Mark Greene  wrote:

> Try the cassandra-with-fixes.bat file attached to the issue. I had the same
> issue, and that bat file got Cassandra to start. It still throws another
> error complaining about the log4j.properties.
>
>
> On Fri, Apr 23, 2010 at 1:59 PM, S Ahmed  wrote:
>
>> Any insights?
>>
>> Much appreciated!
>>
>>
>> On Thu, Apr 22, 2010 at 11:13 PM, S Ahmed  wrote:
>>
>>> I was just reading that thanks.
>>>
>>> What does he mean when he says:
>>>
>>> "This appears to be related to data storage paths I set, because if I
>>> switch the paths back to the default UNIX paths. Everything runs fine"
>>>
>>>
>>> On Thu, Apr 22, 2010 at 11:07 PM, Jonathan Ellis wrote:
>>>
 https://issues.apache.org/jira/browse/CASSANDRA-948

 On Thu, Apr 22, 2010 at 10:03 PM, S Ahmed  wrote:
 > Ok so I found the config section:
 >
 > <CommitLogDirectory>E:\java\cassandra\apache-cassandra-0.6.1-bin\apache-cassandra-0.6.1\commitlog</CommitLogDirectory>
 >
 > <DataFileDirectories>
 >     <DataFileDirectory>E:\java\cassandra\apache-cassandra-0.6.1-bin\apache-cassandra-0.6.1\data</DataFileDirectory>
 > </DataFileDirectories>
 >
 > Now when I run:
 > bin/cassandra
 > I get:
 > Starting cassandra server
 > Listening for transport dt_socket at address:
 > Exception in thread "main" java.lang.NoClassDefFoundError:
 > org/apache/cassandra/thrift/CassandraDaemon
 > Could not find the main class:
 > org.apache.cassandra.thrift.CassandraDaemon...
 >
 >
 >
 >
 >
 > On Thu, Apr 22, 2010 at 10:53 PM, S Ahmed 
 wrote:
 >>
 >> So I uncompressed the .tar, in the readme it says:
 >> * tar -zxvf cassandra-$VERSION.tgz
 >>   * cd cassandra-$VERSION
 >>   * sudo mkdir -p /var/log/cassandra
 >>   * sudo chown -R `whoami` /var/log/cassandra
 >>   * sudo mkdir -p /var/lib/cassandra
 >>   * sudo chown -R `whoami` /var/lib/cassandra
 >>
 >> My cassandra is at:
 >> c:\java\cassandra\apache-cassandra-0.6.1/
 >> So I have to create 2 folders log and lib?
 >> Is there a setting in a config file that I edit?
 >

>>>
>>>
>>
>


Re: The Difference Between Cassandra and HBase

2010-04-25 Thread Joe Stump

On Apr 25, 2010, at 11:40 AM, Mark Robson wrote:

> For me an important difference is that Cassandra is operationally much more 
> straightforward - there is only one type of node, and it is fully redundant 
> (depending what consistency level you're using).
> 
> This seems to be an advantage in Cassandra vs most other distributed storage 
> systems, which almost all seem to require some "master" nodes which have 
> different operational requirements (e.g. cannot fail, need to be failed over 
> manually or have another HA solution installed for them)

These two remain the #1 and #2 reasons I recommend Cassandra over HBase. At the 
end of the day, Cassandra is an *absolute* dream to manage across multiple data 
centers. I could go on and on about the voodoo that is expanding, contracting, 
and rebalancing a Cassandra cluster. It's pretty awesome.

That being said, we're getting ready to spin up an HBase cluster. If you're 
wanting increment/decrement, more complex range scans, etc. then HBase is a 
great candidate. Especially if you don't need it to span multiple data centers. 
We're using Cassandra for our main things, and then HBase+Hive for analytics. 

There's room for both. Especially if you're using Hadoop with Cassandra. 

--Joe



Re: tcp CLOSE_WAIT bug

2010-04-25 Thread yangfeng
I encountered the same problem! Hope to get some help. Thanks.

2010/4/22 Ingram Chen 

> arh! That's right.
>
> I checked OutboundTcpConnection, and it only does closeSocket() after
> something goes wrong. I will log more in OutboundTcpConnection to see what
> actually happens.
>
> Thanks for your help.
>
>
>
>
> On Thu, Apr 22, 2010 at 10:03, Jonathan Ellis  wrote:
>
>> But those connections aren't supposed to ever terminate unless a node
>> dies or is partitioned.  So if we "fix" it by adding a socket.close I
>> worry that we're covering up something more important.
>>
>> On Wed, Apr 21, 2010 at 8:53 PM, Ingram Chen 
>> wrote:
>> > I agree with your point. I patched the code and logged more information
>> > to find out the real cause.
>> >
>> > Here is the code snippet I think may be the cause:
>> >
>> > IncomingTcpConnection:
>> >
>> > public void run()
>> > {
>> >     while (true)
>> >     {
>> >         try
>> >         {
>> >             MessagingService.validateMagic(input.readInt());
>> >             int header = input.readInt();
>> >             int type = MessagingService.getBits(header, 1, 2);
>> >             boolean isStream = MessagingService.getBits(header, 3, 1) == 1;
>> >             int version = MessagingService.getBits(header, 15, 8);
>> >
>> >             if (isStream)
>> >             {
>> >                 new IncomingStreamReader(socket.getChannel()).read();
>> >             }
>> >             else
>> >             {
>> >                 int size = input.readInt();
>> >                 byte[] contentBytes = new byte[size];
>> >                 input.readFully(contentBytes);
>> >                 MessagingService.getDeserializationExecutor().submit(
>> >                     new MessageDeserializationTask(
>> >                         new ByteArrayInputStream(contentBytes)));
>> >             }
>> >         }
>> >         catch (EOFException e)
>> >         {
>> >             if (logger.isTraceEnabled())
>> >                 logger.trace("eof reading from socket; closing", e);
>> >             break;
>> >         }
>> >         catch (IOException e)
>> >         {
>> >             if (logger.isDebugEnabled())
>> >                 logger.debug("error reading from socket; closing", e);
>> >             break;
>> >         }
>> >     }
>> > }
>> >
>> > Under normal conditions, the while loop terminates after input.readInt()
>> > throws EOFException, but it quits without socket.close(). What I did is
>> > wrap the whole while block inside a try { ... } finally { socket.close(); }
>> >
>> >
>> > On Thu, Apr 22, 2010 at 01:14, Jonathan Ellis 
>> wrote:
>> >>
>> >> I'd like to get something besides "I'm seeing close wait but i have no
>> >> idea why" for a bug report, since most people aren't seeing that.
>> >>
>> >> On Tue, Apr 20, 2010 at 9:33 AM, Ingram Chen 
>> wrote:
>> >> > I traced the IncomingStreamReader source and found that the incoming
>> >> > socket comes from MessagingService$SocketThread, but there is no
>> >> > close() call on either the accepted socket or the socketChannel.
>> >> >
>> >> > Should I file a bug report ?
>> >> >
>> >> > On Tue, Apr 20, 2010 at 11:02, Ingram Chen 
>> wrote:
>> >> >>
>> >> >> This happened after several hours of operation, and both nodes were
>> >> >> started at the same time (a clean start without any data), so it
>> >> >> might not be related to Bootstrap.
>> >> >>
>> >> >> In system.log I do not see any logs like "xxx node dead" or
>> >> >> exceptions, and both nodes in the test are alive. They serve
>> >> >> reads/writes well, too. The four connections below between the nodes
>> >> >> stay healthy over time.
>> >> >>
>> >> >> tcp  0  0  :::192.168.2.87:7000   :::192.168.2.88:58447  ESTABLISHED
>> >> >> tcp  0  0  :::192.168.2.87:54986  :::192.168.2.88:7000   ESTABLISHED
>> >> >> tcp  0  0  :::192.168.2.87:59138  :::192.168.2.88:7000   ESTABLISHED
>> >> >> tcp  0  0  :::192.168.2.87:7000   :::192.168.2.88:39074  ESTABLISHED
>> >> >>
>> >> >> So the connections ending in CLOSE_WAIT should be newly created (for
>> >> >> streaming?). This seems related to the streaming issues we suffered
>> >> >> recently:
>> >> >> http://n2.nabble.com/busy-thread-on-IncomingStreamReader-td4908640.html
>> >> >>
>> >> >> I would like to add some debug code around the opening and closing of
>> >> >> the socket to find out what happened.
>> >> >>
>> >> >> Could you give me a hint about which classes I should take a look at?
>> >> >>
>> >> >>
>> >> >> On Tue, Apr 20, 2010 at 04:47, Jonathan Ellis 
>> >> >> wrote:
>> >> >>>
>> >> >>> Is this after doing a bootstrap or other streaming operation? Or did
>> >> >>> a node go down?
>> >> >>>
>> >> >>> The internal sockets are supposed to remain open, otherwise.
>> >> >>>
>> >> >>> On Mon, Apr 19, 2010 at 10:56 AM, Ingram Chen <
>> ingramc...@gmail.com>
>> >

Re: The Difference Between Cassandra and HBase

2010-04-25 Thread Mark Robson
For me an important difference is that Cassandra is operationally much more
straightforward - there is only one type of node, and it is fully redundant
(depending what consistency level you're using).

This seems to be an advantage in Cassandra vs most other distributed storage
systems, which almost all seem to require some "master" nodes which have
different operational requirements (e.g. cannot fail, need to be failed over
manually or have another HA solution installed for them)

Mark


Re: Cassandra - Thread leak when high concurrent load

2010-04-25 Thread Mark Robson
On 25 April 2010 10:48, JKnight JKnight  wrote:

> Dear all,
>
> My Cassandra server had a thread leak under high concurrent load. I used
> jconsole and saw many, many threads appear.
>

Just because there are a lot of threads, need not imply a thread leak.
Cassandra uses a lot of threads.

Do you see the number of threads gradually increase during a soak test on
your test cluster? Can you dump the JVM info (I believe sending a signal
makes it dump this)?

The JMX information viewable via JConsole also gives you a list of threads.
Assuming you see this gradually increasing during the soak test, can you
tell which thread pool is increasing?

Mark


Cassandra - Thread leak when high concurrent load

2010-04-25 Thread JKnight JKnight
Dear all,

My Cassandra server had a thread leak under high concurrent load. I used
JConsole and saw many, many threads appear.

I know Cassandra uses TThreadPoolServer for handling requests, and
DebuggableThreadPoolExecutor for handling commands (reads/writes).

I want to know the reason for the thread leak problem.

Can anybody help me?

Thanks a lot for your support.
-- 
Best regards,
JKnight