Re: Data Model Question

2012-01-21 Thread Edward Capriolo
On Sat, Jan 21, 2012 at 7:49 PM, Jean-Nicolas Boulay Desjardins <
jnbdzjn...@gmail.com> wrote:

> Milind Parikh, Rainbird is backed by Twitter... My worry is that you
> might not be around in the future... Also, do you have evidence that
> your system is better? Because Rainbird is used by Twitter.
>
> On Sat, Jan 21, 2012 at 6:55 PM, Milind Parikh 
> wrote:
> >
> > I used rainbird as inspiration for Countandra (& some of publicly
> available data structures from rainbird preso). That said, there are
> significant differences between the two architectures. Additionally, as
> Cassandra begins to provide triggers, some very interesting things will
> become possible in Countandra.
> > Hth
> > Regards
> > Milind
> >
> > /***
> > sent from my android...please pardon occasional typos as I respond @ the
> speed of thought
> > /
> >
> > On Jan 21, 2012 3:37 PM, "Jean-Nicolas Boulay Desjardins" <
> jnbdzjn...@gmail.com> wrote:
> >
> > But What about: Rainbird?
> >
> >
> > On Sat, Jan 21, 2012 at 10:52 AM, R. Verlangen  wrote:
> > >
> > > A couple of days ago I cam...
>

Rainbird has never been open sourced. So even if Rainbird were better or
worse, it would not matter, because no one outside of Twitter can use it.

Other options (i.e., do it yourself with):
http://docs.s4.io/manual/joining_streams.html
http://peregrine_mapreduce.bitbucket.org/
http://www.slideshare.net/nathanmarz/storm-distributed-and-faulttolerant-realtime-computation

I have not tried Countandra, but if you want proof of its fitness you
should download it and try it out.


Re: Data Model Question

2012-01-21 Thread Tamar Fraenkel
Hi
It may be my lack of knowledge, but both have to do with counting, which is not 
what I need. 
What is wrong with the two models I suggested?
Tamar

Sent from my iPod

On Jan 22, 2012, at 2:49 AM, Jean-Nicolas Boulay Desjardins 
 wrote:

> Milind Parikh, Rainbird is backed by Twitter... My worry is that you
> might not be around in the future... Also, do you have evidence that
> your system is better? Because Rainbird is used by Twitter.
> 
> On Sat, Jan 21, 2012 at 6:55 PM, Milind Parikh  wrote:
>> 
>> I used rainbird as inspiration for Countandra (& some of publicly available 
>> data structures from rainbird preso). That said, there are significant 
>> differences between the two architectures. Additionally, as Cassandra begins 
>> to provide triggers, some very interesting things will become possible in 
>> Countandra.
>> Hth
>> Regards
>> Milind
>> 
>> /***
>> sent from my android...please pardon occasional typos as I respond @ the 
>> speed of thought
>> /
>> 
>> On Jan 21, 2012 3:37 PM, "Jean-Nicolas Boulay Desjardins" 
>>  wrote:
>> 
>> But What about: Rainbird?
>> 
>> 
>> On Sat, Jan 21, 2012 at 10:52 AM, R. Verlangen  wrote:
>>> 
>>> A couple of days ago I cam...


Re: Data Model Question

2012-01-21 Thread Jean-Nicolas Boulay Desjardins
Milind Parikh, Rainbird is backed by Twitter... My worry is that you
might not be around in the future... Also, do you have evidence that
your system is better? Because Rainbird is used by Twitter.

On Sat, Jan 21, 2012 at 6:55 PM, Milind Parikh  wrote:
>
> I used rainbird as inspiration for Countandra (& some of publicly available 
> data structures from rainbird preso). That said, there are significant 
> differences between the two architectures. Additionally, as Cassandra begins 
> to provide triggers, some very interesting things will become possible in 
> Countandra.
> Hth
> Regards
> Milind
>
> /***
> sent from my android...please pardon occasional typos as I respond @ the 
> speed of thought
> /
>
> On Jan 21, 2012 3:37 PM, "Jean-Nicolas Boulay Desjardins" 
>  wrote:
>
> But What about: Rainbird?
>
>
> On Sat, Jan 21, 2012 at 10:52 AM, R. Verlangen  wrote:
> >
> > A couple of days ago I cam...


Re: ideal cluster size

2012-01-21 Thread Thorsten von Eicken
Good point. One thing I'm wondering about cassandra is what happens when
there is a massive failure. For example, if 1/3 of the nodes go down or
become unreachable. This could happen in EC2 if an AZ has a failure, or
in a datacenter if a whole rack or UPS goes dark. I'm not so concerned
about the time when the nodes are down. If I understand replication,
consistency, the ring, and such, I can architect things so that what must
continue running does continue.

What I'm concerned about is when these nodes all come back up or
reconnect. I have a hard time figuring out what exactly happens other
than the fact that hinted handoffs get processed. Are the restarted
nodes handling reads during that time? If so, they could serve up
massive amounts of stale data, no? Do they then all start a repair, or
is this something that needs to be run manually? If many do a repair at
the same time, do I effectively end up with a down cluster due to the
repair load? If no node was lost, is a repair required or are the hinted
handoffs sufficient?

Is there a manual or wiki section that discusses some of this and I just
missed it?

On 1/21/2012 2:25 PM, Peter Schuller wrote:
>> Thanks for the responses! We'll definitely go for powerful servers to
>> reduce the total count. Beyond a dozen servers there really doesn't seem
>> to be much point in trying to increase count anymore for
> Just be aware that if "big" servers imply *lots* of data (especially
> in relation to memory size), it's not necessarily the best trade-off.
> Consider the time it takes to do repairs, streaming, node start-up,
> etc.
>
> If it's only about CPU resources then bigger nodes probably make more
> sense if the h/w is cost effective.
>


Re: Data Model Question

2012-01-21 Thread Milind Parikh
I used rainbird as inspiration for Countandra (& some of publicly available
data structures from rainbird preso). That said, there are significant
differences between the two architectures. Additionally, as Cassandra begins
to provide triggers, some very interesting things will become possible in
Countandra.
Hth
Regards
Milind

/***
sent from my android...please pardon occasional typos as I respond @ the
speed of thought
/

On Jan 21, 2012 3:37 PM, "Jean-Nicolas Boulay Desjardins" <
jnbdzjn...@gmail.com> wrote:

But What about: Rainbird?


On Sat, Jan 21, 2012 at 10:52 AM, R. Verlangen  wrote:
>
> A couple of days ago I cam...


Re: Data Model Question

2012-01-21 Thread Jean-Nicolas Boulay Desjardins
But What about: Rainbird?

On Sat, Jan 21, 2012 at 10:52 AM, R. Verlangen  wrote:
>
> A couple of days ago I came across Countandra ( http://countandra.org/ ). It 
> seems that it might be a solution for you.
>
> Gr. Robin
>
>
> 2012/1/20 Tamar Fraenkel 
>>
>> Hi!
>>
>> I am a newbie to Cassandra and seeking some advice regarding the data model 
>> I should use to best address my needs.
>>
>> For simplicity, what I want to accomplish is:
>>
>> I have a system that has users (potentially ~10,000 per day) and they 
>> perform actions in the system (total of ~50,000 a day).
>>
>> Each User’s action takes place at a certain point in time, and is also 
>> classified into categories (1 to 5) and tagged by 1-30 tags. Each action’s 
>> Categories and Tags have a score associated with them; the score is between 0 
>> and 1 (let’s assume precision of 0.0001).
>>
>> I want to be able to identify similar actions in the system (performed 
>> usually by more than one user). Similarity of actions is calculated based on 
>> their common Categories and Tags taking scores into account.
>>
>> I need the system to store:
>>
>> The list of my users with attributes like name, age etc
>> For each action – the categories and tags associated with it and their 
>> score, the time of the action, and the user who performed it.
>> Groups of similar actions (ActionGroups) – the id’s of actions in the group, 
>> the categories and tags describing the group, with their scores. Those are 
>> calculated using an algorithm that takes into account the categories and 
>> tags of the actions in the group.
>>
>> When a user performs a new action in the system, I want to add it to a 
>> fitting ActionGroup (with similar categories and tags).
>>
>> For this I need to be able to perform the following:
>>
>> Find all the recent ActionGroups (those that were updated with actions 
>> performed during the last T minutes) that have at least one of the new 
>> action’s categories AND at least one of the new action’s tags.
>>
>>
>>
>> I thought of two ways to address the issue and I would appreciate your 
>> insights.
>>
>>
>>
>> First one using secondary indexes
>>
>> Column Family: Users
>>
>> Key: userId
>>
>> Compare with Bytes Type
>>
>> Columns: name: <>, age: <> etc…
>>
>>
>>
>> Column Family: Actions
>>
>> Key: actionId
>>
>> Compare with Bytes Type
>>
>> Columns:  Category1 :  ….
>>
>>           CategoriN: ,
>>
>>           Tag1 : , ….
>>
>>           TagK:
>>
>>           Time: timestamp
>>
>>           user: userId
>>
>>
>>
>> Column Family: ActionGroups
>>
>> Key: actionGroupId
>>
>> Compare with Bytes Type
>>
>> Columns: Category1 :  ….
>>
>>          CategoriN: ,
>>
>>          Tag1 :  ….
>>
>>          TagK:
>>
>>          lastUpdateTime: timestamp
>>
>>          actionId1: null, … ,
>>
>>          actionIdM: null
>>
>>
>>
>> I will then define a secondary index on each tag column, category column, 
>> and the update time column.
>>
>> Let’s assume the new action I want to add to ActionGroup has 
>> NewActionCategory1 - NewActionCategoryK, and has NewActionTag1 – 
>> NewActionTagN. I will perform the following query:
>>
>> Select  * From ActionGroups where
>>
>>    (NewActionCategory1 > 0  … or NewActionCategoryK > 0) and
>>
>>    (NewActionTag1 > 0  … or NewActionTagN > 0) and
>>
>>    lastUpdateTime > T;
>>
>>
>>
>> Second solution
>>
>> Have the same CF as in the first solution without the secondary index , and 
>> have two additional CF-ies:
>>
>> Column Family: CategoriesToActionGroupId
>>
>> Key: categoryId
>>
>> Compare with ByteType
>>
>> Columns: {Timestamp, ActionGroupsId1 } : null
>>
>>          {Timestamp, ActionGroupsId2} : null
>>
>>          ...
>>
>> *timestamp is the update time for the ActionGroup
>>
>>
>>
>> A similar CF will be defined for tags.
>>
>>
>>
>> I will then be able to run several queries on CategoriesToActionGroupId (one 
>> for each of the new story Categories), with column slice for the right 
>> update time of the ActionGroup.
>>
>> I will do the same for the TagsToActionGroupId.
>>
>> I will then use my client code to remove duplicates (ActionGroups that are 
>> associated with more than one Tag or Category).
>>
>>
>>
>> My questions are:
>>
>> Are the two solutions viable? If yes, which is better?
>> Is there any better way of doing this?
>> Can I use JDBC and CQL with both methods, or do I have to use Hector (I am 
>> using Java)?
>>
>> Thanks
>>
>> Tamar
>>
>>
>>
>>
>
>


Re: ideal cluster size

2012-01-21 Thread Peter Schuller
> Thanks for the responses! We'll definitely go for powerful servers to
> reduce the total count. Beyond a dozen servers there really doesn't seem
> to be much point in trying to increase count anymore for

Just be aware that if "big" servers imply *lots* of data (especially
in relation to memory size), it's not necessarily the best trade-off.
Consider the time it takes to do repairs, streaming, node start-up,
etc.

If it's only about CPU resources then bigger nodes probably make more
sense if the h/w is cost effective.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: ideal cluster size

2012-01-21 Thread Thorsten von Eicken
Thanks for the responses! We'll definitely go for powerful servers to
reduce the total count. Beyond a dozen servers there really doesn't seem
to be much point in trying to increase count anymore for
replication/redundancy. I'm assuming we will use leveled compaction, which
means that we'll most likely run out of CPU before we run out of I/O. At
least that has been my experience so far. I'm glad to hear that 100+
nodes isn't that unusual anymore in the cassandra world.

On 1/21/2012 3:38 AM, Eric Czech wrote:
> I'd also add that one of the biggest complications to arise from
> having multiple clusters is that read biased client applications would
> need to be aware of all clusters and either aggregate result sets or
> involve logic to choose the right cluster based on a particular query.
>
> And from a more operational perspective, I think you'd have a tough
> time finding monitoring applications (like Opscenter) that would support
> multiple clusters within the same viewport.  Having used multiple
> clusters in the past, I can definitely tell you that from an
> administrative, operational, and development standpoint, one cluster
> is almost definitely better than many. 
>
> Oh and I'm positive that there are other cassandra deployments out
> there with well beyond 100 nodes, so I don't think you're really
> treading on dangerous ground here.
>
> I'd definitely say that you should try to use a single cluster if
> possible.
>
> On Fri, Jan 20, 2012 at 9:34 PM, Maxim Potekhin wrote:
>
> You can also scale not "horizontally" but "diagonally",
> i.e. raid SSDs and have multicore CPUs. This means that
> you'll have same performance with less nodes, making
> it far easier to manage.
>
> SSDs by themselves will give you an order of magnitude
> improvement on I/O.
>
>
>
> On 1/19/2012 9:17 PM, Thorsten von Eicken wrote:
>
> We're embarking on a project where we estimate we will need on
> the order
> of 100 cassandra nodes. The data set is perfectly
> partitionable, meaning
> we have no queries that need to have access to all the data at
> once. We
> expect to run with RF=2 or =3. Is there some notion of ideal
> cluster
> size? Or perhaps asked differently, would it be easier to run
> one large
> cluster or would it be easier to run a bunch of, say, 16 node
> clusters?
> Everything we've done to date has fit into 4-5 node clusters.
>
>
>


Re: Unbalanced cluster with RandomPartitioner

2012-01-21 Thread Marcel Steinbach
I thought about our issue again, and I was thinking: maybe describeOwnership 
should take into account whether a token is outside the partitioner's maximum token 
range?

To recap our problem: we had tokens that were 12.5% of the token range 2**127 
apart; however, we had an offset on each token, which moved the 
cluster's token range above 2**127. That resulted in two nodes getting few or 
no primary replicas. 

Afaik, the partitioner itself describes the key ownership in the ring, but it 
didn't take into account that we left its maximum key range. 

Of course, it is silly and not very likely that users make that mistake; 
however, we did it, and it took me quite some time to figure that out (maybe 
also because it wasn't me who set up the cluster). 

To carry it to the extreme, you could construct a cluster of n nodes with all 
tokens greater than 2**127; the ownership description would show an ownership of 
1/n each, but all data would go to the node with the lowest token (given RP and 
RF=1).

I think it is wrong to calculate the ownership by subtracting the previous 
token from the current token and dividing it by the maximum token without 
acknowledging that we might already be "out of bounds". 
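
To make the effect concrete, here is a rough sketch of that naive calculation
in Python (the eight-node ring and the offset below are made up for
illustration):

    # Naive ownership, as described above: difference to the previous token
    # divided by the full RandomPartitioner range, with no bounds check.
    RANGE = 2 ** 127

    def naive_ownership(tokens):
        tokens = sorted(tokens)
        return [((t - tokens[i - 1]) % RANGE) / float(RANGE)
                for i, t in enumerate(tokens)]

    # Eight tokens spaced 12.5% apart, but shifted entirely above 2**127:
    tokens = [2 ** 127 + i * RANGE // 8 for i in range(8)]
    print(naive_ownership(tokens))   # reports 0.125 for every node...

    # ...yet md5-based keys always map into [0, 2**127), i.e. below every
    # token, so with RP and RF=1 all rows land on the node with the lowest token.

A bounds check, or taking each token mod 2**127 before the calculation, would
make the reported ownership match where the data actually goes.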

Cheers 
Marcel

On 20.01.2012, at 16:28, Marcel Steinbach wrote:

> Thanks for all the responses!
> 
> I found our problem:
> Using the Random Partitioner, the key range is from 0..2**127. When we added 
> nodes, we generated the keys and out of convenience, we added an offset to 
> the tokens because the move was easier like that.
> 
> However, we did not execute the modulo 2**127 for the last two tokens, so 
> they were outside the RP's key range. 
> Moving the last two tokens to their values mod 2**127 will resolve the problem.
> 
> Cheers,
> Marcel
> 
> On 20.01.2012, at 10:32, Marcel Steinbach wrote:
> 
>> On 19.01.2012, at 20:15, Narendra Sharma wrote:
>>> I believe you need to move the nodes on the ring. What was the load on the 
>>> nodes before you added 5 new nodes? Its just that you are getting data in 
>>> certain token range more than others.
>> With three nodes, it was also imbalanced. 
>> 
>> What I don't understand is why the md5 sums would generate such massive hot 
>> spots. 
>> 
>> Most of our keys look like that: 
>> 00013270494972450001234567
>> with the first 16 digits being a timestamp of one of our application 
>> server's startup times, and the last 10 digits being sequentially generated 
>> per user. 
>> 
>> There may be a lot of keys that start with e.g. "0001327049497245" (or some 
>> other timestamp). But I was under the impression that md5 doesn't care about 
>> that and generates a uniform distribution?
>> But then again, I know next to nothing about md5. Maybe someone else has a 
>> better insight to the algorithm?
>> 
>> However, we also use CFs with a date ("mmdd") as key, as well as CFs 
>> with uuids as keys. And those CFs in themselves are not balanced either. E.g. 
>> node 5 has 12 GB live space used in the CF with the uuid as key, and node 8 only 
>> 428MB. 
>> 
>> Cheers,
>> Marcel
>> 
>>> 
>>> On Thu, Jan 19, 2012 at 3:22 AM, Marcel Steinbach 
>>>  wrote:
>>> On 18.01.2012, at 02:19, Maki Watanabe wrote:
 Are there any significant difference of number of sstables on each nodes?
>>> No, no significant difference there. Actually, node 8 is among those with 
>>> more sstables but with the least load (20GB)
>>> 
>>> On 17.01.2012, at 20:14, Jeremiah Jordan wrote:
 Are you deleting data or using TTL's?  Expired/deleted data won't go away 
 until the sstable holding it is compacted.  So if compaction has happened 
 on some nodes, but not on others, you will see this.  The disparity is 
 pretty big 400Gb to 20GB, so this probably isn't the issue, but with our 
 data using TTL's if I run major compactions a couple times on that column 
 family it can shrink ~30%-40%.
>>> Yes, we do delete data. But I agree, the disparity is too big to blame only 
>>> the deletions. 
>>> 
>>> Also, initially, we started out with 3 nodes and upgraded to 8 a few weeks 
>>> ago. After adding the nodes, we did
>>> compactions and cleanups and didn't have a balanced cluster. So that should 
>>> have removed outdated data, right?
>>> 
 2012/1/18 Marcel Steinbach :
> We are running regular repairs, so I don't think that's the problem.
> And the data dir sizes match approx. the load from the nodetool.
> Thanks for the advice, though.
> 
> Our keys are digits only, and all contain a few zeros at the same
> offsets. I'm not that familiar with the md5 algorithm, but I doubt that it
> would generate 'hotspots' for those kinds of keys, right?
> 
> On 17.01.2012, at 17:34, Mohit Anchlia wrote:
> 
> Have you tried running repair first on each node? Also, verify using
> df -h on the data dirs
> 
> On Tue, Jan 17, 2012 at 7:34 AM, Marcel Steinbach
>  wrote:
> 
> Hi,
> 
> 
> we're using RP and have each node assigned th

Re: Data Model Question

2012-01-21 Thread R. Verlangen
A couple of days ago I came across Countandra ( http://countandra.org/ ).
It seems that it might be a solution for you.

Gr. Robin

2012/1/20 Tamar Fraenkel 

>
>   Hi!
>
> I am a newbie to Cassandra and seeking some advice regarding the data
> model I should use to best address my needs.
>
> For simplicity, what I want to accomplish is:
>
> I have a system that has users (potentially ~10,000 per day) and they
> perform actions in the system (total of ~50,000 a day).
>
> Each User’s action takes place at a certain point in time, and is also
> classified into categories (1 to 5) and tagged by 1-30 tags. Each action’s
> Categories and Tags have a score associated with them; the score is between 0
> and 1 (let’s assume precision of 0.0001).
>
> I want to be able to identify similar actions in the system (performed
> usually by more than one user). Similarity of actions is calculated based
> on their common Categories and Tags taking scores into account.
>
> I need the system to store:
>
>- The list of my users with attributes like name, age etc
>- For each action – the categories and tags associated with it and
>their score, the time of the action, and the user who performed it.
>- Groups of similar actions (ActionGroups) – the id’s of actions in
>the group, the categories and tags describing the group, with their scores.
>Those are calculated using an algorithm that takes into account the
>categories and tags of the actions in the group.
>
> When a user performs a new action in the system, I want to add it to a
> fitting ActionGroup (with similar categories and tags).
>
> For this I need to be able to perform the following:
>
> Find all the recent ActionGroups (those that were updated with actions
> performed during the last T minutes) that have at least one of the new
> action’s categories AND at least one of the new action’s tags.
>
>
>
> I thought of two ways to address the issue and I would appreciate your
> insights.
>
>
>
> First one using secondary indexes
>
> Column Family: *Users*
>
> Key: userId
>
> Compare with Bytes Type
>
> Columns: name: <>, age: <> etc…
>
>
>
> Column Family: *Actions*
>
> Key: actionId
>
> Compare with Bytes Type
>
> Columns:  Category1 :  ….
>
>   CategoriN: ,
>
>   Tag1 : , ….
>
>   TagK:
>
>   Time: timestamp
>
>   user: userId
>
>
>
> Column Family: *ActionGroups*
>
> Key: actionGroupId
>
> Compare with Bytes Type
>
> Columns: Category1 :  ….
>
>  CategoriN: ,
>
>  Tag1 :  ….
>
>  TagK:
>
>  lastUpdateTime: timestamp
>
>  actionId1: null, … ,
>
>  actionIdM: null
>
>
>
> I will then define a secondary index on each tag column, category column,
> and the update time column.
>
> Let’s assume the new action I want to add to ActionGroup has
> NewActionCategory1 - NewActionCategoryK, and has NewActionTag1 –
> NewActionTagN. I will perform the following query:
>
> Select  * From ActionGroups where
>
>(NewActionCategory1 > 0  … or NewActionCategoryK > 0) and
>
>(NewActionTag1 > 0  … or NewActionTagN > 0) and
>
>lastUpdateTime > T;
>
>
>
> Second solution
>
> Have the same CF as in the first solution *without the secondary* *index*, 
> and have two additional CF-ies:
>
> Column Family: *CategoriesToActionGroupId*
>
> Key: categoryId
>
> Compare with ByteType
>
> Columns: {Timestamp, ActionGroupsId1 } : null
>
>  {Timestamp, ActionGroupsId2} : null
>
>  ...
>
> *timestamp is the update time for the ActionGroup
>
>
>
> A similar CF will be defined for tags.
>
>
>
> I will then be able to run several queries on CategoriesToActionGroupId
> (one for each of the new story Categories), with column slice for the right
> update time of the ActionGroup.
>
> I will do the same for the TagsToActionGroupId.
>
> I will then use my client code to remove duplicates (ActionGroups that are
> associated with more than one Tag or Category).
>
>
>
> My questions are:
>
>1. Are the two solutions viable? If yes, which is better?
>2. Is there any better way of doing this?
>3. Can I use JDBC and CQL with both methods, or do I have to use Hector
>(I am using Java)?
>
> Thanks
>
> Tamar
>
>
>
>
>
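For what it's worth, a minimal sketch of the read path in the second model
described above (the two lookup CFs plus client-side deduplication). It assumes
the Python pycassa client, a made-up keyspace name, and - purely to keep the
sketch short - encodes the {Timestamp, ActionGroupsId} column name as a single
zero-padded string instead of a composite comparator:

    import time
    from pycassa import NotFoundException
    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('MyKeyspace', ['localhost:9160'])   # keyspace name is made up
    cats = ColumnFamily(pool, 'CategoriesToActionGroupId')
    tags = ColumnFamily(pool, 'TagsToActionGroupId')

    def col_name(ts_millis, action_group_id):
        # zero-padded timestamp prefix keeps the columns time-ordered
        return '%020d:%s' % (ts_millis, action_group_id)

    def recent_groups(cf, row_key, minutes):
        start = col_name(int((time.time() - minutes * 60) * 1000), '')
        try:
            cols = cf.get(row_key, column_start=start, column_count=10000)
        except NotFoundException:
            return set()
        return set(name.split(':', 1)[1] for name in cols)

    def candidate_groups(new_categories, new_tags, minutes):
        # ActionGroups updated in the last T minutes that share at least one
        # category AND at least one tag with the new action
        by_cat = set().union(*[recent_groups(cats, c, minutes) for c in new_categories])
        by_tag = set().union(*[recent_groups(tags, t, minutes) for t in new_tags])
        return by_cat & by_tag

The writes are the mirror image: whenever an ActionGroup is updated, insert
col_name(now, group_id) into the row of every category and tag it carries.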


Re: Cassandra to Oracle?

2012-01-21 Thread Eric Czech
Hi Brian,

We're trying to do the exact same thing and I find myself asking very
similar questions.

Our solution though has been to find what kind of queries we need to
satisfy on a preemptive basis and leverage cassandra's built-in indexing
features to build those result sets beforehand.  The whole point here then
is that our gain in cost efficiency comes from the fact that disk space is
really cheap and serving up result sets from disk is fast provided that
those result sets are pre-calculated and reasonable in size (even if we
don't know all the values upfront).  For example, when you're writing to
your CF "X", you could also make writes to column family "A" like this:

- write A[Z][Y] = 1
where A = CF, Z = key, Y = column

Answering the question "select count(distinct Y) from X group by Z" then is
as simple as getting a list of rows for CF A and counting the distinct
values of Y and grouping them by Z on the client side.
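
A minimal sketch of that write/read pattern, assuming the Python pycassa client
and the placeholder CF names "X" and "A" from above (the keyspace name is made
up as well):

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('MyKeyspace', ['localhost:9160'])
    x_cf = ColumnFamily(pool, 'X')
    a_cf = ColumnFamily(pool, 'A')

    def record(row_key, z_value, y_value, payload):
        # payload is the dict of columns you were writing to X anyway
        x_cf.insert(row_key, payload)
        # index write: A[Z][Y] = 1 (the value itself is unused)
        a_cf.insert(z_value, {y_value: ''})

    def count_distinct_y_by_z():
        # client-side equivalent of: select count(distinct Y) from X group by Z
        counts = {}
        for z_value, columns in a_cf.get_range(column_count=10000):
            counts[z_value] = len(columns)   # very wide rows would need paging or get_count()
        return counts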

Alternatively, there are much better ways to do this with composite
keys/columns and distributed counters but it's hard for me to tell what
makes the most sense without knowing more about your data / product
requirements.

Either way, I feel your pain in getting things like this to work with
Cassandra when the domain of values for a particular key or column is
unknown and secondary indexing doesn't apply, but I'm positive there's a
much cheaper way to make it work than paying for Oracle if you have at
least a decent idea about what kinds of queries you need to satisfy (which
it sounds like you do).  To Maxim's "death by index" point, you could
certainly go overboard with this concept and cross a pricing threshold with
some other database technology, but I can't imagine you're even close to
being in that boat given how concise your query needs seem to be.

If you're interested, I'd be happy to share how we do these things to save
lots of money over commercial databases and try to relate that to your use
case, but if not, then I hope at least some of this is useful for you.

Good luck either way!

On Fri, Jan 20, 2012 at 9:27 PM, Maxim Potekhin  wrote:

> I certainly agree with "difficult to predict". There is a Danish
> proverb, which goes "it's difficult to make predictions, especially
> about the future".
>
> My point was that it's equally difficult with noSQL and RDBMS.
> The latter requires indexing to operate well, and that's a potential
> performance problem.
>
>
> On 1/20/2012 7:55 PM, Mohit Anchlia wrote:
>
>> I think the problem stems when you have data in a column that you need
>> to run adhoc query on which is not denormalized. In most cases it's
>> difficult to predict the type of query that would be required.
>>
>> Another way of solving this could be to index the fields in a search engine.
>>
>> On Fri, Jan 20, 2012 at 7:37 PM, Maxim Potekhin  wrote:
>>
>>> What makes you think that RDBMS will give you acceptable performance?
>>>
>>> I guess you will try to index it to death (because otherwise the "ad hoc"
>>> queries won't work well if at all), and at this point you may be hit
>>> with a
>>> performance penalty.
>>>
>>> It may be a good idea to interview users and build denormalized views in
>>> Cassandra, maybe on a separate "look-up" cluster. A few percent of users
>>> will be unhappy, but you'll find it hard to do better. I'm talking from
>>> my
>>> experience with an industrial strength RDBMS which doesn't scale very
>>> well
>>> for what you call "ad-hoc" queries.
>>>
>>> Regards,
>>> Maxim
>>>
>>>
>>>
>>>
>>>
>>> On 1/20/2012 9:28 AM, Brian O'Neill wrote:
>>>

 I can't remember if I asked this question before, but

 We're using Cassandra as our transactional system, and building up
 quite a
 library of map/reduce jobs that perform data quality analysis,
 statistics,
 etc.
 (>  100 jobs now)

 But... we are still struggling to provide an "ad-hoc" query mechanism
 for
 our users.

 To fill that gap, I believe we still need to materialize our data in an
 RDBMS.

 Anyone have any ideas?  Better ways to support ad-hoc queries?

 Effectively, our users want to be able to select count(distinct Y) from
 X
 group by Z.
 Where Y and Z are arbitrary columns of rows in X.

 We believe we can create column families with different key structures
 (using Y an Z as row keys), but some column names we don't know / can't
 predict ahead of time.

 Are people doing bulk exports?
 Anyone trying to keep an RDBMS in synch in real-time?

 -brian

 --
 Brian ONeill
 Lead Architect, Health Market Science (http://healthmarketscience.com)
 mobile: 215.588.6024
 blog: http://weblogs.java.net/blog/boneill42/
 blog: http://brianoneill.blogspot.com/


>


Re: Get all keys from the cluster

2012-01-21 Thread Eric Czech
Great!  I'm glad at least one of those ideas was helpful for you.

That's a road we've travelled before and as one last suggestion that might
help, you could alter all client writers to cassandra beforehand so that
they write to BOTH keyspaces BEFORE beginning the SQL-based transfer.  This
might help keep you from having to make multiple passes, unless I'm
missing something.
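
A sketch of that dual-write shim, with made-up keyspace and CF names and again
assuming pycassa:

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    old_cf = ColumnFamily(ConnectionPool('OldKeyspace', ['localhost:9160']), 'MyCF')
    new_cf = ColumnFamily(ConnectionPool('NewKeyspace', ['localhost:9160']), 'MyCF')

    def write(row_key, columns):
        # write every new row to both keyspaces, so rows arriving while the
        # SQL-driven copy is running are never missed
        old_cf.insert(row_key, columns)
        new_cf.insert(row_key, columns)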

On Sat, Jan 21, 2012 at 4:53 AM, Marcel Steinbach  wrote:

>Thanks for your suggestions, Eric!
>
> One of the application uses 1.5TB out of 1.8TB
>
> I'm sorry, maybe that statement was slightly ambiguous. I meant to say
> that one application uses 1.5TB, while the others use 300GB, totalling
> 1.8TB of data. Our total disk capacity, however, is at about 7 TB, so we're
> still far from running out of disk space.
>
> Is there any way that you could do that lookup in reverse where you pull
> the records from your SQL database, figure out which keys aren't necessary,
> and then delete any unnecessary keys that may or may not exist in
> cassandra?
>
> Unfortunately, that won't work since the SQL db only contains the
> keys that we want to _keep_ in cassandra.
>
> If that's not a possibility, then what about creating the same Cassandra
> schema in a different keyspace and copying all the relevant records from
> the current keyspace to the new keyspace using the SQL database records as
> a basis for what is actually "relevant" within the new keyspace.
>
> I like that idea. So instead of iterating over all cassandra rows, I would
> iterate over the SQL DB, which would indeed save me a lot of IO. However,
> rows inserted into my CF during iterating over the SQL DB might not be
> copied into the new keyspace. But maybe we could arrange to do that
> during low-demand-hours to minimize the amount of new inserts and
> additionally run the "copy" a second time with a select on newly inserted
> sql rows. So we'll probably go with that.
>
> Thanks again for your help!
>
> Cheers
> Marcel
>
> On 21.01.2012, at 11:52, Eric Czech wrote:
>
> Is there any way that you could do that lookup in reverse where you pull
> the records from your SQL database, figure out which keys aren't necessary,
> and then delete any unnecessary keys that may or may not exist in
> cassandra?
>
> If that's not a possibility, then what about creating the same Cassandra
> schema in a different keyspace and copying all the relevant records from
> the current keyspace to the new keyspace using the SQL database records as
> a basis for what is actually "relevant" within the new keyspace.  If you
> could perform that transfer, then you could just delete the old 1.5TB
> keyspace altogether, leaving only the data you need.  If that sort of
> duplication would put you over the 1.8TB limit during the transfer, then
> maybe you could consider CF compression upfront.
>
> Short of that, I can tell from experience that doing these sort of "left
> join" deletes from cassandra to SQL really suck.  We have had to resort to
> using hadoop to do this but since our hadoop/cassandra clusters are much
> larger than our single SQL instances, keeping all the hadoop processes from
> basically "DDoS"ing our SQL servers while still making the process faster
> than thrift iterations over all the rows (via custom programs) in cassandra
> hasn't been a convincing solution.
>
> I'd say that the first solution I proposed is definitely the best, but
> also the most unrealistic.  If that's really not a possibility for you,
> then I'd seriously look at trying to make my second suggestion work even if
> it means bringing up new hardware or increasing the capacity of existing
> resources.  That second suggestion also has the added benefit of likely
> minimizing I/O since it's the only solution that doesn't require reading or
> deleting any of the unnecessary data (beyond wholesale keyspace or CF
> deletions) assuming that the actually relevant portion of your data is
> significantly less than 1.5TB.
>
> I hope that helps!
>
> And in the future, you should really try to avoid letting your data size
> get beyond 40 - 50 % of your actual on-disk capacity.  Let me know if
> anyone in the community disagrees, but I'd say you're about 600 GB past the
> point at which you have a lot of easy outs -- but I hope you find one
> anyways!
>
>
> On Sat, Jan 21, 2012 at 2:45 AM, Marcel Steinbach <
> marcel.steinb...@chors.de> wrote:
>
>> We're running an 8 node cluster with different CFs for different
>> applications. One of the applications uses 1.5TB out of 1.8TB in total, but
>> only because we started out without a deletion mechanism and implemented one
>> later on. So there is probably a high amount of old data in there, that we
>> don't even use anymore.
>>
>> Now we want to delete that data. To know which rows we may delete, we
>> have to look up a SQL database. If the key is not in there anymore, we may
>> delete that row in cassandra, too.
>>
>> This basically means, we have to iterate over all the rows in tha

Re: Get all keys from the cluster

2012-01-21 Thread Marcel Steinbach
Thanks for your suggestions, Eric!

> One of the application uses 1.5TB out of 1.8TB

I'm sorry, maybe that statement was slightly ambiguous. I meant to say that one 
application uses 1.5TB, while the others use 300GB, totalling 1.8TB of data. 
Our total disk capacity, however, is at about 7 TB, so we're still far from 
running out of disk space.

> Is there any way that you could do that lookup in reverse where you pull the 
> records from your SQL database, figure out which keys aren't necessary, and 
> then delete any unnecessary keys that may or may not exist in cassandra? 
Unfortunately, that won't work since the SQL db only contains the keys 
that we want to _keep_ in cassandra.

> If that's not a possibility, then what about creating the same Cassandra 
> schema in a different keyspace and copying all the relevant records from the 
> current keyspace to the new keyspace using the SQL database records as a 
> basis for what is actually "relevant" within the new keyspace.  
I like that idea. So instead of iterating over all Cassandra rows, I would 
iterate over the SQL DB, which would indeed save me a lot of IO. However, rows 
inserted into my CF while iterating over the SQL DB might not be copied into 
the new keyspace. But maybe we could arrange to do that 
during low-demand hours to minimize the number of new inserts, and additionally 
run the "copy" a second time with a select on newly inserted SQL rows. So we'll 
probably go with that.

Thanks again for your help!

Cheers
Marcel

On 21.01.2012, at 11:52, Eric Czech wrote:

> Is there any way that you could do that lookup in reverse where you pull the 
> records from your SQL database, figure out which keys aren't necessary, and 
> then delete any unnecessary keys that may or may not exist in cassandra?  
> 
> If that's not a possibility, then what about creating the same Cassandra 
> schema in a different keyspace and copying all the relevant records from the 
> current keyspace to the new keyspace using the SQL database records as a 
> basis for what is actually "relevant" within the new keyspace.  If you could 
> perform that transfer, then you could just delete the old 1.5TB keyspace 
> altogether, leaving only the data you need.  If that sort of duplication 
> would put you over the 1.8TB limit during the transfer, then maybe you could 
> consider CF compression upfront.
> 
> Short of that, I can tell from experience that doing these sort of "left 
> join" deletes from cassandra to SQL really suck.  We have had to resort to 
> using hadoop to do this but since our hadoop/cassandra clusters are much 
> larger than our single SQL instances, keeping all the hadoop processes from 
> basically "DDoS"ing our SQL servers while still making the process faster 
> than thrift iterations over all the rows (via custom programs) in cassandra 
> hasn't been a convincing solution.
> 
> I'd say that the first solution I proposed is definitely the best, but also 
> the most unrealistic.  If that's really not a possibility for you, then I'd 
> seriously look at trying to make my second suggestion work even if it means 
> bringing up new hardware or increasing the capacity of existing resources.  
> That second suggestion also has the added benefit of likely minimizing I/O 
> since it's the only solution that doesn't require reading or deleting any of 
> the unnecessary data (beyond wholesale keyspace or CF deletions) assuming 
> that the actually relevant portion of your data is significantly less than 
> 1.5TB.  
> 
> I hope that helps!
> 
> And in the future, you should really try to avoid letting your data size get 
> beyond 40 - 50 % of your actual on-disk capacity.  Let me know if anyone in 
> the community disagrees, but I'd say you're about 600 GB past the point at 
> which you have a lot of easy outs -- but I hope you find one anyways!
> 
> 
> On Sat, Jan 21, 2012 at 2:45 AM, Marcel Steinbach  
> wrote:
> We're running an 8 node cluster with different CFs for different applications. 
> One of the applications uses 1.5TB out of 1.8TB in total, but only because we 
> started out without a deletion mechanism and implemented one later on. So there 
> is probably a high amount of old data in there, that we don't even use 
> anymore.
> 
> Now we want to delete that data. To know which rows we may delete, we have 
> to look up a SQL database. If the key is not in there anymore, we may delete 
> that row in cassandra, too.
> 
> This basically means, we have to iterate over all the rows in that CF. This 
> kind of begs for hadoop, but that seems not to be an option, currently. I 
> tried.
> 
> So we figured, we could run over the sstables files (maybe only the index), 
> check the keys in the mysql, and later run the deletes on the cluster. This 
> way, we could iterate on each node in parallel.
> 
> Does that sound reasonable? Any pros/cons, maybe a "killer" argument to use 
> hadoop for that?
> 
> Cheers
> Marcel
> 

Re: ideal cluster size

2012-01-21 Thread Eric Czech
I'd also add that one of the biggest complications to arise from having
multiple clusters is that read biased client applications would need to be
aware of all clusters and either aggregate result sets or involve logic to
choose the right cluster based on a particular query.

And from a more operational perspective, I think you'd have a tough time
finding monitoring applications (like Opscenter) that would support multiple
clusters within the same viewport.  Having used multiple clusters in the
past, I can definitely tell you that from an administrative, operational,
and development standpoint, one cluster is almost definitely better than
many.

Oh and I'm positive that there are other cassandra deployments out there
with well beyond 100 nodes, so I don't think you're really treading on
dangerous ground here.

I'd definitely say that you should try to use a single cluster if possible.

On Fri, Jan 20, 2012 at 9:34 PM, Maxim Potekhin  wrote:

> You can also scale not "horizontally" but "diagonally",
> i.e. raid SSDs and have multicore CPUs. This means that
> you'll have same performance with less nodes, making
> it far easier to manage.
>
> SSDs by themselves will give you an order of magnitude
> improvement on I/O.
>
>
>
> On 1/19/2012 9:17 PM, Thorsten von Eicken wrote:
>
>> We're embarking on a project where we estimate we will need on the order
>> of 100 cassandra nodes. The data set is perfectly partitionable, meaning
>> we have no queries that need to have access to all the data at once. We
>> expect to run with RF=2 or =3. Is there some notion of ideal cluster
>> size? Or perhaps asked differently, would it be easier to run one large
>> cluster or would it be easier to run a bunch of, say, 16 node clusters?
>> Everything we've done to date has fit into 4-5 node clusters.
>>
>
>


Re: Proposal to lower the minimum limit for phi_convict_threshold

2012-01-21 Thread Radim Kolar
> Anyway, I can't find any reason to limit the minimum value of 
> phi_convict_threshold to 5. maki
In the real world you often want 9, because Cassandra is too 
sensitive to an overloaded LAN and nodes flip up/down often, 
creating chaos in the cluster if you have a larger number of nodes (say 
30), because state changes do not propagate quickly enough.


Re: Get all keys from the cluster

2012-01-21 Thread Eric Czech
Is there any way that you could do that lookup in reverse where you pull
the records from your SQL database, figure out which keys aren't necessary,
and then delete any unnecessary keys that may or may not exist in
cassandra?

If that's not a possibility, then what about creating the same Cassandra
schema in a different keyspace and copying all the relevant records from
the current keyspace to the new keyspace using the SQL database records as
a basis for what is actually "relevant" within the new keyspace.  If you
could perform that transfer, then you could just delete the old 1.5TB
keyspace altogether, leaving only the data you need.  If that sort of
duplication would put you over the 1.8TB limit during the transfer, then
maybe you could consider CF compression upfront.
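
A rough sketch of that second suggestion, assuming pycassa plus MySQLdb and
purely hypothetical keyspace, CF and table names:

    import MySQLdb
    from pycassa import NotFoundException
    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    src = ColumnFamily(ConnectionPool('OldKeyspace', ['localhost:9160']), 'MyCF')
    dst = ColumnFamily(ConnectionPool('NewKeyspace', ['localhost:9160']), 'MyCF')

    db = MySQLdb.connect(host='sqlhost', user='user', passwd='pw', db='mydb')
    cur = db.cursor()
    cur.execute("SELECT row_key FROM keys_to_keep")

    for (row_key,) in cur:
        try:
            columns = src.get(row_key)   # whole row; very wide rows would need paging
        except NotFoundException:
            continue                     # key exists only in SQL, nothing to copy
        dst.insert(row_key, columns)     # re-insert into the new keyspace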

Short of that, I can tell from experience that doing these sort of "left
join" deletes from cassandra to SQL really suck.  We have had to resort to
using hadoop to do this but since our hadoop/cassandra clusters are much
larger than our single SQL instances, keeping all the hadoop processes from
basically "DDoS"ing our SQL servers while still making the process faster
than thrift iterations over all the rows (via custom programs) in cassandra
hasn't been a convincing solution.

I'd say that the first solution I proposed is definitely the best, but also
the most unrealistic.  If that's really not a possibility for you, then I'd
seriously look at trying to make my second suggestion work even if it means
bringing up new hardware or increasing the capacity of existing resources.
 That second suggestion also has the added benefit of likely minimizing I/O
since it's the only solution that doesn't require reading or deleting any
of the unnecessary data (beyond wholesale keyspace or CF deletions)
assuming that the actually relevant portion of your data is significantly
less than 1.5TB.

I hope that helps!

And in the future, you should really try to avoid letting your data size
get beyond 40 - 50 % of your actual on-disk capacity.  Let me know if
anyone in the community disagrees, but I'd say you're about 600 GB past the
point at which you have a lot of easy outs -- but I hope you find one
anyways!


On Sat, Jan 21, 2012 at 2:45 AM, Marcel Steinbach  wrote:

> We're running an 8 node cluster with different CFs for different
> applications. One of the applications uses 1.5TB out of 1.8TB in total, but
> only because we started out without a deletion mechanism and implemented one
> later on. So there is probably a high amount of old data in there, that we
> don't even use anymore.
>
> Now we want to delete that data. To know which rows we may delete, we
> have to look up a SQL database. If the key is not in there anymore, we may
> delete that row in cassandra, too.
>
> This basically means, we have to iterate over all the rows in that CF.
> This kind of begs for hadoop, but that seems not to be an option,
> currently. I tried.
>
> So we figured, we could run over the sstables files (maybe only the
> index), check the keys in the mysql, and later run the deletes on the
> cluster. This way, we could iterate on each node in parallel.
>
> Does that sound reasonable? Any pros/cons, maybe a "killer" argument to
> use hadoop for that?
>
> Cheers
> Marcel
> 
>


Get all keys from the cluster

2012-01-21 Thread Marcel Steinbach
We're running an 8 node cluster with different CFs for different applications. 
One of the applications uses 1.5TB out of 1.8TB in total, but only because we 
started out without a deletion mechanism and implemented one later on. So there is 
probably a high amount of old data in there that we don't even use anymore. 

Now we want to delete that data. To know which rows we may delete, we have to 
look up a SQL database. If the key is not in there anymore, we may delete that 
row in Cassandra, too. 

This basically means we have to iterate over all the rows in that CF. This 
kind of begs for Hadoop, but that does not seem to be an option currently. I tried.

So we figured we could run over the sstable files (maybe only the index), 
check the keys in MySQL, and later run the deletes on the cluster. This 
way, we could iterate on each node in parallel. 
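
A sketch of that per-node pipeline. It assumes the sstablekeys tool that ships
with Cassandra (which, as far as I recall, prints one hex-encoded row key per
line - verify locally), plus MySQLdb and a hypothetical keys_to_keep table:

    import binascii
    import subprocess
    import MySQLdb

    # made-up path; point this at one of the node's local SSTables
    SSTABLE = '/var/lib/cassandra/data/MyKeyspace/MyCF-hc-1234-Data.db'

    db = MySQLdb.connect(host='sqlhost', user='user', passwd='pw', db='mydb')
    cur = db.cursor()

    def key_is_kept(key):
        cur.execute("SELECT 1 FROM keys_to_keep WHERE row_key = %s", (key,))
        return cur.fetchone() is not None

    # sstablekeys dumps the row keys of one SSTable from its index; the hex
    # decoding below assumes its usual one-hex-key-per-line output
    out = subprocess.check_output(['sstablekeys', SSTABLE])
    with open('keys_to_delete.txt', 'w') as f:
        for line in out.splitlines():
            key = binascii.unhexlify(line.strip())
            if key and not key_is_kept(key):
                f.write(key + '\n')      # feed these into a batched delete later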

Does that sound reasonable? Any pros/cons, maybe a "killer" argument to use 
hadoop for that?

Cheers
Marcel

Proposal to lower the minimum limit for phi_convict_threshold

2012-01-21 Thread Maki Watanabe
Hello,
The current trunk limits the value of phi_convict_threshold to the range 5 to 16
in DatabaseDescriptor.java.
The phi value is calculated in FailureDetector.java as

  PHI_FACTOR x time_since_last_gossip / mean_heartbeat_interval

And the PHI_FACTOR is a predefined value:
PHI_FACTOR = 1 / Log(10) =~ 0.43

So if you use the default phi_convict_threshold = 8, it means the FailureDetector
waits for 8 / 0.43 = 18.6 missed heartbeat intervals before it detects a
node failure.

Even with the minimum value of 5, it takes 5 / 0.43 = 11.6 missed heartbeat
intervals for failure detection.
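
The relationship is easy to tabulate; a small Python sketch using the constants
above:

    import math

    PHI_FACTOR = 1 / math.log(10)   # ~0.434, as defined in FailureDetector.java

    def intervals_to_convict(threshold):
        # phi = PHI_FACTOR * time_since_last_gossip / mean_heartbeat_interval,
        # so conviction happens after roughly threshold / PHI_FACTOR mean intervals
        return threshold / PHI_FACTOR

    for threshold in (2, 5, 8, 16):
        print('%2d -> %4.1f heartbeat intervals' % (threshold, intervals_to_convict(threshold)))
    # prints roughly 4.6, 11.5, 18.4 and 36.8 intervals respectively

With a lower floor such as 2 or 3, a node on a reliable single-datacenter LAN
could be convicted after only a handful of missed intervals.
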
I think that is a bit much for a Cassandra ring built on a reliable network
in a single datacenter.
If DatabaseDescriptor.java accepted a smaller phi_convict_threshold,
we would be able to configure Cassandra to detect failures more rapidly (or
shoot ourselves in the foot).
Anyway, I can't find any reason to limit the minimum value of
phi_convict_threshold to 5.

maki