Re: Smart Table creation for 2D range query

2017-05-09 Thread Jim Ancona
Couldn't you use a bucketing strategy for the hash value, much like with
time series data? That is, choose a partition key granularity that puts a
reasonable number of rows in a partition, with the actual hash being the
clustering key. Then ranges that fall within the partition key granularity
could be queried efficiently.
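For example, a minimal CQL sketch of that idea (table, column names and the
prefix length are illustrative, not from the thread):

CREATE TABLE points_by_geohash_prefix (
    geohash_prefix text,   -- e.g. the first 4 characters of the geohash: the partition "granularity"
    geohash text,          -- full-precision geohash as the clustering key
    point_id uuid,
    PRIMARY KEY ((geohash_prefix), geohash, point_id)
);

-- a range that falls inside a single prefix is then served by one partition
SELECT point_id, geohash
FROM points_by_geohash_prefix
WHERE geohash_prefix = '9q8y'
  AND geohash >= '9q8yy' AND geohash < '9q8yz';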

Jim

On Tue, May 9, 2017 at 11:19 AM, Jon Haddad 
wrote:

> The problem with using geohashes is that you can’t efficiently do ranges
> with random token distribution.  So even if your scalar values are close to
> each other numerically they’ll likely end up on different nodes, and you
> end up doing a scatter gather.
>
> If the goal is to provide a scalable solution, building a table that
> functions as an R-Tree or Quad Tree is the only way I know that can solve
> the problem without scanning the entire cluster.
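One possible shape for such a table, sketched in CQL (the cell encoding and
depth are assumptions, not from the thread): the client maps each point to a
fixed-depth quadtree cell and computes which cells cover a query region.

CREATE TABLE points_by_quad_cell (
    cell_id text,    -- fixed-depth quadtree cell identifier, e.g. '0231' (assumed encoding)
    point_id uuid,
    x double,
    y double,
    PRIMARY KEY ((cell_id), point_id)
);

-- the client computes which cells intersect the query rectangle or radius
SELECT point_id, x, y
FROM points_by_quad_cell
WHERE cell_id IN ('0231', '0232', '0233');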
>
> Jon
>
> On May 9, 2017, at 10:11 AM, Jim Ancona  wrote:
>
> There are clever ways to encode coordinates into a single scalar value
> where points that are close on a surface are also close in value, making
> queries efficient. Examples are Geohash
> <https://en.wikipedia.org/wiki/Geohash> and Google's S2
> <https://docs.google.com/presentation/d/1Hl4KapfAENAOf4gv-pSngKwvS_jwNVHRPZTTDzXXn6Q/view#slide=id.i0>.
> As Jon mentions, this puts more work on the client, but might give you a
> lot of querying flexibility when using Cassandra.
>
> Jim
>
> On Mon, May 8, 2017 at 11:13 PM, Jon Haddad 
> wrote:
>
>> It gets a little tricky when you try to add in the coordinates to the
>> clustering key if you want to do operations that are more complex.  For
>> instance, finding all the elements within a radius of point (x,y) isn’t
>> particularly fun with Cassandra.  I recommend moving that logic into the
>> application.
>>
>> > On May 8, 2017, at 10:06 PM, kurt greaves  wrote:
>> >
>> > Note that will not give you the desired range queries of 0 <= x <= 1
>> and 0 <= y <= 1.
>> >
>> >
>> > ​Something akin to Jon's solution could give you those range queries if
>> you made the x and y components part of the clustering key.
>> >
>> > For example, a space of (1,1) could contain all x,y coordinates where x
>> and y are > 0 and <= 1. You would then have a table like:
>> >
>> > CREATE TABLE geospatial (
>> > space text,
>> > x double,
>> > y double,
>> > item text,
>> > m1 text,
>> > m2 text,
>> > m3 text,
>> > primary key ((space), x, y, m1, m2, m3)
>> > );
>> >
>> > A query of select * where space = '1,1' and x < 1 and x > 0.5 and y < 0.2
>> and y > 0.1; should yield all x and y pairs and their distinct metadata. Or
>> something like that anyway.
>> >
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>>
>
>


Re: Smart Table creation for 2D range query

2017-05-09 Thread Jim Ancona
There are clever ways to encode coordinates into a single scalar value
where points that are close on a surface are also close in value, making
queries efficient. Examples are Geohash
<https://en.wikipedia.org/wiki/Geohash> and Google's S2
<https://docs.google.com/presentation/d/1Hl4KapfAENAOf4gv-pSngKwvS_jwNVHRPZTTDzXXn6Q/view#slide=id.i0>.
As Jon mentions, this puts more work on the client, but might give you a
lot of querying flexibility when using Cassandra.

Jim

On Mon, May 8, 2017 at 11:13 PM, Jon Haddad 
wrote:

> It gets a little tricky when you try to add in the coordinates to the
> clustering key if you want to do operations that are more complex.  For
> instance, finding all the elements within a radius of point (x,y) isn’t
> particularly fun with Cassandra.  I recommend moving that logic into the
> application.
>
> > On May 8, 2017, at 10:06 PM, kurt greaves  wrote:
> >
> > Note that will not give you the desired range queries of 0 <= x <= 1 and
> 0 <= y <= 1.
> >
> >
> > ​Something akin to Jon's solution could give you those range queries if
> you made the x and y components part of the clustering key.
> >
> > For example, a space of (1,1) could contain all x,y coordinates where x
> and y are > 0 and <= 1. You would then have a table like:
> >
> > CREATE TABLE geospatial (
> > space text,
> > x double,
> > y double,
> > item text,
> > m1 text,
> > m2 text,
> > m3 text,
> > primary key ((space), x, y, m1, m2, m3)
> > );
> >
> > A query of select * where space = '1,1' and x < 1 and x > 0.5 and y < 0.2
> and y > 0.1; should yield all x and y pairs and their distinct metadata. Or
> something like that anyway.
> >
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Effective partition key for time series data, which allows range queries?

2017-04-05 Thread Jim Ancona
That's an interesting refinement! I'll keep it in mind the next time this
sort of thing comes up.

Jim

On Wed, Apr 5, 2017 at 9:22 AM, Eric Stevens  wrote:

> Jim's basic model is similar to how we've solved this exact kind of
> problem many times.  From my own experience, I strongly recommend that you
> make a `bucket` field in the partition key, and a `time` field in the
> cluster key.  Make both of these of data type `timestamp`.  Then use
> application logic to floor the bucket to an appropriate interval according
> to your chosen bucket size.
>
> The reason is that as your data needs grow, the one thing you can be
> pretty confident in is that your assumptions about data density per
> partition will turn out to be eventually wrong.  This is either because of
> expanding requirements (you're adding new fields to this table), because of
> increased application usage (you're being successful!), or because you
> ran into a use case with a different data density per bucket than you had
> anticipated (you're not prescient).
>
> It's easy in application code to adjust your timestamp interval if your
> keying allows for arbitrary adjustments.  Most often you're going to end up
> making smaller buckets as your needs progress.  The upshot is that with a
> little careful selection of bucketing strategy, partition key range
> iterations are still correct if you adjust from say a 24 hour bucket to a
> 12 hour, 6 hour, 3 hour, 1 hour, 30 minute, 15 minute, or 1 minute bucket.
> The data written under the larger bucket size still lands on a smaller
> bucket interval, so you don't really even need to use complex logic in the
> application to adapt to the new bucket size.  You definitely don't want to
> paint yourself into a corner where you need a smaller bucket size but your
> data model didn't leave room for it.
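A minimal CQL sketch of that layout (the table name and non-key columns are
assumed, not from the thread); the application floors the bucket value itself:

CREATE TABLE events_by_user_bucket (
    user_id text,
    bucket timestamp,    -- floored by the application, e.g. to the top of the day or hour
    time timestamp,      -- the actual event time
    payload text,
    PRIMARY KEY ((user_id, bucket), time)
);

-- with a 1-hour bucket, an event at 09:47 is written with bucket = 09:00,
-- so shrinking the bucket size later keeps partition-key iteration correct
INSERT INTO events_by_user_bucket (user_id, bucket, time, payload)
VALUES ('foo', '2017-04-05 09:00:00+0000', '2017-04-05 09:47:13+0000', '...');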
>
> On Tue, Apr 4, 2017 at 2:59 PM Jim Ancona  wrote:
>
>> The typical recommendation for maximum partition size is on the order of
>> 100mb and/or 100,000 rows. That's not a hard limit, but you may be setting
>> yourself up for issues as you approach or exceed those numbers.
>>
>> If you need to reduce partition size, the typical way to do this is by
>> "bucketing," that is adding a synthetic column to the partition key to
>> separate the data into separate buckets. In your example above, I assume
>> that your current primary key is (user, timestamp), where user is the
>> partition key and timestamp is the clustering key. Say that you want to
>> store a maximum of a year's worth of data in a partition. You would make the
>> primary key be ((user, year), timestamp). The partition key is now
>> (user, year) where year is the year part of the timestamp. Now if you want
>> to query the data for last month, you would do:
>>
>> select * from data where user_id = 'foo' and year = 2017 and timestamp >=
>> '<1 month ago>' and timestamp <= ''
>>
>>
>> If you wanted to get the data for the last 6 months, you'd do something
>> like:
>>
>> select * from data where user_id = 'foo' and year in (2016, 2017) and
>> timestamp >= '<6 months ago>' and timestamp <= ''  (Notice that
>> because the query spans two years, you have to include both years in the
>> select criteria so that C* knows which partitions to query. )
>>
>>
>> You can make the buckets smaller (e.g. weeks, days, hours instead of
>> years), but of course querying multiple buckets is less efficient, so it's
>> worth making your buckets as large as you can without making them too big.
>>
>> Hope this helps!
>>
>> Jim
>>
>>
>>
>>
>> On Mon, Mar 27, 2017 at 8:47 PM, Ali Akhtar  wrote:
>>
>> I have a use case where the data for individual users is being tracked,
>> and every 15 minutes or so, the data for the past 15 minutes is inserted
>> into the table.
>>
>> The table schema looks like:
>> user id, timestamp, foo, bar, etc.
>>
>> Where foo, bar, etc are the items being tracked, and their values over
>> the past 15 minutes.
>>
>> I initially planned to use the user id as the primary key of the table.
>> But, I realized that this may cause really wide rows ( tracking for 24
>> hours means 96 records inserted (1 for each 15 min window), over 1 year
>> this means 36k records per user, over 2 years, 72k, etc).
>>
>> I know the  limit of wide rows is billions of records, but I've heard
>> that the practical limit is much lower.
>>
>> So I considered using a composite primary key: (user, timestamp)
>>
>> If I'm correct, the above should create a new row for each user &
>> timestamp logged.
>>
>> However, will I still be able to do range queries on the timestamp, to
>> e.g return the data for the last week?
>>
>> E.g select * from data where user_id = 'foo' and timestamp >= '<1 month
>> ago>' and timestamp <= '' ?
>>
>>
>>


Re: Effective partition key for time series data, which allows range queries?

2017-04-04 Thread Jim Ancona
The typical recommendation for maximum partition size is on the order of
100mb and/or 100,000 rows. That's not a hard limit, but you may be setting
yourself up for issues as you approach or exceed those numbers.

If you need to reduce partition size, the typical way to do this is by
"bucketing," that is adding a synthetic column to the partition key to
separate the data into separate buckets. In your example above, I assume
that your current primary key is (user, timestamp), where user is the
partition key and timestamp is the clustering key. Say that you want to
store a maximum of a year's worth of data in a partition. You would make the
primary key be ((user, year), timestamp). The partition key is now (user,
year) where year is the year part of the timestamp. Now if you want to
query the data for last month, you would do:

select * from data where user_id = 'foo' and year = 2017 and timestamp >=
'<1 month ago>' and timestamp <= ''


If you wanted to get the data for the last 6 months, you'd do something
like:

select * from data where user_id = 'foo' and year in (2016, 2017) and
timestamp >= '<6 months ago>' and timestamp <= ''  (Notice that
because the query spans two years, you have to include both years in the
select criteria so that C* knows which partitions to query. )


You can make the buckets smaller (e.g. weeks, days, hours instead of
years), but of course querying multiple buckets is less efficient, so it's
worth making your buckets as large as you can without making them too big.
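As a concrete sketch of the bucketed key described above (using the column
names from the original question; the types are assumed):

CREATE TABLE data (
    user_id text,
    year int,              -- year component of the timestamp, e.g. 2017
    timestamp timestamp,
    foo double,
    bar double,
    PRIMARY KEY ((user_id, year), timestamp)
);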

Hope this helps!

Jim




On Mon, Mar 27, 2017 at 8:47 PM, Ali Akhtar  wrote:

> I have a use case where the data for individual users is being tracked,
> and every 15 minutes or so, the data for the past 15 minutes is inserted
> into the table.
>
> The table schema looks like:
> user id, timestamp, foo, bar, etc.
>
> Where foo, bar, etc are the items being tracked, and their values over the
> past 15 minutes.
>
> I initially planned to use the user id as the primary key of the table.
> But, I realized that this may cause really wide rows ( tracking for 24
> hours means 96 records inserted (1 for each 15 min window), over 1 year
> this means 36k records per user, over 2 years, 72k, etc).
>
> I know the  limit of wide rows is billions of records, but I've heard that
> the practical limit is much lower.
>
> So I considered using a composite primary key: (user, timestamp)
>
> If I'm correct, the above should create a new row for each user &
> timestamp logged.
>
> However, will I still be able to do range queries on the timestamp, to e.g
> return the data for the last week?
>
> E.g select * from data where user_id = 'foo' and timestamp >= '<1 month
> ago>' and timestamp <= '' ?
>
>


Re: How to query '%' character using LIKE operator in Cassandra 3.7?

2016-09-22 Thread Jim Ancona
To answer DuyHai's question without introducing new syntax, I'd suggest:

LIKE '%%%escape' means STARTS WITH '%' AND ENDS WITH 'escape'

So the first two %'s are translated to a literal, non-wildcard % and the
third % is a wildcard because it's not doubled.
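For reference, a minimal SASI setup matching the 'escape' table and 'val'
column used further down in the thread (schema assumed; the escape semantics
are exactly what is being debated here):

CREATE TABLE escape (
    id int PRIMARY KEY,
    val text
);

CREATE CUSTOM INDEX escape_val_idx ON escape (val)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 'mode': 'CONTAINS' };

-- the pattern under discussion: leading '%%' as an escaped literal '%',
-- the remaining '%' as a wildcard
SELECT * FROM escape WHERE val LIKE '%%%escape';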

Jim

On Thu, Sep 22, 2016 at 11:40 AM, Mikhail Krupitskiy <
mikhail.krupits...@jetbrains.com> wrote:

> I guess that it should be similar to how it is done in SQL for LIKE
> patterns.
>
> You can introduce an escape character, e.g. ‘\’.
> Examples:
> ‘%’ - any string
> ‘\%’ - equal to ‘%’ character
> ‘\%foo%’ - starts from ‘%foo’
> ‘%%%escape’ - ends with ’escape’
> ‘\%%’ - starts from ‘%’
> ‘\\\%%’ - starts from ‘\%’ .
>
> What do you think?
>
> Thanks,
> Mikhail
>
> On 22 Sep 2016, at 16:47, DuyHai Doan  wrote:
>
> Hello Mikhail
>
> It's more complicated than it seems
>
> LIKE '%%escape' means  EQUAL TO '%escape'
>
> LIKE '%escape' means ENDS WITH 'escape'
>
> What about LIKE '%%%escape'?
>
> How should we treat this case ? Replace %% by % at the beginning of the
> searched term ??
>
>
>
> On Thu, Sep 22, 2016 at 3:31 PM, Mikhail Krupitskiy <
> mikhail.krupits...@jetbrains.com> wrote:
>
>> Hi!
>>
>> We’ve talked about two items:
>> 1) ‘%’ as a wildcard in the middle of LIKE pattern.
>> 2) How to escape ‘%’ to be able to find strings with the ‘%’ char with
>> help of LIKE.
>>
>> Item #1was resolved as CASSANDRA-12573.
>>
>> Regarding to item #2: you said the following:
>>
>> A possible fix would be:
>>
>> 1) convert the bytebuffer into plain String (UTF8 or ASCII, depending on
>> the column data type)
>> 2) remove the escape character e.g. before parsing OR use some advanced
>> regex to exclude the %% from parsing e.g
>>
>> Step 2) is dead easy but step 1) is harder because I don't know if
>> converting the bytebuffer into String at this stage of the CQL parser is
>> expensive or not (in term of computation)
>>
>> Let me try a patch
>>
>> So is there any update on this?
>>
>> Thanks,
>> Mikhail
>>
>>
>> On 20 Sep 2016, at 18:38, Mikhail Krupitskiy <
>> mikhail.krupits...@jetbrains.com> wrote:
>>
>> Hi!
>>
>> Have you had a chance to try your patch or solve the issue in an other
>> way?
>>
>> Thanks,
>> Mikhail
>>
>> On 15 Sep 2016, at 16:02, DuyHai Doan  wrote:
>>
>> Ok so I've found the source of the issue, it's pretty well hidden because
>> it is NOT in the SASI source code directly.
>>
>> Here is the method where C* determines what kind of LIKE expression
>> you're using (LIKE_PREFIX , LIKE CONTAINS or LIKE_MATCHES)
>>
>> https://github.com/apache/cassandra/blob/trunk/src/java/org/
>> apache/cassandra/cql3/restrictions/SingleColumnRestriction.java#L733-L778
>>
>> As you can see, it's pretty simple, maybe too simple. Indeed, they forget
>> to remove the escape character BEFORE doing the matching, so if your search is
>> LIKE
>> '%%esc%', the detected expression is LIKE_CONTAINS.
>>
>> A possible fix would be:
>>
>> 1) convert the bytebuffer into plain String (UTF8 or ASCII, depending on
>> the column data type)
>> 2) remove the escape character e.g. before parsing OR use some advanced
>> regex to exclude the %% from parsing e.g
>>
>> Step 2) is dead easy but step 1) is harder because I don't know if
>> converting the bytebuffer into String at this stage of the CQL parser is
>> expensive or not (in term of computation)
>>
>> Let me try a patch
>>
>>
>>
>> On Wed, Sep 14, 2016 at 9:42 AM, DuyHai Doan 
>> wrote:
>>
>>> Ok you're right, I get your point
>>>
>>> LIKE '%%esc%' --> startWith('%esc')
>>>
>>> LIKE 'escape%%' -->  = 'escape%'
>>>
>>> What I strongly suspect is that in the source code of SASI, we parse the
>>> % xxx % expression BEFORE applying escape. That will explain the observed
>>> behavior. E.g:
>>>
>>> LIKE '%%esc%'  parsed as %xxx% where xxx = %esc
>>>
>>> LIKE 'escape%%' parsed as xxx% where xxx =escape%
>>>
>>> Let me check in the source code and try to reproduce the issue
>>>
>>>
>>>
>>> On Tue, Sep 13, 2016 at 7:24 PM, Mikhail Krupitskiy <
>>> mikhail.krupits...@jetbrains.com> wrote:
>>>
 Looks like we have different understanding of what results are expected.
 I based my understanding on http://docs.datastax.com/en
 /cql/3.3/cql/cql_using/useSASIIndex.html
 According to the doc ‘esc’ is a pattern for exact match and I guess
 that there is no semantical difference between two LIKE patterns (both of
 patterns should be treated as ‘exact match'): ‘%%esc’ and ‘esc’.

 SELECT * FROM escape WHERE val LIKE '%%esc%'; --> Give all results
 *containing* '%esc' so *%esc*apeme is a possible match and also escape
 *%esc*

 Why ‘containing’? I expect that it should be ’starting’..


 SELECT * FROM escape WHERE val LIKE 'escape%%' --> Give all results
 *starting* with 'escape%' so *escape%*me is a valid result and also
 *escape%*esc

 Why ’starting’? I expect that it should be ‘exact matching’.

 Also I expect that “ LIKE ‘%s%sc%’ ” will return ‘escape%esc’ but

Re: Isolation in case of Single Partition Writes and Batching with LWT

2016-09-12 Thread Jim Ancona
Mark,

Is there some official Apache policy on which sites it's appropriate to
link to on an Apache mailing list? If so, could you please post a link to
it so we can all understand the rules. Or is this your personal opinion on
what you'd like to see here?

Thanks!

On Mon, Sep 12, 2016 at 7:34 AM, Mark Thomas  wrote:

> On 11/09/2016 23:07, Ryan Svihla wrote:
> > 1. A batch with updates to a single partition turns into a single
> > mutation so partition writes aren't possible (so may as well use
> > Unlogged batches)
> > 2. Yes, so use local_serial or serial reads and all updates you want to
> > honor LWT need to be LWT as well, this way everything is buying into the
> > same protocol and behaving accordingly.
> > 3. LWT works with batch (has to be same partition).
> > https://docs.datastax.com/en/cql/3.1/cql/cql_reference/batch_r.html if
> > condition doesn't fire none of the batch will (same partition will mean
> > it'll be the same mutation anyway so there really isn't any magic going
> on).
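A sketch of what point 3 looks like in CQL, using the static write_time column
described later in this thread (table layout assumed):

-- assumed schema: one partition per job, write_time is a static column shared by all rows
CREATE TABLE job_state (
    job_id text,
    row_id text,
    write_time timestamp static,
    value text,
    PRIMARY KEY ((job_id), row_id)
);

BEGIN BATCH
    UPDATE job_state SET write_time = '2016-09-12 10:00:00+0000'
        WHERE job_id = 'job1' IF write_time = '2016-09-12 09:45:00+0000';
    UPDATE job_state SET value = 'new-state' WHERE job_id = 'job1' AND row_id = 'r1';
APPLY BATCH;
-- if the IF condition does not hold, none of the statements in the batch are applied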
>
> Is there a good reason for linking to the 3rd party docs rather than the
> official docs in this case? I can't see one at the moment.
>
> The official docs appear to be:
>
> http://cassandra.apache.org/doc/latest/cql/dml.html#batch
>
> It might not matter in this particular instance but it looks as if there
> is a little more to the syntax than the 3rd party docs suggest (even if
> you switch to the latest version of those 3rd party docs).
>
> Generally, if you are going to point to docs, please point to the
> official Apache Cassandra docs unless there is a very good reason not
> to. (And if the good reason is that there’s a deficiency in the Apache
> Cassandra docs, please make it known on the list or in a Jira so someone
> can write what’s missing)
>
> Mark
>
>
> > Your biggest issue with such a design will be contention (as it would
> > with an rdbms with say row locking), as by intent you're making all
> > reads and writes block until any pending ones are complete. I'm sure
> > there are a couple things I forgot but this is the standard wisdom.
> >
> > Regards,
> >
> > Ryan Svihla
> >
> > On Sep 11, 2016, at 3:49 PM, Jens Rantil  > > wrote:
> >
> >> Hi,
> >>
> >> This might be off-topic, but you could always use Zookeeper locking
> >> and/or Apache Kafka topic keys for doing things like this.
> >>
> >> Cheers,
> >> Jens
> >>
> >> On Tuesday, September 6, 2016, Bhuvan Rawal  >> > wrote:
> >>
> >> Hi,
> >>
> >> We are working on a multi-threaded distributed design
> >> in which a thread reads current state from Cassandra (Single
> >> partition ~ 20 Rows), does some computation and saves it back in.
> >> But it needs to be ensured that in between reading and writing by
> >> that thread any other thread should not have saved any operation
> >> on that partition.
> >>
> >> We have thought of a solution for the same - *having a write_time
> >> column* in the schema and making it static. Every time the thread
> >> picks up a job read will be performed with LOCAL_QUORUM. While
> >> writing into Cassandra batch will contain a LWT (IF write_time is
> >> read time) otherwise read will be performed and computation will
> >> be done again and so on. This will ensure that while saving
> >> partition is in a state it was read from.
> >>
> >> In order to avoid race condition we need to ensure couple of things:
> >>
> >> 1. While saving data in a batch with a single partition (*Rows may
> >> be Updates, Deletes, Inserts)* are they Isolated per replica node.
> >> (Not necessarily on a cluster as a whole). Is there a possibility
> >> of client reading partial rows?
> >>
> >> 2. If we do a LOCAL_QUORUM read and LOCAL_QUORUM writes in this
> >> case could there a chance of inconsistency in this case (When LWT
> >> is being used in batches).
> >>
> >> 3. Is it possible to use multiple LWT in a single Batch? In
> >> general how does LWT performs with Batch and is Paxos acted on
> >> before batch execution?
> >>
> >> Can someone help us with this?
> >>
> >> Thanks & Regards,
> >> Bhuvan
> >>
> >>
> >>
> >> --
> >> Jens Rantil
> >> Backend engineer
> >> Tink AB
> >>
> >> Email: jens.ran...@tink.se 
> >> Phone: +46 708 84 18 32
> >> Web: www.tink.se 
> >>
> >> Facebook | LinkedIn | Twitter
> >>
>
>


Re: Support/Consulting companies

2016-08-19 Thread Jim Ancona
There's also a list of companies that provide Cassandra-related services on
the wiki:

https://wiki.apache.org/cassandra/ThirdPartySupport

Jim

On Fri, Aug 19, 2016 at 3:37 PM, Chris Tozer 
wrote:

> Instaclustr ( Instaclustr.com ) also offers Cassandra consulting
>
>
> On Friday, August 19, 2016,  wrote:
>
>> Yes, TLP is the place to go!
>>
>> Joe
>>
>> Sent from my iPhone
>>
>> On Aug 19, 2016, at 12:03 PM, Huang, Roger  wrote:
>>
>> http://thelastpickle.com/
>>
>>
>>
>>
>>
>> *From:* Roxy Ubi [mailto:roxy...@gmail.com]
>> *Sent:* Friday, August 19, 2016 2:02 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Support/Consulting companies
>>
>>
>>
>> Howdy,
>>
>> I'm looking for a list of support or consulting companies that provide
>> contracting services related to Cassandra.  Is there a comprehensive list
>> somewhere?  Alternatively could you folks tell me who you use?
>>
>> Thanks in advance for any replies!
>>
>> Roxy
>>
>>
>
> --
> Chris Tozer
>
> Instaclustr
>
> (408) 781-7914
>
> Spin Up a Free 14 Day Trial 
>
>


Re: Migrating to CQL and Non Compact Storage

2016-04-11 Thread Jim Ancona
On Mon, Apr 11, 2016 at 4:19 PM, Jack Krupansky 
wrote:

> Some of this may depend on exactly how you are using so-called COMPACT
> STORAGE. I mean, if your tables really are modeled as all but exactly one
> column in the primary key, then okay, COMPACT STORAGE may be a reasonable
> model, but that seems to be a very special, narrow use case, so for all
> other cases you really do need to re-model for CQL for Cassandra 4.0.
>
There was no such restriction when modeling with Thrift. It's an artifact
of how CQL chose to expose the Thrift data model.

> I'm not sure why anybody is thinking otherwise. Sure, maybe will be a lot
> of work, but that's life and people have been given plenty of notice.
>
"That's life" minimizes the difficulty of doing this sort of migration for
large, mission-critical systems. It would require large amounts of time, as
well as temporarily doubling hardware resources amounting to dozens up to
hundreds of nodes.

> And if it takes hours to do a data migration, I think that you can consider
> yourself lucky relative to people who may require days.
>
Or more.

> Now, if there are particular Thrift use cases that don't have efficient
> models in CQL, that can be discussed. Start by expressing the Thrift data
> in a neutral, natural, logical, plain English data model, and then we can
> see how that maps to CQL.
>
> So, where are we? Is it just the complaint that migration is slow and
> re-modeling is difficult, or are there specific questions about how to do
> the re-modeling?
>
My purpose is not to complain, but to educate :-). Telling someone "just
remodel your data" is not helpful, especially after he's told you that he
tried that and ran into performance issues. (Note that the link he posted
shows an order of magnitude decrease in throughput when moving from COMPACT
STORAGE to CQL3 native tables for analytics workloads, so it's not just his
use case.) Do you have any suggestions of ways he might mitigate those
issues? Is there information you need to make such a recommendation?

Jim


>
>
> -- Jack Krupansky
>
> On Mon, Apr 11, 2016 at 1:30 PM, Anuj Wadehra 
> wrote:
>
>> Thanks Jim. I think you understand the pain of migrating TBs of data to
>> new tables. There is no command to change from compact to non compact
>> storage and the fastest solution to migrate data using Spark is too slow
>> for production systems.
>>
>> And the pain gets bigger when your performance dips after moving to non
>> compact storage table. That's because non compact storage is quite an
>> inefficient storage format till 3.x and it incurs a heavy penalty on Row
>> Scan performance in Analytics workloads.
>> Please go through the link to understand how old Compact storage gives
>> much better performance than non compact storage as far as Row Scans are
>> concerned:
>> https://www.oreilly.com/ideas/apache-cassandra-for-analytics-a-performance-and-storage-analysis
>>
>> The flexibility of Cql comes at heavy cost until 3.x.
>>
>>
>>
>> Thanks
>> Anuj
>> Sent from Yahoo Mail on Android
>> <https://overview.mail.yahoo.com/mobile/?.src=Android>
>>
>> On Mon, 11 Apr, 2016 at 10:35 PM, Jim Ancona
>>  wrote:
>> Jack, the Datastax link he posted (
>> http://www.datastax.com/dev/blog/thrift-to-cql3) says that for column
>> families with mixed dynamic and static columns: "The only solution to be
>> able to access the column family fully is to remove the declared columns
>> from the thrift schema altogether..." I think that page describes the
>> problem and the potential solutions well. I haven't seen an answer to
>> Anuj's question about why the native CQL solution using collections doesn't
>> perform as well.
>>
>> Keep in mind that some of us understand CQL just fine but have working
>> pre-CQL Thrift-based systems storing hundreds of terabytes of data and with
>> requirements that mean that saying "bite the bullet and re-model your
>> data" is not really helpful. Another quote from that Datastax link:
>> "Thrift isn't going anywhere." Granted that that link is three-plus years
>> old, but Thrift now *is* now going away, so it's not unexpected that people
>> will be trying to figure out how to deal with that. It's bad enough that we
>> need to rewrite our clients to use CQL instead of Thrift. It's not helpful
>> to say that we should also re-model and migrate all our data.
>>
>> Jim
>>
>> On Mon, Apr 11, 2016 at 11:29 AM, Jack Krupansky <
>> jack.krupan...@gmail.com> wrote:
>>
>>> Sorry, but your message is too confusing - yo

Re: Migrating to CQL and Non Compact Storage

2016-04-11 Thread Jim Ancona
Jack, the Datastax link he posted (
http://www.datastax.com/dev/blog/thrift-to-cql3) says that for column
families with mixed dynamic and static columns: "The only solution to be
able to access the column family fully is to remove the declared columns
from the thrift schema altogether..." I think that page describes the
problem and the potential solutions well. I haven't seen an answer to
Anuj's question about why the native CQL solution using collections doesn't
perform as well.

Keep in mind that some of us understand CQL just fine but have working
pre-CQL Thrift-based systems storing hundreds of terabytes of data and with
requirements that mean that saying "bite the bullet and re-model your data"
is not really helpful. Another quote from that Datastax link: "Thrift isn't
going anywhere." Granted that that link is three-plus years old, but Thrift
now *is* now going away, so it's not unexpected that people will be trying
to figure out how to deal with that. It's bad enough that we need to
rewrite our clients to use CQL instead of Thrift. It's not helpful to say
that we should also re-model and migrate all our data.

Jim

On Mon, Apr 11, 2016 at 11:29 AM, Jack Krupansky 
wrote:

> Sorry, but your message is too confusing - you say "reading dynamic
> columns in CQL" and "make the table schema less", but neither has any
> relevance to CQL! 1. CQL tables always have schemas. 2. All columns in CQL
> are statically declared (even maps/collections are statically declared
> columns.) Granted, it is a challenge for Thrift users to get used to the
> terminology of CQL, but it is required. If necessary, review some of the
> free online training videos for data modeling.
>
> Unless your data model is very simple and does directly translate into
> CQL, you probably do need to bite the bullet and re-model your data to
> exploit the features of CQL rather than fight CQL trying to mimic Thrift
> per se.
>
> In any case, take another shot at framing the problem and then maybe
> people here can help you out.
>
> -- Jack Krupansky
>
> On Mon, Apr 11, 2016 at 10:39 AM, Anuj Wadehra 
> wrote:
>
>> Any comments or suggestions on this one?
>>
>> Thanks
>> Anuj
>>
>> Sent from Yahoo Mail on Android
>> 
>>
>> On Sun, 10 Apr, 2016 at 11:39 PM, Anuj Wadehra
>>  wrote:
>> Hi
>>
>> We are on 2.0.14 and Thrift. We are planning to migrate to CQL soon but
>> facing some challenges.
>>
>> We have a cf with a mix of statically defined columns and dynamic columns
>> (created at run time). For reading dynamic columns in CQL,
>> we have two options:
>>
>> 1. Drop all columns and make the table schema-less. This way, we will get
>> a CQL row for each column defined for a row key -- as mentioned here:
>> http://www.datastax.com/dev/blog/thrift-to-cql3
>>
>> 2. Migrate entire data to a new non compact storage table and create
>> collections for dynamic columns in new table.
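A sketch of the two shapes being compared (column names assumed; option 1 is
the Thrift-era layout that compact storage exposes, option 2 folds the dynamic
columns into a CQL collection):

-- option 1: schema-less compact storage table, one CQL row per dynamic column
CREATE TABLE data_compact (
    key text,
    column1 text,     -- dynamic column name
    value blob,       -- dynamic column value
    PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE;

-- option 2: non-compact table with the dynamic columns in a map
CREATE TABLE data_cql (
    key text PRIMARY KEY,
    static_col text,
    dynamic_cols map<text, blob>
);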
>>
>> In our case, we have observed that approach 2 causes 3 times slower
>> performance in Range scan queries used by Spark. This is not acceptable.
>> Cassandra 3 has an optimized storage engine but we are not comfortable moving
>> to 3.x in production.
>>
>> Moreover, data migration to new table using Spark takes hours.
>>
>> Any suggestions for the two issues?
>>
>>
>> Thanks
>> Anuj
>>
>>
>> Sent from Yahoo Mail on Android
>> 
>>
>>
>


Re: Is it possible to achieve "sticky" request routing?

2016-04-05 Thread Jim Ancona
Jon and Steve:

I don't understand your point. The TokenAwareLoadBalancer identifies the
nodes in the cluster that own the data for a particular token and routes
requests to one of them. As I understand it, the OP wants to send requests
for a particular token to the same node every time (assuming it's
available). How does that fail in a large cluster?

Jim

On Tue, Apr 5, 2016 at 4:31 PM, Jonathan Haddad  wrote:

> Yep - Steve hit the nail on the head.  The odds of hitting the right
> server with "sticky routing" go down as your cluster size increases.  You
> end up adding extra network hops instead of using token aware routing.
>
> Unless you're trying to do a coordinator tier (and you're not, according
> to your original post), this is a pretty bad idea and I'd advise you to
> push back on that requirement.
>
> On Tue, Apr 5, 2016 at 12:47 PM Steve Robenalt 
> wrote:
>
>> Aside from Jon's "why" question, I would point out that this only really
>> works because you are running a 3 node cluster with RF=3. If your cluster
>> is going to grow, you can't guarantee that any one server would have all
>> records. I'd be pretty hesitant to put an invisible constraint like that on
>> a cluster unless you're pretty sure it'll only ever be 3 nodes.
>>
>> On Tue, Apr 5, 2016 at 9:34 AM, Jonathan Haddad 
>> wrote:
>>
>>> Why is this a requirement?  Honestly I don't know why you would do this.
>>>
>>>
>>> On Sat, Apr 2, 2016 at 8:06 PM Mukil Kesavan 
>>> wrote:
>>>
 Hello,

 We currently have 3 Cassandra servers running in a single datacenter
 with a replication factor of 3 for our keyspace. We also use the
 SimpleSnitch with DynamicSnitching enabled by default. Our load balancing
 policy is TokenAwareLoadBalancingPolicy with RoundRobinPolicy as the child.
 This overall configuration results in our client requests spreading equally
 across our 3 servers.

 However, we have a new requirement where we need to restrict a client's
 requests to a single server and only go to the other servers on failure of
 the previous server. This particular use case does not have high request
 traffic.

 Looking at the documentation the options we have seem to be:

 1. Play with the snitching (e.g. place each server into its own DC or
 Rack) to ensure that requests always go to one server and failover to the
 others if required. I understand that this may also affect replica
 placement and we may need to run nodetool repair. So this is not our most
 preferred option.

 2. Write a new load balancing policy that also uses the
 HostStateListener for tracking host up and down messages, that essentially
 accomplishes "sticky" request routing with failover to other nodes.

 Is option 2 the only clean way of accomplishing our requirement?

 Thanks,
 Micky

>>>
>>
>>
>> --
>> Steve Robenalt
>> Software Architect
>> sroben...@highwire.org 
>> (office/cell): 916-505-1785
>>
>> HighWire Press, Inc.
>> 425 Broadway St, Redwood City, CA 94063
>> www.highwire.org
>>
>> Technology for Scholarly Communication
>>
>


Re: best ORM for cassandra

2016-02-10 Thread Jim Ancona
Recent versions of the Datastax Java Driver include an object mapping API
that might work for you:
http://docs.datastax.com/en/latest-java-driver/java-driver/reference/objectMappingApi.html

Jim

On Wed, Feb 10, 2016 at 4:29 AM, Nirmallya Mukherjee 
wrote:

> I have heard of that but I like to reduce layers & components in my
> architecture. If my DAO can directly use the C* driver then I believe I am
> better off. I am sure you already know there are many benefits of the driver
> - auto discovery, seamless failover, retry, various balancing policies etc
> etc.
>
> No doubts technically both are possible but as I mentioned before I prefer
> to have a slim DAO.
>
> Thanks,
> Nirmallya
>
>
> On Wednesday, 10 February 2016 2:41 PM, Karthik Prasad Manchala <
> karthikp.manch...@impetus.co.in> wrote:
>
>
> Hi Nirmallya,
>
> You can try using Kundera (https://github.com/impetus-opensource/Kundera),
> a JPA 2.1 compliant Object-Datastore Mapping Library for major NoSql
> datastores. It also supports Polyglot persistence out-of-the-box.
>
> Quick start ->
> https://github.com/impetus-opensource/Kundera/wiki/Getting-Started-in-5-minutes
>
> Thanks and regards,
> Karthik.
> --
> *From:* Nirmallya Mukherjee 
> *Sent:* 10 February 2016 12:47
> *To:* user@cassandra.apache.org
> *Subject:* Re: best ORM for cassandra
>
> You are probably better off using the driver as you do not need an ORM.
> Cassandra is not a relational DB. You can find the documentation here
> http://docs.datastax.com/en/developer/driver-matrix/doc/common/driverMatrix.html.
> Pick the version that is applicable to your project.
>
> Thanks,
> Nirmallya
>
>
> On Wednesday, 10 February 2016 12:31 PM, Raman Gugnani <
> raman.gugn...@snapdeal.com> wrote:
>
>
> Hi
>
> I am new to cassandra.
>
> I am developing an application with cassandra.
> Which is the best ORM for cassandra?
>
> --
> Thanks & Regards
>
> Raman Gugnani
> *Senior Software Engineer | CaMS*
> M: +91 8588892293 | T: 0124-660 | EXT: 14255
> ASF Centre A | 2nd Floor | CA-2130 | Udyog Vihar Phase IV |
> Gurgaon | Haryana | India
>
>
>
>
> --
>
>
>
>
>
>
>
>
>


Re: Writing a large blob returns WriteTimeoutException

2016-02-08 Thread Jim Ancona
The "if not exists" in your INSERT means that you are incurring a
performance hit by using Paxos. Do you need that? Have you tried your test
without  it?
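For comparison, a sketch of the two insert forms (table definition assumed);
the conditional one pays the extra Paxos round-trips, the plain one does not:

-- hypothetical blob table
CREATE TABLE blobs (
    id uuid PRIMARY KEY,
    payload blob
);

-- lightweight transaction: goes through Paxos
INSERT INTO blobs (id, payload) VALUES (uuid(), textAsBlob('...')) IF NOT EXISTS;

-- plain write: no Paxos
INSERT INTO blobs (id, payload) VALUES (uuid(), textAsBlob('...'));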

Jim


Re: Cassandra Connection Pooling

2016-01-28 Thread Jim Ancona
It's typically handled by your client (e.g.
https://docs.datastax.com/en/latest-java-driver/index.html) along with
retries, timeouts and all the other things you would put in your datasource
config for a SQL database in JBoss.


On Thu, Jan 28, 2016 at 5:31 PM, KAMM, BILL  wrote:

> Hi, I’m looking for some good info on connection pooling, using JBoss.  Is
> this something that needs to be configured within JBoss, or is it handled
> directly by the Cassandra classes themselves?  Thanks.
>
>
>
> Bill
>
>
>


Re: Data Modeling: Partition Size and Query Efficiency

2016-01-06 Thread Jim Ancona
On Tue, Jan 5, 2016 at 5:52 PM, Jonathan Haddad  wrote:

> You could keep a "num_buckets" value associated with the client's account,
> which can be adjusted accordingly as usage increases.
>

Yes, but the adjustment problem is tricky when there are multiple
concurrent writers. What happens when you change the number of buckets?
Does existing data have to be re-written into new buckets? If so, how do
you make sure that's only done once for each bucket size increase? Or
perhaps I'm misunderstanding your suggestion?
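If one did go down that road, the bucket-count bump itself could at least be
guarded with an LWT so that only one writer wins (a sketch with an assumed
schema; it does not solve re-writing the existing data):

CREATE TABLE account_meta (
    customer_id text PRIMARY KEY,
    num_buckets int
);

-- only one concurrent writer's increase is applied; the losers re-read and retry
UPDATE account_meta SET num_buckets = 2
WHERE customer_id = 'cust-123' IF num_buckets = 1;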

Jim


> On Tue, Jan 5, 2016 at 2:17 PM Jim Ancona  wrote:
>
>> On Tue, Jan 5, 2016 at 4:56 PM, Clint Martin <
>> clintlmar...@coolfiretechnologies.com> wrote:
>>
>>> What sort of data is your clustering key composed of? That might help
>>> some in determining a way to achieve what you're looking for.
>>>
>> Just a UUID that acts as an object identifier.
>>
>>>
>>> Clint
>>> On Jan 5, 2016 2:28 PM, "Jim Ancona"  wrote:
>>>
>>>> Hi Nate,
>>>>
>>>> Yes, I've been thinking about treating customers as either small or
>>>> big, where "small" ones have a single partition and big ones have 50 (or
>>>> whatever number I need to keep sizes reasonable). There's still the problem
>>>> of how to handle a small customer who becomes too big, but that will happen
>>>> much less frequently than a customer filling a partition.
>>>>
>>>> Jim
>>>>
>>>> On Tue, Jan 5, 2016 at 12:21 PM, Nate McCall 
>>>> wrote:
>>>>
>>>>>
>>>>>> In this case, 99% of my data could fit in a single 50 MB partition.
>>>>>> But if I use the standard approach, I have to split my partitions into 50
>>>>>> pieces to accommodate the largest data. That means that to query the 700
>>>>>> rows for my median case, I have to read 50 partitions instead of one.
>>>>>>
>>>>>> If you try to deal with this by starting a new partition when an old
>>>>>> one fills up, you have a nasty distributed consensus problem, along with
>>>>>> read-before-write. Cassandra LWT wasn't available the last time I dealt
>>>>>> with this, but might help with the consensus part today. But there are
>>>>>> still some nasty corner cases.
>>>>>>
>>>>>> I have some thoughts on other ways to solve this, but they all have
>>>>>> drawbacks. So I thought I'd ask here and hope that someone has a better
>>>>>> approach.
>>>>>>
>>>>>>
>>>>> Hi Jim - good to see you around again.
>>>>>
>>>>> If you can segment this upstream by customer/account/whatever,
>>>>> handling the outliers as an entirely different code path (potentially
>>>>> different cluster as the workload will be quite different at that point 
>>>>> and
>>>>> have different tuning requirements) would be your best bet. Then a
>>>>> read-before-write makes sense given it is happening on such a small number
>>>>> of API queries.
>>>>>
>>>>>
>>>>> --
>>>>> -
>>>>> Nate McCall
>>>>> Austin, TX
>>>>> @zznate
>>>>>
>>>>> Co-Founder & Sr. Technical Consultant
>>>>> Apache Cassandra Consulting
>>>>> http://www.thelastpickle.com
>>>>>
>>>>
>>>>


Re: Data Modeling: Partition Size and Query Efficiency

2016-01-05 Thread Jim Ancona
On Tue, Jan 5, 2016 at 4:56 PM, Clint Martin <
clintlmar...@coolfiretechnologies.com> wrote:

> What sort of data is your clustering key composed of? That might help some
> in determining a way to achieve what you're looking for.
>
Just a UUID that acts as an object identifier.

>
> Clint
> On Jan 5, 2016 2:28 PM, "Jim Ancona"  wrote:
>
>> Hi Nate,
>>
>> Yes, I've been thinking about treating customers as either small or big,
>> where "small" ones have a single partition and big ones have 50 (or
>> whatever number I need to keep sizes reasonable). There's still the problem
>> of how to handle a small customer who becomes too big, but that will happen
>> much less frequently than a customer filling a partition.
>>
>> Jim
>>
>> On Tue, Jan 5, 2016 at 12:21 PM, Nate McCall 
>> wrote:
>>
>>>
>>>> In this case, 99% of my data could fit in a single 50 MB partition. But
>>>> if I use the standard approach, I have to split my partitions into 50
>>>> pieces to accommodate the largest data. That means that to query the 700
>>>> rows for my median case, I have to read 50 partitions instead of one.
>>>>
>>>> If you try to deal with this by starting a new partition when an old
>>>> one fills up, you have a nasty distributed consensus problem, along with
>>>> read-before-write. Cassandra LWT wasn't available the last time I dealt
>>>> with this, but might help with the consensus part today. But there are
>>>> still some nasty corner cases.
>>>>
>>>> I have some thoughts on other ways to solve this, but they all have
>>>> drawbacks. So I thought I'd ask here and hope that someone has a better
>>>> approach.
>>>>
>>>>
>>> Hi Jim - good to see you around again.
>>>
>>> If you can segment this upstream by customer/account/whatever, handling
>>> the outliers as an entirely different code path (potentially different
>>> cluster as the workload will be quite different at that point and have
>>> different tuning requirements) would be your best bet. Then a
>>> read-before-write makes sense given it is happening on such a small number
>>> of API queries.
>>>
>>>
>>> --
>>> -
>>> Nate McCall
>>> Austin, TX
>>> @zznate
>>>
>>> Co-Founder & Sr. Technical Consultant
>>> Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>>
>>
>>


Re: Data Modeling: Partition Size and Query Efficiency

2016-01-05 Thread Jim Ancona
Hi Nate,

Yes, I've been thinking about treating customers as either small or big,
where "small" ones have a single partition and big ones have 50 (or
whatever number I need to keep sizes reasonable). There's still the problem
of how to handle a small customer who becomes too big, but that will happen
much less frequently than a customer filling a partition.

Jim

On Tue, Jan 5, 2016 at 12:21 PM, Nate McCall  wrote:

>
>> In this case, 99% of my data could fit in a single 50 MB partition. But
>> if I use the standard approach, I have to split my partitions into 50
>> pieces to accommodate the largest data. That means that to query the 700
>> rows for my median case, I have to read 50 partitions instead of one.
>>
>> If you try to deal with this by starting a new partition when an old one
>> fills up, you have a nasty distributed consensus problem, along with
>> read-before-write. Cassandra LWT wasn't available the last time I dealt
>> with this, but might help with the consensus part today. But there are
>> still some nasty corner cases.
>>
>> I have some thoughts on other ways to solve this, but they all have
>> drawbacks. So I thought I'd ask here and hope that someone has a better
>> approach.
>>
>>
> Hi Jim - good to see you around again.
>
> If you can segment this upstream by customer/account/whatever, handling
> the outliers as an entirely different code path (potentially different
> cluster as the workload will be quite different at that point and have
> different tuning requirements) would be your best bet. Then a
> read-before-write makes sense given it is happening on such a small number
> of API queries.
>
>
> --
> -
> Nate McCall
> Austin, TX
> @zznate
>
> Co-Founder & Sr. Technical Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>


Re: Data Modeling: Partition Size and Query Efficiency

2016-01-05 Thread Jim Ancona
Hi Jack,

Thanks for your response. My answers inline...

On Tue, Jan 5, 2016 at 11:52 AM, Jack Krupansky 
wrote:

> Jim, I don't quite get why you think you would need to query 50 partitions
> to return merely hundreds or thousands of rows. Please elaborate. I mean,
> sure, for that extreme 100th percentile, yes, you would query a lot of
> partitions, but for the 90th percentile it would be just one. Even the 99th
> percentile would just be one or at most a few.
>
Exactly, but, as I mentioned in my email, the normal way of segmenting
large partitions is to use some deterministic bucketing mechanism to bucket
rows into different partitions. If you know of a way to make the number of
buckets vary with the number of rows, I'd love to hear about it.

> It would help if you could elaborate on the actual access pattern - how
> rapidly is the data coming in and from where. You can do just a little more
> work at the app level and use Cassandra more effectively.
>
 The write pattern is batches of inserts/updates mixed with some single row
inserts/updates. Not surprisingly, the customers with more data also do
more writes.


> As always, we look to queries to determine what the Cassandra data model
> should look like, so elaborate what your app needs to see. What exactly is
> the app querying for - a single key, a slice, or... what?
>
The use case here is sequential access to some or all or a customer's rows
in order to filter based on other criteria. The order doesn't matter much,
as long as it's well-defined.


> And, as always, you commonly need to store the data in multiple query
> tables so that the data model matches the desired query pattern.
>
> Are the row sizes very dynamic, with some extremely large, or is it just
> the number of rows that is making size an issue?
>
No, row sizes don't vary much, just the number of rows per customer.


>
> Maybe let the app keep a small cache of active partitions and their
> current size so that the app can decide when to switch to a new bucket. Do
> a couple of extra queries when a key is not in that cache to determine what
> the partition size and count to initialize the cache entry for a key. If
> necessary, keep a separate table that tracks the partition size or maybe
> just the (rough) row count to use to determine when a new partition is
> needed.
>

I've done almost exactly what you suggest in a previous application. The
issue is that the cache of active partitions needs to be consistent for
multiple writers and the transition from one bucket to the next really
wants to be transactional. Hence my reference to a "nasty distributed
consensus problem" and Clint's reference to an "anti-pattern". I'd like to
avoid it if I can.

Jim


>
> -- Jack Krupansky
>
> On Tue, Jan 5, 2016 at 11:07 AM, Jim Ancona  wrote:
>
>> Thanks for responding!
>>
>> My natural partition key is a customer id. Our customers have widely
>> varying amounts of data. Since the vast majority of them have data that's
>> small enough to fit in a single partition, I'd like to avoid imposing
>> unnecessary overhead on the 99% just to avoid issues with the largest 1%.
>>
>> The approach to querying across multiple partitions you describe is
>> pretty much what I have in mind. The trick is to avoid having to query 50
>> partitions to return a few hundred or thousand rows.
>>
>> I agree that sequentially filling partitions is something to avoid.
>> That's why I'm hoping someone can suggest a good alternative.
>>
>> Jim
>>
>>
>>
>>
>>
>> On Mon, Jan 4, 2016 at 8:07 PM, Clint Martin <
>> clintlmar...@coolfiretechnologies.com> wrote:
>>
>>> You should endeavor to use a repeatable method of segmenting your data.
>>> Swapping partitions every time you "fill one" seems like an anti pattern to
>>> me. but I suppose it really depends on what your primary key is. Can you
>>> share some more information on this?
>>>
>>> In the past I have utilized the consistent hash method you described
>>> (add an artificial row key segment by modulo some part of the clustering
>>> key by a fixed position count) combined with a lazy evaluation cursor.
>>>
>>> The lazy evaluation cursor essentially is set up to query X number of
>>> partitions simultaneously, but to execute those queries only as needed to
>>> fill the page size. To perform paging you have to know the last primary key
>>> that was returned so you can use that to limit the next iteration.
>>>
>>> You can trade latency for additional work load by controlling the number
>>> of conc

Re: Data Modeling: Partition Size and Query Efficiency

2016-01-05 Thread Jim Ancona
Thanks for responding!

My natural partition key is a customer id. Our customers have widely
varying amounts of data. Since the vast majority of them have data that's
small enough to fit in a single partition, I'd like to avoid imposing
unnecessary overhead on the 99% just to avoid issues with the largest 1%.

The approach to querying across multiple partitions you describe is pretty
much what I have in mind. The trick is to avoid having to query 50
partitions to return a few hundred or thousand rows.

I agree that sequentially filling partitions is something to avoid. That's
why I'm hoping someone can suggest a good alternative.

Jim





On Mon, Jan 4, 2016 at 8:07 PM, Clint Martin <
clintlmar...@coolfiretechnologies.com> wrote:

> You should endeavor to use a repeatable method of segmenting your data.
> Swapping partitions every time you "fill one" seems like an anti pattern to
> me. but I suppose it really depends on what your primary key is. Can you
> share some more information on this?
>
> In the past I have utilized the consistent hash method you described (add
> an artificial row key segment by modulo some part of the clustering key by
> a fixed position count) combined with a lazy evaluation cursor.
>
> The lazy evaluation cursor essentially is set up to query X number of
> partitions simultaneously, but to execute those queries only as needed to
> fill the page size. To perform paging you have to know the last primary key
> that was returned so you can use that to limit the next iteration.
>
> You can trade latency for additional work load by controlling the number
> of concurrent executions you do as the iterating occurs. Or you can
> minimize the work on your cluster by querying each partition one at a time.
>
> Unfortunately due to the artificial partition key segment you cannot
> iterate or page in any particular order...(at least across partitions)
> Unless your hash function can also provide you some ordering guarantees.
>
> It all just depends on your requirements.
>
> Clint
> On Jan 4, 2016 10:13 AM, "Jim Ancona"  wrote:
>
>> A problem that I have run into repeatedly when doing schema design is how
>> to control partition size while still allowing for efficient multi-row
>> queries.
>>
>> We want to limit partition size to some number between 10 and 100
>> megabytes to avoid operational issues. The standard way to do that is to
>> figure out the maximum number of rows that your "natural partition key"
>> will ever need to support and then add an additional artificial partition
>> key that segments the rows sufficiently to keep the partition size
>> under the maximum. In the case of time series data, this is often done by
>> bucketing by time period, i.e. creating a new partition every minute, hour
>> or day. For non-time series data, this is done by something like
>> Hash(clustering-key) mod desired-number-of-partitions.
>>
>> In my case, multi-row queries to support a REST API typically return a
>> page of results, where the page size might be anywhere from a few dozen up
>> to thousands. For query efficiency I want the average number of rows per
>> partition to be large enough that a query can be satisfied by reading a
>> small number of partitions--ideally one.
>>
>> So I want to simultaneously limit the maximum number of rows per
>> partition and yet maintain a large enough average number of rows per
>> partition to make my queries efficient. But with my data the ratio between
>> maximum and average can be very large (up to four orders of magnitude).
>>
>> Here is an example:
>>
>>
>>                    Rows per Partition    Partition Size
>> Mode                                1              1 KB
>> Median                            500            500 KB
>> 90th percentile                 5,000              5 MB
>> 99th percentile                50,000             50 MB
>> Maximum                     2,500,000            2.5 GB
>>
>> In this case, 99% of my data could fit in a single 50 MB partition. But
>> if I use the standard approach, I have to split my partitions into 50
>> pieces to accommodate the largest data. That means that to query the 700
>> rows for my median case, I have to read 50 partitions instead of one.
>>
>> If you try to deal with this by starting a new partition when an old one
>> fills up, you have a nasty distributed consensus problem, along with
>> read-before-write. Cassandra LWT wasn't available the last time I dealt
>> with this, but might help with the consensus part today. But there are
>> still some nasty corner cases.
>>
>> I have some thoughts on other ways to solve this, but they all have
>> drawbacks. So I thought I'd ask here and hope that someone has a better
>> approach.
>>
>> Thanks in advance,
>>
>> Jim
>>
>>


Data Modeling: Partition Size and Query Efficiency

2016-01-04 Thread Jim Ancona
A problem that I have run into repeatedly when doing schema design is how
to control partition size while still allowing for efficient multi-row
queries.

We want to limit partition size to some number between 10 and 100 megabytes
to avoid operational issues. The standard way to do that is to figure out
the maximum number of rows that your "natural partition key" will ever need
to support and then add an additional artificial partition key that
segments the rows sufficiently to keep the partition size under the
maximum. In the case of time series data, this is often done by bucketing
by time period, i.e. creating a new partition every minute, hour or day.
For non-time series data, this is done by something like Hash(clustering-key) mod
desired-number-of-partitions.
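A minimal sketch of that standard layout for the non-time-series case (names
are illustrative; the application computes bucket = hash(object_id) mod N on
both write and read):

CREATE TABLE items_by_customer (
    customer_id text,
    bucket int,          -- hash(object_id) mod N, computed by the application
    object_id uuid,
    payload text,
    PRIMARY KEY ((customer_id, bucket), object_id)
);

-- reading all rows for one customer then means hitting all N buckets, e.g. for N = 50:
SELECT * FROM items_by_customer
WHERE customer_id = 'cust-123' AND bucket IN (0, 1, 2 /* ... through 49 */);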

In my case, multi-row queries to support a REST API typically return a page
of results, where the page size might be anywhere from a few dozen up to
thousands. For query efficiency I want the average number of rows per
partition to be large enough that a query can be satisfied by reading a
small number of partitions--ideally one.

So I want to simultaneously limit the maximum number of rows per partition
and yet maintain a large enough average number of rows per partition to
make my queries efficient. But with my data the ratio between maximum and
average can be very large (up to four orders of magnitude).

Here is an example:


                   Rows per Partition    Partition Size
Mode                                1              1 KB
Median                            500            500 KB
90th percentile                 5,000              5 MB
99th percentile                50,000             50 MB
Maximum                     2,500,000            2.5 GB

In this case, 99% of my data could fit in a single 50 MB partition. But if
I use the standard approach, I have to split my partitions into 50 pieces
to accommodate the largest data. That means that to query the 700 rows for
my median case, I have to read 50 partitions instead of one.

If you try to deal with this by starting a new partition when an old one
fills up, you have a nasty distributed consensus problem, along with
read-before-write. Cassandra LWT wasn't available the last time I dealt
with this, but might help with the consensus part today. But there are
still some nasty corner cases.

I have some thoughts on other ways to solve this, but they all have
drawbacks. So I thought I'd ask here and hope that someone has a better
approach.

Thanks in advance,

Jim


Re: Replicating Data Between Separate Data Centres

2015-12-14 Thread Jim Ancona
Could you define what you mean by Causal Consistency and explain why you
think you won't have that when using LOCAL_QUORUM? I ask because LOCAL_QUORUM
and multiple data centers are the way many of us handle DR, so I'd like to
understand why it doesn't work for you.

I'm afraid I don't understand your scenario. Are you planning on building
out a new recovery DC *after* the primary has failed, or keeping two DCs in
sync so that you can switch over after a failure?

Jim

On Mon, Dec 14, 2015 at 2:59 PM, Philip Persad 
wrote:

> Hi,
>
> I'm currently looking at Cassandra in the context of Disaster Recovery.  I
> have 2 Data Centres, one is the Primary and the other acts as a Standby.
> There is a Cassandra cluster in each Data Centre.  For the time being I'm
> running Cassandra 2.0.9.  Unfortunately, due to the nature of my data, the
> consistency levels that I would get out of LOCAL_QUORUM writes followed by
> asynchronous replication to the secondary data centre are insufficient.  In
> the event of a failure, it is acceptable to lose some data, but I need
> Causal Consistency to be maintained. Since I don't have the luxury of
> performing nodetool repairs after Godzilla steps on my primary data centre,
> I use more strictly ordered means of transporting events between the Data
> Centres (Kafka for anyone who cares about that detail).
>
> What I'm not sure about, is how to go about copying all the data in one
> Cassandra cluster to a new cluster, either to bring up a new Standby Data
> Centre or as part of failing back to the Primary after I pick up the
> pieces.  I'm thinking that I should either:
>
> 1. Do a snapshot (
> https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_backup_takes_snapshot_t.html),
> and then restore that snapshot on my new cluster (
> https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_snapshot_restore_new_cluster.html
> )
>
> 2. Join the new data centre to the existing cluster (
> https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_add_dc_to_cluster_t.html).
> Then separate the two data centres into two individual clusters by doing .
> . . something???
>
> Does anyone have any advice about how to tackle this problem?
>
> Many thanks,
>
> -Phil
>


Re: Cassandra users survey

2015-10-01 Thread Jim Ancona
Hi Jonathan,

The survey asks about "your application." We have multiple applications
using Cassandra. Are you looking for information about each application
separately, or the sum of all of them?

Jim

On Wed, Sep 30, 2015 at 2:18 PM, Jonathan Ellis  wrote:

> With 3.0 approaching, the Apache Cassandra team would appreciate your
> feedback as we work on the project roadmap for future releases.
>
> I've put together a brief survey here:
> https://docs.google.com/forms/d/1TEG0umQAmiH3RXjNYdzNrKoBCl1x7zurMroMzAFeG2Y/viewform?usp=send_form
>
> Please take a few minutes to fill it out!
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder, http://www.datastax.com
> @spyced
>
>


Re: How to store unique visitors in cassandra

2015-04-01 Thread Jim Ancona
Very interesting. I had saved your email from three years ago in hopes of
an elegant answer. Thanks for sharing!

Jim

On Tue, Mar 31, 2015 at 8:16 AM, Alain RODRIGUEZ  wrote:

> People keep asking me if we finally found a solution (even if this is 3+
> years old) so I will just update this thread with our findings.
>
> We finally achieved doing this thanks to our bigdata and reporting stacks
> by storing blobs corresponding to HLL (HyperLogLog) structures. HLL is an
> algorithm used by Google, twitter and many more to solve count-distinct
> problems. Structures built through this algorithm can be "summed" and give
> a good approximation of the UV number.
>
> Precision you will reach depends on the size of structure you chose
> (predictable precision). You can reach fairly acceptable approximation with
> small data structures.
>
> So we basically store a HLL per hour and just "sum" HLL for all the hours
> between 2 ranges (you can do it at day level or any other level depending
> on your needs).
>
> Hope this will help some of you, we finally had this (good) idea after
> more than 3 years. Actually we use HLL for a long time but the idea of
> storing HLL structures instead of counts allow us to request on custom
> ranges (at the price of more intelligence on the reporting stack that must
> read and smartly sum HLLs stored as blobs). We are happy with it since.
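
To make the merge ("sum") property concrete, here is a deliberately simplified
HyperLogLog sketch in Java. It is a toy (fixed MD5 hashing, no small- or
large-range bias corrections); a real deployment would store the serialized
registers of a library implementation such as stream-lib or DataSketches as
the blob column value.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

/** Toy HyperLogLog: enough to show why per-hour sketches can be "summed" over a range. */
final class ToyHll {
    private final int p;        // 2^p registers
    private final byte[] regs;

    ToyHll(int p) { this.p = p; this.regs = new byte[1 << p]; }

    void offer(String item) {
        long h = hash64(item);
        int idx = (int) (h >>> (64 - p));                  // top p bits choose a register
        int rank = Long.numberOfLeadingZeros(h << p) + 1;  // position of first 1-bit in the rest
        if (rank > regs[idx]) regs[idx] = (byte) rank;
    }

    /** Merging two sketches is a per-register max, so hourly blobs combine into any range. */
    ToyHll merge(ToyHll other) {
        ToyHll out = new ToyHll(p);
        for (int i = 0; i < regs.length; i++) {
            out.regs[i] = (byte) Math.max(regs[i], other.regs[i]);
        }
        return out;
    }

    /** Raw HLL estimate; real libraries add corrections for very small and very large counts. */
    double estimate() {
        int m = regs.length;
        double sum = 0;
        for (byte r : regs) sum += Math.pow(2, -r);
        double alpha = 0.7213 / (1 + 1.079 / m);
        return alpha * m * m / sum;
    }

    private static long hash64(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL);
            return h;
        } catch (Exception e) { throw new IllegalStateException(e); }
    }
}

With one such sketch per (stream, hour) row stored as a blob, the unique-visitor
count for an arbitrary range is just the estimate() of the merge of the blobs
in that range.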
>
> C*heers,
>
> Alain
>
> 2012-01-19 22:21 GMT+01:00 Milind Parikh :
>
>> You might want to look at the code in countandra.org; regardless of
>> whether you use it. It use a model of dynamic composite keys (although
>> static composite keys would have worked as well). For the actual query,only
>> one row is hit. This of course only works bc the data model is attuned for
>> the query.
>>
>> Regards
>> Milind
>>
>> /***
>> sent from my android...please pardon occasional typos as I respond @ the
>> speed of thought
>> /
>>
>> On Jan 19, 2012 1:31 AM, "Alain RODRIGUEZ"  wrote:
>>
>> Hi thanks for your answer but I don't want to add more layer on top of
>> Cassandra. I also have done all of my application without Countandra and I
>> would like to continue this way.
>>
>> Furthermore there is a Cassandra modeling problem that I would like to
>> solve, and not just hide.
>>
>> Alain
>>
>>
>>
>> 2012/1/18 Lucas de Souza Santos 
>> >
>> > Why not http://www.countandra.org/
>> >
>> >
>> > ...
>>
>>
>


Re: Way to Cassandra File System

2015-03-24 Thread Jim Ancona
There's also Brisk (https://github.com/riptano/brisk), the original open
source version of CFS before Riptano/Datastax made it proprietary. It's
been moribund for years, but there does appear to be a fork with commits up
to 2013:
https://github.com/milliondreams/brisk

Jim

On Tue, Mar 24, 2015 at 9:55 AM, Tiwari, Tarun 
wrote:

>  Cool I think that helps.
>
>
>
> Regards
>
> Tarun
>
>
>
> *From:* Jonathan Lacefield [mailto:jlacefi...@datastax.com]
> *Sent:* Tuesday, March 24, 2015 6:39 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Way to Cassandra File System
>
>
>
> Hello,
>
>   CFS is a DataStax proprietary implementation of the Hadoop File System
> interface/abstract base class, sorry don't remember which of the top of my
> head.  You could create your own implementation if you do not want to use
> DataStax's CFS.  Or you could purchase DataStax Enterprise.
>
>   Hope this provides clarity for you.
>
> Thanks,
>
> Jonathan
>
>
>
> Jonathan Lacefield
>
> Director - Consulting, Americas | (404) 822 3487 | jlacefi...@datastax.com
>
>
>  
>
> 
>
>
>
>
> On Tue, Mar 24, 2015 at 9:01 AM, Tiwari, Tarun 
> wrote:
>
> Hi All,
>
>
>
> In DSE they claim to have Cassandra File System in place of Hadoop which
> makes it real fault tolerant.
>
> Is there a way to use Cassandra file system CFS in place of HDFS if I
> don’t have DSE?
>
>
>
> Regards,
>
> *Tarun Tiwari* | Workforce Analytics-ETL | *Kronos India*
>
> M: +91 9540 28 27 77 | Tel: +91 120 4015200
>
> Kronos | Time & Attendance • Scheduling • Absence Management • HR &
> Payroll • Hiring • Labor Analytics
>
> *Join Kronos on: **kronos.com* * | **Facebook*
> * | **Twitter*
> * | **LinkedIn*
> * | **YouTube*
> 
>
>
>
>
>


Re: What % of cassandra developers are employed by Datastax?

2014-05-23 Thread Jim Ancona
I took a look at the Ohloh stats here:
https://www.ohloh.net/p/cassandra/contributors/summary

Note that committers are not the same as contributors. Dozens of people
contribute patches that are committed to the codebase without being
committers.

Over the last year, the top four contributors (Jonathan Ellis, Sylvain
Lebresne, Brandon Williams and Aleksey Yeschenko) were all Datastax
employees. Together they were responsible for 70% of the commits over that
time period. FWIW, the numbers for the last 30 days show those same top
four accounting for 52%.

I agree with Dave that we should be careful about what conclusions we draw
from this kind of data.

Jim



On Sat, May 17, 2014 at 11:45 AM, Jack Krupansky wrote:

>   I would note that the original question was about “developers”, not
> “committers” per se. I sort of assumed that the question implied the
> latter, but that’s not necessarily true. One can “develop” and optionally
> “contribute” code without being a committer, per se. There are probably
> plenty of users of Cassandra out there who do their own enhancement of
> Cassandra and don’t necessarily want or have the energy to contribute back
> their enhancements, or intend to and haven’t gotten around to it yet. And
> there are also “contributors” who have “developed” and “contributed”
> patches (ANYBODY can do that, not just “committers”) but are not officially
> anointed as “committers”.
>
> So, who knows how many contributors or “developers” are out there beyond
> the known committers. The important thing is that Cassandra is open source
> and licensed so that any enterprise can use it and readily and freely debug
> and enhance it without any sort of mandatory requirement that they be
> completely dependent on some particular vendor.
>
> There’s actually a wiki detailing some of the other vendors, beyond
> DataStax, who provide consulting (which may include actual Cassandra
> enhancement in some cases) and support for Cassandra:
> http://wiki.apache.org/cassandra/ThirdPartySupport
>
> (For disclosure, I am a part-time contractor for DataStax, but now on the
> sales side, although by background is as a developer.)
>
> -- Jack Krupansky
>
>  *From:* Dave Brosius 
> *Sent:* Saturday, May 17, 2014 10:48 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: What % of cassandra developers are employed by Datastax?
>
> The question assumes that it's likely that datastax employees become
> committers.
>
> Actually, it's more likely that committers become datastax employees.
>
> So this underlying tone that datastax only really 'wants' datastax
> employees to be cassandra committers, is really misleading.
>
> Why wouldn't a company want to hire people who have shown a desire and
> aptitude to work on products that they care about? It's just rational. And
> damn genius, actually.
>
> I'm sure they'd be happy to have an influx of non-datastax committers.
> patches welcome.
>
> dave
>
>
> On 05/17/2014 08:28 AM, Peter Lin wrote:
>
>
> if you look at the new committers since 2012 they are mostly datastax
>
>
> On Fri, May 16, 2014 at 9:14 PM, Kevin Burton  wrote:
>
>> so 30%… according to that data.
>>
>>
>> On Thu, May 15, 2014 at 4:59 PM, Michael Shuler 
>> wrote:
>>
>>> On 05/14/2014 03:39 PM, Kevin Burton wrote:
>>>
 I'm curious what % of cassandra developers are employed by Datastax?

>>>
>>> http://wiki.apache.org/cassandra/Committers
>>>
>>> --
>>> Kind regards,
>>> Michael
>>>
>>
>>
>>
>> --
>>  Founder/CEO Spinn3r.com
>> Location: *San Francisco, CA*
>> Skype: *burtonator*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ 
>> profile
>>  
>> War is peace. Freedom is slavery. Ignorance is strength. Corporations are
>> people.
>>
>
>
>
>


Re: is there a SSTAbleInput for Map/Reduce instead of ColumnFamily?

2013-09-06 Thread Jim Ancona
Unfortunately, Netflix doesn't seem to have released Aegisthus as open
source.

Jim


On Fri, Aug 30, 2013 at 1:44 PM, Jeremiah D Jordan <
jeremiah.jor...@gmail.com> wrote:

> FYI:
> http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html
>
> -Jeremiah
>
> On Aug 30, 2013, at 9:21 AM, "Hiller, Dean"  wrote:
>
> > is there a SSTableInput for Map/Reduce instead of ColumnFamily (which
> uses thrift)?
> >
> > We are not worried about repeated reads since we are idempotent but
> would rather have the direct speed (even if we had to read from a snapshot,
> it would be fine).
> >
> > (We would most likely run our M/R on 4 nodes of the 12 nodes we have
> since we have RF=3 right now).
> >
> > Thanks,
> > Dean
>
>


Re: vnodes ready for production ?

2013-06-19 Thread Jim Ancona
On Tue, Jun 18, 2013 at 4:04 AM, aaron morton  wrote:
>> Even more if we could automate some up-scale thanks to AWS alarms, It
>> would be awesome.
>
> I saw a demo for Priam (https://github.com/Netflix/Priam) doing that at
> netflix in March, not sure if it's public yet.
>
>> Are the vnodes feature and the tokens =>vnodes transition safe enough to
>> go live with vnodes ?
>
> There have been some issues, search the user list for shuffle and as always
> test.
>
>> Any advice about vnodes ?
>
> They are in use out there. It's a sizable change so it would be good idea to
> build a test system for running shuffle and testing your application. There
> have been some issues with repair and range scans (including hadoop
> integration.)

Also, in his presentation at last week's Summit, Eric Evans suggested
not using shuffle. As an alternative he suggested removing and
replacing nodes one-by-one.

Jim

>
> Cheers
>
> -
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 18/06/2013, at 7:04 PM, Alain RODRIGUEZ  wrote:
>
> Any insights on vnodes, one month after my original post ?
>
>
> 2013/5/16 Alain RODRIGUEZ 
>>
>> Hi,
>>
>> Adding vnodes is a big improvement to Cassandra, specifically because we
>> have a fluctuating load on our Cassandra depending on the week, and it is
>> quite annoying to add some nodes for one week or two, move tokens and then
>> having to remove them and then move tokens again. Even more if we could
>> automate some up-scale thanks to AWS alarms, It would be awesome.
>>
>> We don't use vnodes yet because Opscenter did not support this feature and
>> because we need to have a reliable production. Now Opscenter handles vnodes.
>>
>> Are the vnodes feature and the tokens =>vnodes transition safe enough to
>> go live with vnodes ?
>>
>> What would be the transition process ?
>>
>> Does someone auto-scale his Cassandra cluster ?
>>
>> Any advice about vnodes ?
>
>
>


nodetool cfstats and compression

2012-09-14 Thread Jim Ancona
Do the row size stats reported by 'nodetool cfstats' include the
effect of compression?

Thanks,

Jim


Re: What determines the memory that used by key cache??

2012-06-18 Thread Jim Ancona
On Mon, Jun 18, 2012 at 8:53 AM, mich.hph  wrote:

> Dear all!
> In my cluster, I found every key needs 192 bytes in the key cache. So I want
> to know what determines the memory used by the key cache, and how to calculate
> that value.
>

According to
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Calculate-memory-used-for-keycache-tp6170528p6170814.html
the formula is:

key cache memory use = <number of keys cached> * (<8 bytes for position, i.e.
value> + <key size in bytes> + <16 bytes for token (RP)> + <8 byte reference
for DecoratedKey> + <8 bytes for descriptor reference>)

which simplifies to

<number of keys cached> * (<key size in bytes> + 40)

Are your row keys 152 bytes?
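
For what it's worth, the arithmetic behind that question as a tiny sketch (the
40-byte fixed overhead comes from the formula above):

public class KeyCacheMath {
    public static void main(String[] args) {
        int overheadPerEntry = 8 + 16 + 8 + 8; // position + token (RP) + DecoratedKey ref + descriptor ref
        int observedPerEntry = 192;            // bytes per cached key reported above
        System.out.println("implied row key size: "
                + (observedPerEntry - overheadPerEntry) + " bytes"); // prints 152
    }
}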

Jim


> Thanks in advance.
>


Re: Cassandra error while processing message

2012-06-15 Thread Jim Ancona
It's hard to tell exactly what happened--are there other messages in your
client log before the "All host pools marked down"? Also, how many nodes
are there in your cluster? I suspect that the Thrift protocol error was
(incorrectly) retried by Hector, leading to the "All host pools marked
down", but without more info that's just a guess.

Jim

On Thu, Jun 14, 2012 at 4:48 AM, Tiwari, Dushyant <
dushyant.tiw...@morganstanley.com> wrote:

>  Hector : 1.0.0.1
>
> Cassandra: 1.0.3
>
> ** **
>
>
>
> *From:* Tiwari, Dushyant (ISGT)
> *Sent:* Thursday, June 14, 2012 2:16 PM
> *To:* user@cassandra.apache.org
> *Subject:* Cassandra error while processing message
>
> ** **
>
> Hi All,
>
> ** **
>
> Help needed on the following front. 
>
> ** **
>
> In my Cassandra node logs I can see the following error:
>
> ** **
>
> CustomTThreadPoolServer.java (line 201) Thrift error occurred during
> processing of message.
>
> org.apache.thrift.protocol.TProtocolException: Missing version in
> readMessageBegin, old client?
>
> at
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:213)
> 
>
> at
> org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2877)
> 
>
> at
> org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
> 
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> 
>
> at java.lang.Thread.run(Thread.java:619)
>
> ** **
>
> In Hector client :
>
> ** **
>
> Caused by: me.prettyprint.hector.api.exceptions.HectorException: All host
> pools marked down. Retry burden pushed out to client.
>
> [gsc][5/8454]   at
> me.prettyprint.cassandra.connection.HConnectionManager.getClientFromLBPolicy(HConnectionManager.java:343)
> 
>
> [gsc][5/8454]   at
> me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:225)
> 
>
> [gsc][5/8454]   at
> me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:131)
> 
>
> [gsc][5/8454]   at
> me.prettyprint.cassandra.service.KeyspaceServiceImpl.batchMutate(KeyspaceServiceImpl.java:102)
> 
>
> [gsc][5/8454]   at
> me.prettyprint.cassandra.service.KeyspaceServiceImpl.batchMutate(KeyspaceServiceImpl.java:108)
> 
>
> [gsc][5/8454]   at
> me.prettyprint.cassandra.model.MutatorImpl$3.doInKeyspace(MutatorImpl.java:248)
> 
>
> [gsc][5/8454]   at
> me.prettyprint.cassandra.model.MutatorImpl$3.doInKeyspace(MutatorImpl.java:245)
> 
>
> [gsc][5/8454]   at
> me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20)
> 
>
> [gsc][5/8454]   at
> me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:85)
> 
>
> [gsc][5/8454]   at
> me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:245)**
> **
>
> ** **
>
> ** **
>
> After some time a null pointer exception
>
> Caused by: java.lang.NullPointerException
>
> [gsc][5/8454]   at
> me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243)**
> **
>
> ** **
>
> Can someone please explain what is happening and how can I rectify it.
>
> ** **
>
> ** **
>
> Dushyant
>  --
>  NOTICE: Morgan Stanley is not acting as a municipal advisor and the
> opinions or views contained herein are not intended to be, and do not
> constitute, advice within the meaning of Section 975 of the Dodd-Frank Wall
> Street Reform and Consumer Protection Act. If you have received this
> communication in error, please destroy all electronic and paper copies and
> notify the sender immediately. Mistransmission is not intended to waive
> confidentiality or privilege. Morgan Stanley reserves the right, to the
> extent permitted under applicable law, to monitor electronic
> communications. This message is subject to terms available at the following
> link: http://www.morganstanley.com/disclaimers. If you cannot access
> these links, please notify us by reply message and we will send the
> contents to you. By messaging with Morgan Stanley you consent to the
> foregoing.
>


Re: Secondary Indexes, Quorum and Cluster Availability

2012-06-07 Thread Jim Ancona
On Thu, Jun 7, 2012 at 5:41 AM, aaron morton wrote:

> Sounds good. Do you want to make the change ?
>
Done.

>
> Thanks for taking the time.
>
Thanks for giving the answer!

Jim


>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 7/06/2012, at 7:54 AM, Jim Ancona wrote:
>
> On Tue, Jun 5, 2012 at 4:30 PM, Jim Ancona  wrote:
>
>> It might be a good idea for the documentation to reflect the tradeoffs
>> more clearly.
>
>
> Here's a proposed addition to the Secondary Index FAQ at
> http://wiki.apache.org/cassandra/SecondaryIndexes
>
> Q: How does choice of Consistency Level affect cluster availability when
> using secondary indexes?
> A: Because secondary indexes are distributed, you must have CL level nodes
> available for *all* token ranges in the cluster in order to complete a
> query. For example, with RF = 3, when two out of three consecutive nodes in
> the ring are unavailable, *all* secondary index queries at CL = QUORUM
> will fail, however secondary index queries at CL = ONE will succeed. This
> is true regardless of cluster size.
>
> Comments?
>
> Jim
>
>
>


Re: Secondary Indexes, Quorum and Cluster Availability

2012-06-06 Thread Jim Ancona
On Tue, Jun 5, 2012 at 4:30 PM, Jim Ancona  wrote:

> It might be a good idea for the documentation to reflect the tradeoffs
> more clearly.


Here's a proposed addition to the Secondary Index FAQ at
http://wiki.apache.org/cassandra/SecondaryIndexes

Q: How does choice of Consistency Level affect cluster availability when
using secondary indexes?
A: Because secondary indexes are distributed, you must have CL level nodes
available for *all* token ranges in the cluster in order to complete a
query. For example, with RF = 3, when two out of three consecutive nodes in
the ring are unavailable, *all* secondary index queries at CL = QUORUM will
fail, however secondary index queries at CL = ONE will succeed. This is
true regardless of cluster size.

Comments?

Jim


Re: Secondary Indexes, Quorum and Cluster Availability

2012-06-05 Thread Jim Ancona
On Mon, Jun 4, 2012 at 2:34 PM, aaron morton wrote:

> IIRC index slices work a little differently with consistency, they need to
> have CL level nodes available for all token ranges. If you drop it to CL
> ONE the read is local only for a particular token range.
>

Yes, this is what we observed. When I reasoned my way through what I knew
about how secondary indexes work, I came to the same conclusion about all
token ranges having to be available.

My surprise at the behavior was because I *hadn't* reasoned my way through
it until we had the issue. Somehow I doubt I'm the only user of secondary
indexes that was unaware of this ramification of CL choice. It might be a
good idea for the documentation to reflect the tradeoffs more clearly.

Thanks for you help!

Jim


>
> The problem when doing index reads is the nodes that contain the results
> can no longer be selected by the partitioner.
>

> Cheers
>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 2/06/2012, at 5:15 AM, Jim Ancona wrote:
>
> Hi,
>
> We have an application with two code paths, one of which uses a secondary
> index query and the other, which doesn't. While testing node down scenarios
> in our cluster we got a result which surprised (and concerned) me, and I
> wanted to find out if the behavior we observed is expected.
>
> Background:
>
>- 6 nodes in the cluster (in order: A, B, C, E, F and G)
>- RF = 3
>- All operations at QUORUM
>- Operation 1: Read by row key followed by write
>- Operation 2: Read by secondary index, followed by write
>
> While running a mixed workload of operations 1 and 2, we got the following
> results:
>
>  Scenario             Result
>  All nodes up         All operations succeed
>  One node down        All operations succeed
>  Nodes A and E down   All operations succeed
>  Nodes A and B down   Operation 1: ~33% fail; Operation 2: All fail
>  Nodes A and C down   Operation 1: ~17% fail; Operation 2: All fail
> We had expected (perhaps incorrectly) that the secondary index reads would
> fail in proportion to the portion of the ring that was unable to reach
> quorum, just as the row key reads did. For both operation types the
> underlying failure was an UnavailableException.
>
> The same pattern repeated for the other scenarios we tried. The row key
> operations failed at the expected ratios, given the portion of the ring
> that was unable to meet quorum because of nodes down, while all the
> secondary index reads failed as soon as 2 out of any 3 adjacent nodes were
> down.
>
> Is this an expected behavior? Is it documented anywhere? I didn't find it
> with a quick search.
>
> The operation doing secondary index query is an important one for our app,
> and we'd really prefer that it degrade gracefully in the face of cluster
> failures. My plan at this point is to do that query at ConsistencyLevel.ONE
> (and accept the increased risk of inconsistency). Will that work?
>
> Thanks in advance,
>
> Jim
>
>
>


Re: Cassandra 1.1.1 release?

2012-06-02 Thread Jim Ancona
The release vote is going on now on the dev list. So probably in the next
day or two, assuming no problems pop up.

Jim

On Wed, May 30, 2012 at 1:29 PM, Roland Mechler  wrote:

> Anyone have a rough idea of when Cassandra 1.1.1 is likely to be released?
>
> -Roland
>
>


Secondary Indexes, Quorum and Cluster Availability

2012-06-01 Thread Jim Ancona
Hi,

We have an application with two code paths, one of which uses a secondary
index query and the other, which doesn't. While testing node down scenarios
in our cluster we got a result which surprised (and concerned) me, and I
wanted to find out if the behavior we observed is expected.

Background:

   - 6 nodes in the cluster (in order: A, B, C, E, F and G)
   - RF = 3
   - All operations at QUORUM
   - Operation 1: Read by row key followed by write
   - Operation 2: Read by secondary index, followed by write

While running a mixed workload of operations 1 and 2, we got the following
results:

 Scenario             Result
 All nodes up         All operations succeed
 One node down        All operations succeed
 Nodes A and E down   All operations succeed
 Nodes A and B down   Operation 1: ~33% fail; Operation 2: All fail
 Nodes A and C down   Operation 1: ~17% fail; Operation 2: All fail
We had expected (perhaps incorrectly) that the secondary index reads would
fail in proportion to the portion of the ring that was unable to reach
quorum, just as the row key reads did. For both operation types the
underlying failure was an UnavailableException.

The same pattern repeated for the other scenarios we tried. The row key
operations failed at the expected ratios, given the portion of the ring
that was unable to meet quorum because of nodes down, while all the
secondary index reads failed as soon as 2 out of any 3 adjacent nodes were
down.

Is this an expected behavior? Is it documented anywhere? I didn't find it
with a quick search.

The operation doing secondary index query is an important one for our app,
and we'd really prefer that it degrade gracefully in the face of cluster
failures. My plan at this point is to do that query at ConsistencyLevel.ONE
(and accept the increased risk of inconsistency). Will that work?

Thanks in advance,

Jim


Re: single row key continues to grow, should I be concerned?

2012-03-23 Thread Jim Ancona
I'm dealing with a similar issue, with an additional complication. We are
collecting time series data, and the amount of data per time period varies
greatly. We will collect and query event data by account, but the biggest
account will accumulate about 10,000 times as much data per time period as
the median account. So for the median account I could put multiple years of
data in one row, while for the largest accounts I don't want to put more
one day's worth in a row. If I use a uniform bucket size of one day (to
accomodate the largest accounts) it will make for rows that are too short
for the vast majority of accounts--we would have to read thirty rows to get
a month's worth of data. One obvious approach is to set a maximum row size,
that is, write data in a row until it reaches a maximum length, then start
a new one. There are two things that make that harder than it sounds:

   1. There's no efficient way to count columns in a Cassandra row in order
   to find out when to start a new one.
   2. Row keys aren't searchable. So I need to be able to construct or look
   up the key to each row that contains a account's data. (Our data will be in
   reverse date order.)

Possible solutions:

   1. Cassandra counter columns are an efficient way to keep counts
   2. I could have a "directory" row that contains pointers to the rows
   that contain an account data

(I could probably combine the row directory and the column counter into a
single counter column family, where the column name is the row key and the
value is the counter.) A naive solution would require reading the directory
before every read and the counter before every write--caching could
probably help with that. So this approach would probably lead to a
reasonable solution, but it's liable to be somewhat complex. Before I go
much further down this path, I thought I'd run it by this group in case
someone can point out a more clever solution.
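
As a rough illustration of that directory-plus-counter idea, here is a sketch
with plain in-memory maps standing in for the counter and directory column
families. The row-key format, the 100,000-row cap and the method names are
assumptions made for the example, not an established pattern or API.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class SegmentedRowWriter {
    static final long MAX_ROWS_PER_SEGMENT = 100_000;

    // stand-in for a directory/counter CF: accountId -> index of the newest segment
    private final Map<String, Integer> currentSegment = new ConcurrentHashMap<>();
    // stand-in for counter columns: physical row key -> number of columns written to it
    private final Map<String, AtomicLong> rowCounts = new ConcurrentHashMap<>();

    /** Decide which physical row the next write for this account should go to. */
    String rowKeyForNextWrite(String accountId) {
        int segment = currentSegment.getOrDefault(accountId, 0);
        String rowKey = accountId + ":" + segment;
        long count = rowCounts.computeIfAbsent(rowKey, k -> new AtomicLong()).get();
        if (count >= MAX_ROWS_PER_SEGMENT) {   // roll over to a new physical row
            segment++;
            currentSegment.put(accountId, segment);
            rowKey = accountId + ":" + segment;
        }
        return rowKey;
    }

    /** Call after a successful write so the cap check stays (approximately) current. */
    void recordWrite(String rowKey) {
        rowCounts.computeIfAbsent(rowKey, k -> new AtomicLong()).incrementAndGet();
    }

    /** Read path: newest segment first, walking backwards until enough columns are found. */
    List<String> rowKeysNewestFirst(String accountId) {
        int latest = currentSegment.getOrDefault(accountId, 0);
        List<String> keys = new ArrayList<>();
        for (int s = latest; s >= 0; s--) {
            keys.add(accountId + ":" + s);
        }
        return keys;
    }
}

In the real version the two maps would be reads of the counter column family
(ideally cached), and the rollover decision can tolerate a slightly stale
count, since the cap only needs to be approximate.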

Thanks,

Jim
On Thu, Mar 22, 2012 at 5:36 PM, Alexandru Sicoe  wrote:

> Thanks Aaron, I'll lower the time bucket, see how it goes.
>
> Cheers,
> Alex
>
>
> On Thu, Mar 22, 2012 at 10:07 PM, aaron morton wrote:
>
>> Will adding a few tens of wide rows like this every day cause me problems
>> on the long term? Should I consider lowering the time bucket?
>>
>> IMHO yeah, yup, ya and yes.
>>
>>
>> From experience I am a bit reluctant to create too many rows because I
>> see that reading across multiple rows seriously affects performance. Of
>> course I will use map-reduce as well ...will it be significantly affected
>> by many rows?
>>
>> Don't think it would make too much difference.
>> range slice used by map-reduce will find the first row in the batch and
>> then step through them.
>>
>> Cheers
>>
>>
>>   -
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 22/03/2012, at 11:43 PM, Alexandru Sicoe wrote:
>>
>> Hi guys,
>>
>> Based on what you are saying there seems to be a tradeoff that developers
>> have to handle between:
>>
>>"keep your rows under a certain size" vs
>> "keep data that's queried together, on disk together"
>>
>> How would you handle this tradeoff in my case:
>>
>> I monitor about 40.000 independent timeseries streams of data. The
>> streams have highly variable rates. Each stream has its own row and I go to
>> a new row every 28 hrs. With this scheme, I see several tens of rows
>> reaching sizes in the millions of columns within this time bucket (largest
>> I saw was 6.4 million). The sizes of these wide rows are around 400MBytes
>> (considerably > than 60MB)
>>
>> Will adding a few tens of wide rows like this every day cause me problems
>> on the long term? Should I consider lowering the time bucket?
>>
>> From experience I am a bit reluctant to create too many rows because I
>> see that reading across multiple rows seriously affects performance. Of
>> course I will use map-reduce as well ...will it be significantly affected
>> by many rows?
>>
>> Cheers,
>> Alex
>>
>> On Tue, Mar 20, 2012 at 6:37 PM, aaron morton wrote:
>>
>>> The reads are only fetching slices of 20 to 100 columns max at a time
>>> from the row but if the key is planted on one node in the cluster I am
>>> concerned about that node getting the brunt of traffic.
>>>
>>> What RF are you using, how many nodes are in the cluster, what CL do you
>>> read at ?
>>>
>>> If you have lots of nodes that are in different racks the
>>> NetworkTopologyStrategy will do a better job of distributing read load than
>>> the SimpleStrategy. The DynamicSnitch can also result distribute load, see
>>> cassandra yaml for it's configuration.
>>>
>>> I thought about breaking the column data into multiple different row
>>> keys to help distribute throughout the cluster but its so darn handy having
>>> all the columns in one key!!
>>>
>>> If you have a row that will continually grow it is a good idea to
>>> partition it in some way. Large rows can slow things 

Re: yet a couple more questions on composite columns

2012-02-06 Thread Jim Ancona
On Sat, Feb 4, 2012 at 8:54 PM, Yiming Sun  wrote:
> Interesting idea, Jim.  Is there a reason you don't you use
> "metadata:{accountId}" instead?  For performance reasons?

No, because the column comparator is defined as
CompositeType(DateType, AsciiType), and all column names must conform
to that.

Jim

>
>
> On Sat, Feb 4, 2012 at 6:24 PM, Jim Ancona  wrote:
>>
>> I've used "special" values which still comply with the Composite
>> schema for the metadata columns, e.g. a column of
>> 1970-01-01:{accountId} for a metadata column where the Composite is
>> DateType:UTF8Type.
>>
>> Jim
>>
>> On Sat, Feb 4, 2012 at 2:13 PM, Yiming Sun  wrote:
>> > Thanks Andrey and Chris.  It sounds like we don't necessarily have to
>> > use
>> > composite columns.  From what I understand about dynamic CF, each row
>> > may
>> > have completely different data from other rows;  but in our case, the
>> > data
>> > in each row is similar to other rows; my concern was more about the
>> > homogeneity of the data between columns.
>> >
>> > In our original supercolumn-based schema, one special supercolumn is
>> > called
>> > "metadata" which contains a number of subcolumns to hold metadata
>> > describing
>> > each collection (e.g. number of documents, etc.), then the rest of the
>> > supercolumns in the same row are all IDs of documents belong to the
>> > collection, and for each document supercolumn, the subcolumns contain
>> > the
>> > document content as well as metadata on individual document (e.g.
>> > checksum
>> > of each document).
>> >
>> > To move away from the supercolumn schema, I could either create two CFs,
>> > one
>> > to hold metadata, the other document content; or I could create just one
>> > CF
>> > mixing metadata and doc content in the same row, and using composite
>> > column
>> > names to identify if the particular column is metadata or a document.  I
>> > am
>> > just wondering if you have any inputs on the pros and cons of each
>> > schema.
>> >
>> > -- Y.
>> >
>> >
>> > On Fri, Feb 3, 2012 at 10:27 PM, Chris Gerken
>> > 
>> > wrote:
>> >>
>> >>
>> >>
>> >>
>> >> On 4 February 2012 06:21, Yiming Sun  wrote:
>> >>>
>> >>> I cannot have one composite column name with 3 components while
>> >>> another
>> >>> with 4 components?
>> >>
>> >>  Just put 4 components and left last empty (if it is same type)?!
>> >>
>> >>> Another question I have is how flexible composite columns actually
>> >>> are.
>> >>>  If my data model has a CF containing US zip codes with the following
>> >>> composite columns:
>> >>>
>> >>> {OH:Spring Field} : 45503
>> >>> {OH:Columbus} : 43085
>> >>> {FL:Spring Field} : 32401
>> >>> {FL:Key West}  : 33040
>> >>>
>> >>> I know I can ask cassandra to "give me the zip codes of all cities in
>> >>> OH".  But can I ask it to "give me the zip codes of all cities named
>> >>> Spring
>> >>> Field" using this model?  Thanks.
>> >>
>> >> No. You set first composite component at first.
>> >>
>> >>
>> >> I'd use a dynamic CF:
>> >> row key = state abbreviation
>> >> column name = city name
>> >> column value = zip code (or a complex object, one of whose properties
>> >> is
>> >> zip code)
>> >>
>> >> you can iterate over the columns in a single row to get a state's city
>> >> names and their zip code and you can do a get_range_slices on all keys
>> >> for
>> >> the columns starting and ending on the city name to find out the zip
>> >> codes
>> >> for a cities with the given name.
>> >>
>> >> I think
>> >>
>> >> - Chris
>> >
>> >
>
>


Re: yet a couple more questions on composite columns

2012-02-04 Thread Jim Ancona
I've used "special" values which still comply with the Composite
schema for the metadata columns, e.g. a column of
1970-01-01:{accountId} for a metadata column where the Composite is
DateType:UTF8Type.
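
A small sketch of why that sentinel behaves well, using a stand-in comparator
for the (DateType, UTF8Type) composite. The record type and the in-memory
TreeMap are only for demonstration; real column names would be built with the
client library's Composite support.

import java.util.Comparator;
import java.util.Date;
import java.util.TreeMap;

public class CompositeSentinelDemo {
    record Name(Date date, String id) {}

    public static void main(String[] args) {
        // order by date first, then by string, the way the composite comparator would
        Comparator<Name> byComposite =
                Comparator.comparing(Name::date).thenComparing(Name::id);
        TreeMap<Name, String> row = new TreeMap<>(byComposite);

        row.put(new Name(new Date(0L), "account-42"), "metadata blob");        // 1970-01-01 sentinel
        row.put(new Name(new Date(1328000000000L), "doc-1"), "document body"); // a real 2012 timestamp

        // The sentinel sorts ahead of every real date, so a slice from the start
        // of the row always returns the metadata column(s) first.
        System.out.println(row.firstEntry().getValue()); // -> metadata blob
    }
}

(With a reversed DateType comparator the sentinel would land at the end of the
row instead; either way it clusters the metadata columns predictably.)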

Jim

On Sat, Feb 4, 2012 at 2:13 PM, Yiming Sun  wrote:
> Thanks Andrey and Chris.  It sounds like we don't necessarily have to use
> composite columns.  From what I understand about dynamic CF, each row may
> have completely different data from other rows;  but in our case, the data
> in each row is similar to other rows; my concern was more about the
> homogeneity of the data between columns.
>
> In our original supercolumn-based schema, one special supercolumn is called
> "metadata" which contains a number of subcolumns to hold metadata describing
> each collection (e.g. number of documents, etc.), then the rest of the
> supercolumns in the same row are all IDs of documents belong to the
> collection, and for each document supercolumn, the subcolumns contain the
> document content as well as metadata on individual document (e.g. checksum
> of each document).
>
> To move away from the supercolumn schema, I could either create two CFs, one
> to hold metadata, the other document content; or I could create just one CF
> mixing metadata and doc content in the same row, and using composite column
> names to identify if the particular column is metadata or a document.  I am
> just wondering if you have any inputs on the pros and cons of each schema.
>
> -- Y.
>
>
> On Fri, Feb 3, 2012 at 10:27 PM, Chris Gerken 
> wrote:
>>
>>
>>
>>
>> On 4 February 2012 06:21, Yiming Sun  wrote:
>>>
>>> I cannot have one composite column name with 3 components while another
>>> with 4 components?
>>
>>  Just put 4 components and left last empty (if it is same type)?!
>>
>>> Another question I have is how flexible composite columns actually are.
>>>  If my data model has a CF containing US zip codes with the following
>>> composite columns:
>>>
>>> {OH:Spring Field} : 45503
>>> {OH:Columbus} : 43085
>>> {FL:Spring Field} : 32401
>>> {FL:Key West}  : 33040
>>>
>>> I know I can ask cassandra to "give me the zip codes of all cities in
>>> OH".  But can I ask it to "give me the zip codes of all cities named Spring
>>> Field" using this model?  Thanks.
>>
>> No. You set first composite component at first.
>>
>>
>> I'd use a dynamic CF:
>> row key = state abbreviation
>> column name = city name
>> column value = zip code (or a complex object, one of whose properties is
>> zip code)
>>
>> you can iterate over the columns in a single row to get a state's city
>> names and their zip code and you can do a get_range_slices on all keys for
>> the columns starting and ending on the city name to find out the zip codes
>> for a cities with the given name.
>>
>> I think
>>
>> - Chris
>
>


cassandra-cli: Create column family with composite column name

2011-10-05 Thread Jim Ancona
Using Cassandra 0.8.6, I've been trying to figure out how to use the
CLI to create column families using composite keys and column names.
The documentation on CompositeType seems pretty skimpy. But in the
course of writing this email to ask how to do it, I figured out the
proper syntax. In the hope of making it easier for the next person, I
repurposed this message to document what I figured out. I'll also
update the wiki. Here is the syntax:

create column family MyCF
with key_validation_class = 'CompositeType(UTF8Type, IntegerType)'
and comparator = 'CompositeType(DateType(reversed=true), UTF8Type)'
and default_validation_class='CompositeType(UTF8Type, DateType)'
and column_metadata=[
{ column_name:'0:my Column Name', validation_class:LongType,
index_type:KEYS}
];

One weakness of this syntax is that there doesn't seem to be a way to
escape a ':' in a composite value. There's a FIXME in the code to that
effect.

Jim


Re: Update of column sometimes takes 10 seconds

2011-09-26 Thread Jim Ancona
Do you actually see the update occur if you wait for 10 seconds (as your
subject implies), or do you just see intermittent failures when running the
unit test? If it's the latter, are you sure that the update has a greater
timestamp than the insert? I've seen similar unit tests fail because
the timestamp values were the same.

Jim

On Mon, Sep 26, 2011 at 2:55 PM, Rick Whitesel (rwhitese) <
rwhit...@cisco.com> wrote:

> Hi All:
>
> ** **
>
> We have a simple junit test that inserts a column, immediately updates that
> column and then validates that the data was updated. Cassandra is run embedded
> in the unit test. Sometimes the test will pass, i.e. the updated data is
> correct, and sometimes the test will fail. The configuration is set to:
>
> periodic and
> 1
>
> ** **
>
> We are running version 0.6.9. We plan to update to the latest version but
> cannot until after the release we are wrapping up. We are using the client
> batch mutate to create and update the data. From what I understand, the
> commit log write will return immediately and the data will be store in
> memory. If that is the case, then why would our test sometimes fail?
>
> ** **
>
> -Rick Whitesel
>
> ** **
>
>  
>
> *Simplify, Standardize and Conserve*
>
> ** **
>
> ** **
>
>
> ** **
>
> *Rick Whitesel*
> *Technical Lead*
> *Customer Contact Business Unit**
> *
> rwhit...@cisco.com 
> Phone :*978-936-0479*
>
>
> 500 Beaver Brook Road
> Boxboro, MA 01719
> Mailing address:
> 1414 Massachusetts Avenue
> Boxboro, MA 01719
> United States
> www.cisco.com
>
> ** **
>
> ** **
>

Re: TransportException when storing large values

2011-09-20 Thread Jim Ancona
Pete,

See this thread
http://groups.google.com/group/hector-users/browse_thread/thread/cb3e72c85dbdd398/82b18ffca0e3940a?#82b18ffca0e3940a
for a bit more info.

Jim

On Tue, Sep 20, 2011 at 9:02 PM, Tyler Hobbs  wrote:
> From cassandra.yaml:
>
> # Frame size for thrift (maximum field length).
> # 0 disables TFramedTransport in favor of TSocket. This option
> # is deprecated; we strongly recommend using Framed mode.
> thrift_framed_transport_size_in_mb: 15
>
> So you can either increase that limit, or split your write into multiple
> operations.
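
The "split your write" option amounts to chunking the value before it is
inserted; a minimal, hedged sketch (the chunk size and any chunk-column naming
scheme are up to the application, nothing here is a Cassandra API):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Break a large value into pieces that each stay well under the frame limit. */
class ValueChunker {
    static List<byte[]> chunk(byte[] value, int chunkSizeBytes) {
        List<byte[]> chunks = new ArrayList<>();
        for (int offset = 0; offset < value.length; offset += chunkSizeBytes) {
            int end = Math.min(value.length, offset + chunkSizeBytes);
            chunks.add(Arrays.copyOfRange(value, offset, end));
        }
        return chunks;
    }
}

Each chunk is then written as its own column (for example under a hypothetical
name like "test_column:<i>") and the pieces are concatenated again on read, so
no single insert approaches the configured frame size.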
>
> On Tue, Sep 20, 2011 at 4:26 PM, Pete Warden  wrote:
>>
>> I'm running into consistent problems when storing values larger than 15MB
>> into Cassandra, and I was hoping for some advice on tracking down what's
>> going wrong. From the FAQ it seems like what I'm trying to do is possible,
>> so I assume I'm messing something up with my configuration. I have a minimal
>> set of code to reproduce the issue below, which I've run on the DataStax
>> 0.8.1 AMI I'm using in production (ami-9996c4dc)
>> # To set up the test data structure on Cassandra:
>> cassandra-cli
>> connect localhost/9160;
>> create keyspace TestKeyspace with
>>   placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy' and
>>   strategy_options = [{replication_factor:3}];
>> use TestKeyspace;
>> create column family TestFamily with
>>   comparator = UTF8Type and
>>   column_metadata =
>>   [
>>     {column_name: test_column, validation_class: UTF8Type},
>>   ];
>> # From bash on the same machine, with Ruby and the Cassandra gem
>> installed:
>> irb
>> require 'rubygems'
>> require 'cassandra/0.8'
>> client = Cassandra.new('TestKeyspace', 'localhost:9160', :retries => 5,
>> :connect_timeout => 5, :timeout => 10)
>> # With data this size, the call works
>> column_value = 'a' * (14*1024*1024)
>> row_value = { 'column_name' => column_value }
>> client.insert(:TestFamily, 'SomeKey', row_value)
>>
>> # With data this size, the call fails with the exception below
>> column_value = 'a' * (15*1024*1024)
>> row_value = { 'column_name' => column_value }
>> client.insert(:TestFamily, 'SomeKey', row_value)
>> # Results:
>> This first call with a 14MB chunk of data succeeds, but the second one
>> fails with this exception:
>> CassandraThrift::Cassandra::Client::TransportException:
>> CassandraThrift::Cassandra::Client::TransportException
>> from
>> /usr/lib/ruby/gems/1.8/gems/thrift-0.7.0/lib/thrift/transport/socket.rb:53:in
>> `open'
>> from
>> /usr/lib/ruby/gems/1.8/gems/thrift-0.7.0/lib/thrift/transport/framed_transport.rb:37:in
>> `open'
>> from
>> /usr/lib/ruby/gems/1.8/gems/thrift_client-0.7.1/lib/thrift_client/connection/socket.rb:11:in
>> `connect!'
>> from
>> /usr/lib/ruby/gems/1.8/gems/thrift_client-0.7.1/lib/thrift_client/abstract_thrift_client.rb:105:in
>> `connect!'
>> from
>> /usr/lib/ruby/gems/1.8/gems/thrift_client-0.7.1/lib/thrift_client/abstract_thrift_client.rb:144:in
>> `handled_proxy'
>> from
>> /usr/lib/ruby/gems/1.8/gems/thrift_client-0.7.1/lib/thrift_client/abstract_thrift_client.rb:60:in
>> `batch_mutate'
>> from
>> /usr/lib/ruby/gems/1.8/gems/cassandra-0.12.1/lib/cassandra/protocol.rb:7:in
>> `_mutate'
>> from
>> /usr/lib/ruby/gems/1.8/gems/cassandra-0.12.1/lib/cassandra/cassandra.rb:459:in
>> `insert'
>> from (irb):6
>> from :0
>> Any suggestions on how to dig deeper? I'll be reaching out to the
>> Cassandra gem folks, etc too of course.
>> cheers,
>>            Pete
>
>
> --
> Tyler Hobbs
> Software Engineer, DataStax
> Maintainer of the pycassa Cassandra Python client library
>
>


Re: what's the difference between repair CF separately and repair the entire node?

2011-09-12 Thread Jim Ancona
On Mon, Sep 12, 2011 at 1:44 PM, Peter Schuller
 wrote:
>> I am using 0.7.4.  so it is always okay to do the routine repair on
>> Column Family basis? thanks!
>
> It's "okay" but won't do what you want; due to a bug you'll see
> streaming of data for other column families than the one you're trying
> to repair. This will be fixed in 1.0.

I think we might be running into this. Is CASSANDRA-2280 the issue
you're referring to?

Jim


Re: Cassandra client loses connectivity to cluster

2011-09-06 Thread Jim Ancona
Since we finally fixed this issue, I thought I'd document the
solution, with the hope that it makes it easier for others who might
run into it.

During the time this issue was occurring Anthony Ikeda reported a very
similar issue, although without the strange pattern of occurrences we
saw: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Trying-to-find-the-problem-with-a-broken-pipe-td6645526.html

It turns out that our problem was the same as Anthony's: exceeding
Thrift's maximum frame size, as set by
thrift_framed_transport_size_in_mb in cassandra.yaml. This problem was
extremely hard to troubleshoot, for the following reasons:

* TFramedTransport responds to an oversized frame by throwing a
TTransportException, which is a generic exception thrown for various
types of network or protocol errors. Because such errors are common,
many servers (TSimpleServer, TThreadPoolServer, and Cassandra's
CustomTThreadPoolServer) swallow TTransportException without a log
message. I've filed https://issues.apache.org/jira/browse/THRIFT-1323
and https://issues.apache.org/jira/browse/CASSANDRA-3142 to address
the lack of logging. (We finally found the issue by adding logging
code to our production Cassandra deploy. The fact that we could do
that it a big win for open source.)
* After the TTransportException occurs, the server closes the
underlying socket. To the client (Hector in our case), this appears as
broken socket, most likely caused by a network problem or failed
server node. Hector responds by marking the server node down and
retrying the too-large request on another node, where it also fails.
This process repeated leads to the entire cluster being marked down
(see https://github.com/rantav/hector/issues/212).
* Ideally, sending an oversized frame should trigger a recognizable
error on the client, so that the client knows that it has made a error
and avoids compounding the mistake by retrying. Thrift's framed
transport is pretty simple and I assume there isn't a good way for the
server to communicate the error to the client. As a second-best
solution, I've logged a bug against Thrift
(https://issues.apache.org/jira/browse/THRIFT-1324) saying that
TFramedTransport should enforce the configured frame size limit on
writes. At least that way people can avoid the issue by configuring a
client frame size to match their servers'. If that is implemented then
clients like Hector will be able to detect the frame too large case
and return an error instead of retrying it.
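
Until something like that lands in Thrift, a client-side guard is easy to
sketch. The class name and the way the serialized payload is obtained are
assumptions; the only point is to fail fast locally instead of letting the
server silently drop the connection.

class FrameSizeGuard {
    private final long maxFrameBytes;

    FrameSizeGuard(int thriftFramedTransportSizeInMb) {
        // mirror the server's thrift_framed_transport_size_in_mb setting
        this.maxFrameBytes = (long) thriftFramedTransportSizeInMb * 1024 * 1024;
    }

    /** Call with the serialized mutation before sending it to Cassandra. */
    void check(byte[] serializedMutation) {
        if (serializedMutation.length >= maxFrameBytes) {
            throw new IllegalArgumentException("Mutation of " + serializedMutation.length
                    + " bytes would exceed the " + maxFrameBytes
                    + "-byte frame limit; split it into smaller batches instead of retrying.");
        }
    }
}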

In addition to the issues above, some other things made this issue
difficult to solve:
* The pattern of occurrences (only occurring at a certain time of day,
on a single server at a time, only on weekdays, etc.) was something of
a distraction.
* Even after finding out that Anthony's similar problem was caused by
an oversized frame, I was convinced that we could not be generating an
operation large enough to exceed the configured value (15 mb).

It turns out that I was almost right: out of our hundreds of thousands
of customers, exactly one was working with data that large, and that
customer was doing so not because of anomalous behavior on their part,
but because of a bug in our system. So the fact that it was a single
customer explained the regular occurrences, and the bug explained the
unexpectedly large data size. Of course in this case "almost right"
wasn't good enough, my BOTE calculation failed to take the bug into
account. Plus, as I tweeted immediately after I figured out what was
going on, "Lesson: when you have millions of users it becomes easier
to say things about averages, but harder to do the same for extremes."

Jim

On Wed, Jun 29, 2011 at 5:42 PM, Jim Ancona  wrote:
> In reviewing client logs as part of our Cassandra testing, I noticed
> several Hector "All host pools marked down" exceptions in the logs.
> Further investigation showed a consistent pattern of
> "java.net.SocketException: Broken pipe" and "java.net.SocketException:
> Connection reset" messages. These errors occur for all 36 hosts in the
> cluster over a period of seconds, as Hector tries to find a working
> host to connect to. Failing to find a host results in the "All host
> pools marked down" messages. These messages recur for a period ranging
> from several seconds up to almost 15 minutes, clustering around two to
> three minutes. Then connectivity returns and when Hector tries to
> reconnect it succeeds.
>
> The clients are instances of a JBoss 5 web application. We use Hector
> 0.7.0-29 (plus a patch that was pulled in advance of -30) The
> Cassandra cluster has 72 nodes split between two datacenters. It's
> running 0.7.5 plus a couple of bug fixes pulled in advance of 0.7.6.
> The keyspace uses NetworkTopologyStrategy and RF=6 (3 in each
> datacenter). The clients are reading 

Re: Professional Support

2011-09-06 Thread Jim Ancona
We use Datastax (http://www.datastax.com) and we have been very happy
with the support we've received.

We haven't tried any of the other providers on that page, so I can't
comment on them.

Jim
(Disclaimer: no connection with Datastax other than as a satisfied customer.)

On Tue, Sep 6, 2011 at 1:15 PM, China Stoffen  wrote:
> There is a link to a page which lists few professional support providers on
> Cassandra homepage. I have contacted few of them and couple are just out of
> providing support and others didn't reply. So, do you know about any
> professional support provider for Cassandra solutions and how much they
> charge per year?
>


Re: Updates lost

2011-08-31 Thread Jim Ancona
You could also look at Hector's approach in:
https://github.com/rantav/hector/blob/master/core/src/main/java/me/prettyprint/cassandra/service/clock/MicrosecondsSyncClockResolution.java

It works well and I believe there was some performance testing done on
it as well.
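
For anyone who just wants the idea rather than the Hector source, a minimal
sketch in the same spirit (not Hector's actual code) looks like this:

final class MicrosecondClock {
    private long lastMicros = 0;

    /** Returns a strictly increasing microsecond timestamp across calls. */
    synchronized long nextTimestampMicros() {
        long nowMicros = System.currentTimeMillis() * 1000; // millisecond precision, in micros
        if (nowMicros <= lastMicros) {
            nowMicros = lastMicros + 1;                     // break ties within the same millisecond
        }
        lastMicros = nowMicros;
        return nowMicros;
    }
}

Two writes issued back to back from the same client then always carry distinct,
ordered timestamps, which is what the lost-update scenario in this thread needs.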

Jim

On Tue, Aug 30, 2011 at 3:43 PM, Jeremy Hanna
 wrote:
> Sorry - misread your earlier email.  I would login to IRC and ask in 
> #cassandra.  I would think given the nature of nanotime you'll run into 
> harder to track down problems, but it may be fine.
>
> On Aug 30, 2011, at 2:06 PM, Jiang Chen wrote:
>
>> Do you see any problem with my approach to derive the current time in
>> nano seconds though?
>>
>> On Tue, Aug 30, 2011 at 2:39 PM, Jeremy Hanna
>>  wrote:
>>> Yes - the reason why internally Cassandra uses milliseconds * 1000 is 
>>> because System.nanoTime javadoc says "This method can only be used to 
>>> measure elapsed time and is not related to any other notion of system or 
>>> wall-clock time."
>>>
>>> http://download.oracle.com/javase/6/docs/api/java/lang/System.html#nanoTime%28%29
>>>
>>> On Aug 30, 2011, at 1:31 PM, Jiang Chen wrote:
>>>
 Indeed it's microseconds. We are talking about how to achieve the
 precision of microseconds. One way is System.currentTimeInMillis() *
 1000. It's only precise to milliseconds. If there are more than one
 update in the same millisecond, the second one may be lost. That's my
 original problem.

 The other way is to derive from System.nanoTime(). This function
 doesn't directly return the time since epoch. I used the following:

       private static long nanotimeOffset = System.nanoTime()
                        - System.currentTimeMillis() * 1000000; // millis -> nanos

       private static long currentTimeNanos() {
               return System.nanoTime() - nanotimeOffset;
       }

 The timestamp to use is then currentTimeNanos() / 1000.

 Anyone sees problem with this approach?

 On Tue, Aug 30, 2011 at 2:20 PM, Edward Capriolo  
 wrote:
>
>
> On Tue, Aug 30, 2011 at 1:41 PM, Jeremy Hanna 
> wrote:
>>
>> I would not use nano time with cassandra.  Internally and throughout the
>> clients, milliseconds is pretty much a standard.  You can get into 
>> trouble
>> because when comparing nanoseconds with milliseconds as long numbers,
>> nanoseconds will always win.  That bit us a while back when we deleted
>> something and it couldn't come back because we deleted it with 
>> nanoseconds
>> as the timestamp value.
>>
>> See the caveats for System.nanoTime() for why milliseconds is a standard:
>>
>> http://download.oracle.com/javase/6/docs/api/java/lang/System.html#nanoTime%28%29
>>
>> On Aug 30, 2011, at 12:31 PM, Jiang Chen wrote:
>>
>>> Looks like the theory is correct for the java case at least.
>>>
>>> The default timestamp precision of Pelops is millisecond. Hence the
>>> problem as explained by Peter. Once I supplied timestamps precise to
>>> microsecond (using System.nanoTime()), the problem went away.
>>>
>>> I previously stated that sleeping for a few milliseconds didn't help.
>>> It was actually because of the precision of Java Thread.sleep().
>>> Sleeping for less than 15ms often doesn't sleep at all.
>>>
>>> Haven't checked the Python side to see if it's similar situation.
>>>
>>> Cheers.
>>>
>>> Jiang
>>>
>>> On Tue, Aug 30, 2011 at 9:57 AM, Jiang Chen  wrote:
 It's a single node. Thanks for the theory. I suspect part of it may
 still be right. Will dig more.

 On Tue, Aug 30, 2011 at 9:50 AM, Peter Schuller
  wrote:
>> The problem still happens with very high probability even when it
>> pauses for 5 milliseconds at every loop. If Pycassa uses microseconds
>> it can't be the cause. Also I have the same problem with a Java
>> client
>> using Pelops.
>
> You connect to localhost, but is that a single node or part of a
> cluster with RF > 1? If the latter, you need to use QUORUM consistency
> level to ensure that a read sees your write.
>
> If it's a single node and not a pycassa / client issue, I don't know
> off hand.
>
> --
> / Peter Schuller (@scode on twitter)
>

>>
>
> Isn't the standard microseconds ? (System.currentTimeMillis()*1000L)
> http://wiki.apache.org/cassandra/DataModel
> The CLI uses microseconds. If your code and the CLI are doing different
> things with time BadThingsWillHappen TM
>
>
>>>
>>>
>
>


Re: Trying to find the problem with a broken pipe

2011-08-02 Thread Jim Ancona
og files
> have no errors reported.
> What versions of Hector and Cassandra are you running?
> Cassandra 0.8.1, Hector 0.8.0-1

Our issue is occurring with Cassandra 0.7.8 and Hector 0.7-30. We plan
to deploy Hector 0.7-31 this week and to turn on useSocketKeepalive.
Are you using that? We're also using tcpdump to capture packets when
failures occur to see if there are anomalies in the network traffic.

Jim


>
>
>
> On Tue, Aug 2, 2011 at 10:37 AM, Jim Ancona  wrote:
>>
>> On Tue, Aug 2, 2011 at 4:36 PM, Anthony Ikeda
>>  wrote:
>> > I'm not sure if this is a problem with Hector or with Cassandra.
>> > We seem to be seeing broken pipe issues with our connections on the
>> > client
>> > side (Exception below). A bit of googling finds possibly a problem with
>> > the
>> > amount of data we are trying to store, although I'm certain our datasets
>> > are
>> > not all that large.
>>
>> I'm not sure what you're referring to here. Large requests could lead
>> to timeouts, but that's not what you're seeing here. Could you link to
>> the page you're referencing?
>>
>> > A nodetool ring command doesn't seem to present any downed nodes:
>> > Address         DC          Rack        Status State   Load
>> >  Owns
>> >    Token
>> >
>> >   153951716904446304929228999025275230571
>> > 10.130.202.34   datacenter1 rack1       Up     Normal  470.74 KB
>> > 79.19%  118538200848404459763384037192174096102
>> > 10.130.202.35   datacenter1 rack1       Up     Normal  483.63 KB
>> > 20.81%  153951716904446304929228999025275230571
>> >
>> > There are no errors in the cassandra server logs.
>> >
>> > Are there any particular timeouts on connections that we need to be
>> > aware
>> > of? Or perhaps configure on the Cassandra nodes? Is this purely an
>> > issue
>> > with the Hector API configuration?
>>
>> There is a server side timeout (rpc_timeout_in_ms in cassandra.yaml)
>> and a Hector client-side timeout
>> (CassandraHostConfigurator.cassandraThriftSocketTimeout). But again,
>> the "Broken pipe" error is not a timeout, it indicates that something
>> happened to the underlying network socket. For example you will see
>> those when a server node is restarted.
>>
>> Some questions that might help troubleshoot this:
>> How often are these occurring?
>> Does this affect both nodes in the cluster or just one?
>> Are the exceptions related to any external events (e.g. node restarts,
>> network issues...)?
>> What versions of Hector and Cassandra are you running?
>>
>> Keep in mind that failures like this will normally be retried by
>> Hector, resulting in no loss of data. For that reason, I think that
>> exception is logged as a warning in the newest Hector versions.
>>
>> We've seen something similar, but more catastrophic because it affects
>> connectivity to the entire cluster, not just a single node. See this
>> post for more details: http://goo.gl/hrgkw So far we haven't
>> identified the cause.
>>
>> Jim
>>
>> > Anthony
>> >
>> > 2011-08-02 08:43:06,541 ERROR
>> > [me.prettyprint.cassandra.connection.HThriftClient] - Could not flush
>> > transport (to be expected if the pool is shutting down) in close for
>> > client:
>> > CassandraClient
>> > org.apache.thrift.transport.TTransportException:
>> > java.net.SocketException:
>> > Broken pipe
>> >       at
>> >
>> > org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147)
>> >       at
>> >
>> > org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156)
>> >       at
>> >
>> > me.prettyprint.cassandra.connection.HThriftClient.close(HThriftClient.java:85)
>> >       at
>> >
>> > me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:232)
>> >       at
>> >
>> > me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:131)
>> >       at
>> >
>> > me.prettyprint.cassandra.service.KeyspaceServiceImpl.getSlice(KeyspaceServiceImpl.java:289)
>> >       at
>> >
>> > me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:53)
>> >       at
>> >
>> > me.prettyprint.cassandra.model.thri

Re: Trying to find the problem with a broken pipe

2011-08-02 Thread Jim Ancona
On Tue, Aug 2, 2011 at 4:36 PM, Anthony Ikeda
 wrote:
> I'm not sure if this is a problem with Hector or with Cassandra.
> We seem to be seeing broken pipe issues with our connections on the client
> side (Exception below). A bit of googling finds possibly a problem with the
> amount of data we are trying to store, although I'm certain our datasets are
> not all that large.

I'm not sure what you're referring to here. Large requests could lead
to timeouts, but that's not what you're seeing here. Could you link to
the page you're referencing?

> A nodetool ring command doesn't seem to present any downed nodes:
> Address         DC          Rack        Status State   Load            Owns
>    Token
>
>   153951716904446304929228999025275230571
> 10.130.202.34   datacenter1 rack1       Up     Normal  470.74 KB
> 79.19%  118538200848404459763384037192174096102
> 10.130.202.35   datacenter1 rack1       Up     Normal  483.63 KB
> 20.81%  153951716904446304929228999025275230571
>
> There are no errors in the cassandra server logs.
>
> Are there any particular timeouts on connections that we need to be aware
> of? Or perhaps configure on the Cassandra nodes? Is this purely an issue
> with the Hector API configuration?

There is a server side timeout (rpc_timeout_in_ms in cassandra.yaml)
and a Hector client-side timeout
(CassandraHostConfigurator.cassandraThriftSocketTimeout). But again,
the "Broken pipe" error is not a timeout, it indicates that something
happened to the underlying network socket. For example you will see
those when a server node is restarted.

Some questions that might help troubleshoot this:
How often are these occurring?
Does this affect both nodes in the cluster or just one?
Are the exceptions related to any external events (e.g. node restarts,
network issues...)?
What versions of Hector and Cassandra are you running?

Keep in mind that failures like this will normally be retried by
Hector, resulting in no loss of data. For that reason, I think that
exception is logged as a warning in the newest Hector versions.

We've seen something similar, but more catastrophic because it affects
connectivity to the entire cluster, not just a single node. See this
post for more details: http://goo.gl/hrgkw So far we haven't
identified the cause.

Jim

> Anthony
>
> 2011-08-02 08:43:06,541 ERROR
> [me.prettyprint.cassandra.connection.HThriftClient] - Could not flush
> transport (to be expected if the pool is shutting down) in close for client:
> CassandraClient
> org.apache.thrift.transport.TTransportException: java.net.SocketException:
> Broken pipe
>   at
> org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147)
>   at
> org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156)
>   at
> me.prettyprint.cassandra.connection.HThriftClient.close(HThriftClient.java:85)
>   at
> me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:232)
>   at
> me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:131)
>   at
> me.prettyprint.cassandra.service.KeyspaceServiceImpl.getSlice(KeyspaceServiceImpl.java:289)
>   at
> me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:53)
>   at
> me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:49)
>   at
> me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20)
>   at
> me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:85)
>   at
> me.prettyprint.cassandra.model.thrift.ThriftSliceQuery.execute(ThriftSliceQuery.java:48)
>   at
> com.wsgc.services.registry.persistenceservice.impl.cassandra.strategy.read.StandardFindRegistryPersistenceStrategy.findRegistryByProfileId(StandardFindRegistryPersistenceStrategy.java:237)
>   at
> com.wsgc.services.registry.persistenceservice.impl.cassandra.strategy.read.StandardFindRegistryPersistenceStrategy.execute(StandardFindRegistryPersistenceStrategy.java:277)
>   at
> com.wsgc.services.registry.registryservice.impl.service.StandardRegistryService.getRegistriesByProfileId(StandardRegistryService.java:327)
>   at
> com.wsgc.services.registry.webapp.impl.RegistryServicesController.getRegistriesByProfileId(RegistryServicesController.java:247)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at
> org.springframework.web.bind.annotation.support.HandlerMethodInvoker.invokeHandlerMethod(HandlerMethodInvoker.java:175)
>   at
> org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandlerAdapter.invokeHandlerM

Re: cassandra server disk full

2011-08-02 Thread Jim Ancona
On Mon, Aug 1, 2011 at 6:12 PM, Ryan King  wrote:
> On Fri, Jul 29, 2011 at 12:02 PM, Chris Burroughs
>  wrote:
>> On 07/25/2011 01:53 PM, Ryan King wrote:
>>> Actually I was wrong– our patch will disable gossip and thrift but
>>> leave the process running:
>>>
>>> https://issues.apache.org/jira/browse/CASSANDRA-2118
>>>
>>> If people are interested in that I can make sure its up to date with
>>> our latest version.
>>
>> Thanks Ryan.
>>
>> /me expresses interest.

/me too!

>>
>> Zombie nodes when the file system does something "interesting" are not fun.
>
> In our experience this only gets triggered on hardware failures that
> would otherwise seriously degrade the performance or cause lots of
> errors.
>
> After the node's traffic coalesces we get an alert, which we can then deal with.
>
> -ryan
>


Re: Damaged commit log disk causes Cassandra client to get stuck

2011-08-02 Thread Jim Ancona
Sorry to follow-up to my own post but I just saw this issue:
https://issues.apache.org/jira/browse/CASSANDRA-2118 linked in a
neighboring thread (cassandra server disk full). It certainly implies
that a disk IO failure resulting in a "zombie" node is a possibility.

Jim

On Tue, Aug 2, 2011 at 4:19 PM, Jim Ancona  wrote:
> Ideally, I would hope that a bad disk wouldn't hang a node but would
> instead just cause writes to fail, but if that is not the case,
> perhaps the bad disk somehow wedged that server node completely so
> that requests were not being processed at all (maybe not even being
> timed out). At that point you'd be depending on Hector's
> CassandraHostConfigurator.cassandraThriftSocketTimeout to expire,
> which would cause the request to fail over to a working node. But that
> value defaults to zero (i.e. forever), so if you didn't explicitly
> configure it your client would hang along with the server node.
>
> Perhaps someone with more knowledge of Cassandra's internals could
> comment on the possibility of the server hanging completely. I would
> think that the logs from the bad node might help to diagnose that.
>
> Jim
>
> On Sun, Jul 31, 2011 at 4:58 PM, aaron morton  wrote:
>> Yup, it sounds like things may not have failed as they should. Do you have
>> a better definition of stuck? Was the client waiting for a single request
>> to complete, or was the client not cycling to another node?
>> If there are some server log details available, it may help to understand what
>> happened. Also, what setting did you have for commitlog_sync in the yaml? And
>> some info on the failure: did the disk stop dead, run slowly, or fail
>> sometimes, etc.?
>> AFAIK the wait on the writes to return should have timed out on the
>> coordinator. I may be behind on the expected behaviour; perhaps a thread
>> pool was shut down as part of handling the error and this prevents the error
>> from returning.
>> I would check the rpc_timeout in the yaml, and that the client is setting a
>> client-side socket timeout. Timeouts should kick in. Then check the
>> expected behaviour for Hector when it gets a timeout.
>> Cheers
>>
>> -
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> On 1 Aug 2011, at 09:40, Lior Golan wrote:
>>
>> Thanks Aaron. We will try to pull the logs and post them in this forum.
>>
>> But what I don't understand is why the client should pause at all. We are
>> writing with CL.ONE, and the replication factor is 2. As far as we
>> understand – the client communicates with a certain node (any node for that
>> matter) StorageProxy, which then sends write requests to both replicas but
>> waits for just the first one of them to acknowledge the write.
>>
>> So even if one node got stuck because of this commit log disk failure, it
>> should not have stuck the client. Can you explain why that ever happened in
>> the first place?
>>
>> And to add to that – when we took down the Cassandra node with the faulty
>> commit log disk, the client continued to write and didn't seem to bother
>> (which is what we expected to happen in the first place, but didn't).
>>
>> From: aaron morton [mailto:aa...@thelastpickle.com]
>> Sent: Monday, August 01, 2011 12:19 AM
>> To: user@cassandra.apache.org
>> Subject: Re: Damaged commit log disk causes Cassandra client to get stuck
>>
>> A couple of timeouts should have kicked in.
>>
>> First the rpc_timeout on the server side should have kicked in and given the
>> client a (thrift) TimedOutException. Secondly a client side socket timeout
>> should be set so the client will timeout the socket. Did either of these
>> appear in the client side logs?
>>
>> In response to either of those my guess would be that hector would cycle the
>> connection. (I've not checked this.)
>>
>> How did the disk fail? Was there anything in the server logs?
>>
>> Some background about handling disk
>> fails https://issues.apache.org/jira/browse/CASSANDRA-809
>>
>> Cheers
>>
>> -
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 1 Aug 2011, at 08:13, Lior Golan wrote:
>>
>> In one of our test clusters we had a damaged commit log disk in one of the
>> nodes.
>>
>> We have replication factor = 2 in this cluster, and write with consistency
>> level = ONE. So we expected writes will not be affected by such an issue.
>> But what actually happened is that the client that was writing with CL.ONE
>> got stuck. The client could resume writing when we stopped the server with
>> the faulty disk (so this is another indication it's not a replication factor
>> or consistency level issue).
>>
>> We are running Cassandra 0.7.6, and the client we're using is Hector.
>>
>> Can anyone explain what happened here? Why did the client get stuck when the
>> commit log disk on one of the servers was damaged (and writing could resume if
>> we actually took that server offline)?
>>
>


Re: Damaged commit log disk causes Cassandra client to get stuck

2011-08-02 Thread Jim Ancona
Ideally, I would hope that a bad disk wouldn't hang a node but would
instead just cause writes to fail, but if that is not the case,
perhaps the bad disk somehow wedged that server node completely so
that requests were not being processed at all (maybe not even being
timed out). At that point you'd be depending on Hector's
CassandraHostConfigurator.cassandraThriftSocketTimeout to expire,
which would cause the request to fail over to a working node. But that
value defaults to zero (i.e. forever), so if you didn't explicitly
configure it your client would hang along with the server node.

Perhaps someone with more knowledge of Cassandra's internals could
comment on the possibility of the server hanging completely. I would
think that the logs from the bad node might help to diagnose that.

Jim
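
A minimal sketch of setting the client-side timeout described above so it does not stay at its default of zero (host list, timeout value and cluster name are illustrative):

import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.factory.HFactory;

public class HectorFailoverTimeout {
    public static Cluster connect() {
        CassandraHostConfigurator hosts =
                new CassandraHostConfigurator("host1:9160,host2:9160");
        // Defaults to 0, i.e. wait forever on the Thrift socket; with a finite
        // value a wedged node times out client-side and the request can fail
        // over to another node instead of hanging with it.
        hosts.setCassandraThriftSocketTimeout(10000);
        return HFactory.getOrCreateCluster("TestCluster", hosts);
    }
}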

On Sun, Jul 31, 2011 at 4:58 PM, aaron morton  wrote:
> Yup, it sounds like things may not have failed as they should. Do you have
> a better definition of stuck? Was the client waiting for a single request
> to complete, or was the client not cycling to another node?
> If there are some server log details available, it may help to understand what
> happened. Also, what setting did you have for commitlog_sync in the yaml? And
> some info on the failure: did the disk stop dead, run slowly, or fail
> sometimes, etc.?
> AFAIK the wait on the writes to return should have timed out on the
> coordinator. I may be behind on the expected behaviour; perhaps a thread
> pool was shut down as part of handling the error and this prevents the error
> from returning.
> I would check the rpc_timeout in the yaml, and that the client is setting a
> client-side socket timeout. Timeouts should kick in. Then check the
> expected behaviour for Hector when it gets a timeout.
> Cheers
>
> -
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
> On 1 Aug 2011, at 09:40, Lior Golan wrote:
>
> Thanks Aaron. We will try to pull the logs and post them in this forum.
>
> But what I don't understand is why the client should pause at all. We are
> writing with CL.ONE, and the replication factor is 2. As far as we
> understand – the client communicates with a certain node (any node for that
> matter) StorageProxy, which then sends write requests to both replicas but
> waits for just the first one of them to acknowledge the write.
>
> So even if one node got stuck because of this commit log disk failure, it
> should not have stuck the client. Can you explain why that ever happened in
> the first place?
>
> And to add to that – when we took down the Cassandra node with the faulty
> commit log disk, the client continued to write and didn't seem to bother
> (which is what we expected to happen in the first place, but didn't).
>
> From: aaron morton [mailto:aa...@thelastpickle.com]
> Sent: Monday, August 01, 2011 12:19 AM
> To: user@cassandra.apache.org
> Subject: Re: Damaged commit log disk causes Cassandra client to get stuck
>
> A couple of timeouts should have kicked in.
>
> First the rpc_timeout on the server side should have kicked in and given the
> client a (thrift) TimedOutException. Secondly a client side socket timeout
> should be set so the client will timeout the socket. Did either of these
> appear in the client side logs?
>
> In response to either of those my guess would be that hector would cycle the
> connection. (I've not checked this.)
>
> How did the disk fail? Was there anything in the server logs?
>
> Some background about handling disk
> fails https://issues.apache.org/jira/browse/CASSANDRA-809
>
> Cheers
>
> -
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 1 Aug 2011, at 08:13, Lior Golan wrote:
>
> In one of our test clusters we had a damaged commit log disk in one of the
> nodes.
>
> We have replication factor = 2 in this cluster, and write with consistency
> level = ONE. So we expected writes will not be affected by such an issue.
> But what actually happened is that the client that was writing with CL.ONE
> got stuck. The client could resume writing when we stopped the server with
> the faulty disk (so this is another indication it's not a replication factor
> or consistency level issue).
>
> We are running Cassandra 0.7.6, and the client we're using is Hector.
>
> Can anyone explain what happened here? Why did the client get stuck when the
> commit log disk on one of the servers was damaged (and writing could resume if
> we actually took that server offline)?
>


Re: do I need to add more nodes? minor compaction eat all IO

2011-07-26 Thread Jim Ancona
On Mon, Jul 25, 2011 at 6:41 PM, aaron morton  wrote:
> There are no hard and fast rules to add new nodes, but here are two 
> guidelines:
>
> 1) Single node load is getting too high; the rule of thumb is that 300GB is
> probably too high.

What is that rule of thumb based on? I would guess that working set
size would matter more than absolute size. Why isn't that the case?

Jim


Cassandra client loses connectivity to cluster

2011-06-29 Thread Jim Ancona
In reviewing client logs as part of our Cassandra testing, I noticed
several Hector "All host pools marked down" exceptions in the logs.
Further investigation showed a consistent pattern of
"java.net.SocketException: Broken pipe" and "java.net.SocketException:
Connection reset" messages. These errors occur for all 36 hosts in the
cluster over a period of seconds, as Hector tries to find a working
host to connect to. Failing to find a host results in the "All host
pools marked down" messages. These messages recur for a period ranging
from several seconds up to almost 15 minutes, clustering around two to
three minutes. Then connectivity returns and when Hector tries to
reconnect it succeeds.

The clients are instances of a JBoss 5 web application. We use Hector
0.7.0-29 (plus a patch that was pulled in advance of -30) The
Cassandra cluster has 72 nodes split between two datacenters. It's
running 0.7.5 plus a couple of bug fixes pulled in advance of 0.7.6.
The keyspace uses NetworkTopologyStrategy and RF=6 (3 in each
datacenter). The clients are reading and writing at LOCAL_QUORUM to
the 36 nodes in their own data center. Right now the second datacenter
is for failover only, so there are no clients actually writing there.

There's nothing else obvious in the JBoss logs at around the same
time, e.g. other application errors, GC events. The Cassandra
system.log files at INFO level shows nothing out of the ordinary. I
have a capture of one of the incidents at DEBUG level where again I
see nothing abnormal looking, but there's so much data that it would
be easy to miss something.

Other observations:
* It only happens on weekdays (our weekend load is much lower)
* It has occurred every weekday for the last month except for Monday
May 30, the Memorial Day holiday in the US.
* Most days it occurs only once, but six times it has occurred twice,
never more often than that.
* It generally happens in the late afternoon, but there have been
occurrences earlier in the afternoon and twice in the late morning.
Earliest occurrence is 11:19 am, latest is 6:11 pm. Our peak loads
are between 10:00 and 14:00, so most occurrences do *not* correspond
with peak load times.
* It only happens on a single client JBoss instance at a time.
* Generally, it affects a different host each day, but the same host
was affected on consecutive days once.
* Out of 40 clients, one has been affected three times, seven have
been affected twice, 11 have been affected once and 21 have not been
affected.
* The cluster is lightly loaded.

Given that the problem affects a single client machine at a time and
that machine loses the ability to connect to the entire cluster, it
seems unlikely that the problem is on the C* server side. Even a
network problem seems hard to explain: given that the clients are on
the same subnet, I would expect all of them to fail if it were a
network issue.

I'm hoping that perhaps someone has seen a similar issue or can
suggest things to try.

Thanks in advance for any help!

Jim


Re: UnsupportedOperationException: Index manager cannot support deleting and inserting into a row in the same mutation

2011-06-23 Thread Jim Ancona
I've reopened the issue. On our 0.7.6-2 cluster, system.log is filling
with repeated instances of the UnsupportedOperationException. When
we've attempted to restart a node, the restart fails with the same
exception. Luckily we found this as part of our pre-deploy testing of
0.7.6, not in production, but this is not "mostly a non-problem" here.

Jim

On Thu, Jun 23, 2011 at 3:25 PM, Jonathan Ellis  wrote:
> The patch probably applies as-is but I don't want to take any risks
> with 0.7 to solve what is mostly a non-problem.
>
> On Thu, Jun 23, 2011 at 2:16 PM, Jim Ancona  wrote:
>> Is there any reason this fix can't be back-ported to 0.7?
>>
>> Jim
>>
>> On Thu, Jun 23, 2011 at 3:00 PM, Jonathan Ellis  wrote:
>>> Sorry, 0.8.2 is correct.
>>>
>>> On Thu, Jun 23, 2011 at 1:36 PM, Les Hazlewood  wrote:
>>>> The issue has the fix version as 0.8.2, not 0.7.7.  Is that incorrect?
>>>> Cheers,
>>>> Les
>>>>
>>>
>>>
>>>
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of DataStax, the source for professional Cassandra support
>>> http://www.datastax.com
>>>
>>
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>


Re: UnsupportedOperationException: Index manager cannot support deleting and inserting into a row in the same mutation

2011-06-23 Thread Jim Ancona
Is there any reason this fix can't be back-ported to 0.7?

Jim

On Thu, Jun 23, 2011 at 3:00 PM, Jonathan Ellis  wrote:
> Sorry, 0.8.2 is correct.
>
> On Thu, Jun 23, 2011 at 1:36 PM, Les Hazlewood  wrote:
>> The issue has the fix version as 0.8.2, not 0.7.7.  Is that incorrect?
>> Cheers,
>> Les
>>
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>


UnsupportedOperationException: Index manager cannot support deleting and inserting into a row in the same mutation

2011-06-23 Thread Jim Ancona
Since upgrading to 0.7.6-2 I'm seeing the following exception in our
server logs:

ERROR [MutationStage:1184874] 2011-06-22 23:59:43,867
AbstractCassandraDaemon.java (line 114) Fatal exception in thread
Thread[MutationStage:1184874,5,main]
java.lang.UnsupportedOperationException: Index manager cannot support
deleting and inserting into a row in the same mutation
   at org.apache.cassandra.db.Table.ignoreObsoleteMutations(Table.java:431)
   at org.apache.cassandra.db.Table.apply(Table.java:387)
   at 
org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:75)
   at 
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:72)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:619)

I see that the code that throws this exception was added as part of
the fix for https://issues.apache.org/jira/browse/CASSANDRA-2401

We do have clients that do deletes and inserts in the same mutation
(probably unnecessarily), but is this really something that is not
supported? Or is there some other issue behind this?

Thanks,

Jim
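
For context, the pattern being described (a deletion and an insertion for the same row sent in one mutation) looks roughly like this with Hector; the column family, column names and key are illustrative:

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class SameRowDeleteAndInsert {
    public static void rewriteRow(Keyspace keyspace, String rowKey) {
        StringSerializer ss = StringSerializer.get();
        Mutator<String> mutator = HFactory.createMutator(keyspace, ss);
        // Row-level deletion (a null column name deletes the whole row) plus a
        // fresh insert for the same key, both sent in one batch_mutate. On a
        // column family with secondary indexes, the check added by
        // CASSANDRA-2401 rejects this combination in 0.7.6-2 with the
        // UnsupportedOperationException quoted above.
        mutator.addDeletion(rowKey, "Users", null, ss);
        mutator.addInsertion(rowKey, "Users",
                HFactory.createStringColumn("email", "new@example.com"));
        mutator.execute();
    }
}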


Internal error when using SimpleSnitch and dynamic_snitch: true

2011-01-17 Thread Jim Ancona
We accidentally configured our cluster with SimpleSnitch (instead of
PropertyFileSnitch) and dynamic_snitch: true. This is with version
0.7.0.

We saw the errors below on get_slice and batch_mutate calls. The
errors went away when we switched to PropertyFileSnitch.

Should dynamic_snitch work with SimpleSnitch? Should I open a Jira issue?

Jim

ERROR [pool-1-thread-55] 2011-01-14 15:53:45,998 Cassandra.java (line
2707) Internal error processing get_slice
java.lang.UnsupportedOperationException
at 
org.apache.cassandra.locator.SimpleSnitch.getDatacenter(SimpleSnitch.java:40)
at 
org.apache.cassandra.locator.DynamicEndpointSnitch.getDatacenter(DynamicEndpointSnitch.java:94)
at 
org.apache.cassandra.locator.NetworkTopologyStrategy.calculateNaturalEndpoints(NetworkTopologyStrategy.java:87)
at 
org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:99)
at 
org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:1354)
at 
org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:1337)
at 
org.apache.cassandra.service.StorageService.findSuitableEndpoint(StorageService.java:1388)
at 
org.apache.cassandra.service.StorageProxy.weakRead(StorageProxy.java:248)
at 
org.apache.cassandra.service.StorageProxy.readProtocol(StorageProxy.java:224)
at 
org.apache.cassandra.thrift.CassandraServer.readColumnFamily(CassandraServer.java:98)
at 
org.apache.cassandra.thrift.CassandraServer.getSlice(CassandraServer.java:195)
at 
org.apache.cassandra.thrift.CassandraServer.multigetSliceInternal(CassandraServer.java:271)
at 
org.apache.cassandra.thrift.CassandraServer.get_slice(CassandraServer.java:233)
at 
org.apache.cassandra.thrift.Cassandra$Processor$get_slice.process(Cassandra.java:2699)
at 
org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2555)
at 
org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)

ERROR [pool-1-thread-58] 2011-01-12 12:18:58,721 Cassandra.java (line
3044) Internal error processing batch_mutate
java.lang.UnsupportedOperationException
at 
org.apache.cassandra.locator.SimpleSnitch.getDatacenter(SimpleSnitch.java:40)
at 
org.apache.cassandra.locator.DynamicEndpointSnitch.getDatacenter(DynamicEndpointSnitch.java:94)
at 
org.apache.cassandra.locator.NetworkTopologyStrategy.calculateNaturalEndpoints(NetworkTopologyStrategy.java:87)
at 
org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:99)
at 
org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:1354)
at 
org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:1337)
at 
org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:109)
at 
org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:412)
at 
org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:385)
at 
org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3036)
at 
org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2555)
at 
org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)


Re: Could Not connect to cassandra-cli on windows

2010-11-09 Thread Jim Ancona
On Mon, Nov 8, 2010 at 8:31 PM, Alaa Zubaidi  wrote:
> Hi,
> Failing to connect to cassandra client: on windows
>
> [defa...@unknown] connect localhost/9160
> Exception connecting to localhost/9160. Reason: Connection refused: connect.
>
> [defa...@unknown] connect xxx.xxx.x.xx/9160
> Syntax error at position 0: no viable alternative at input 'connect'
>
> [defa...@unknown] connect xxx.xxx.x.xx\9160
> Syntax error at position 20: no viable alternative at character '\'
>
> Any idea how to connect?

In the first example, are you sure Cassandra was running on port 9160
on your local machine?

You could also try
cassandra-cli -host hostname -port port
e.g.
cassandra-cli -host localhost -port 9160

Jim


Re: Any plans to support key metadata?

2010-10-29 Thread Jim Ancona
On Fri, Oct 29, 2010 at 10:07 AM, Jim Ancona  wrote:
> In 0.7, Cassandra now supports column metadata
> CfDef.default_validation_class and ColumnDef.validation_class. Is
> there any plan to provide similar metadata for keys, at the key space
> or column family level?

Sorry to respond to my own email, but in re-reading it I realized that
I meant to say "type metadata", not just "metadata". I'm interested in
optionally being able to store a key type in the same way we now have
column types.

>
> Jim
>


Any plans to support key metadata?

2010-10-29 Thread Jim Ancona
In 0.7, Cassandra now supports column metadata
CfDef.default_validation_class and ColumnDef.validation_class. Is
there any plan to provide similar metadata for keys, at the key space
or column family level?

Jim