Re: Welcome Dinesh Joshi as Cassandra PMC member

2021-06-04 Thread Dikang Gu
Congrats Dinesh!

On Thu, Jun 3, 2021 at 3:59 PM Patrick McFadin  wrote:

> This is great. Congratulations Dinesh!
>
> On Thu, Jun 3, 2021 at 11:51 AM Jordan West  wrote:
>
> > Congratulations Dinesh!
> >
> > Jordan
> >
> > On Thu, Jun 3, 2021 at 1:40 AM Mick Semb Wever  wrote:
> >
> > > Congrats Dinesh. Thanks for all the help given and offered whenever it
> is
> > > needed!
> > >
> > > On Wed, 2 Jun 2021 at 18:16, Benjamin Lerer  wrote:
> > >
> > > >  The PMC's members are pleased to announce that Dinesh Joshi has
> > accepted
> > > > the invitation to become a PMC member.
> > > >
> > > > Thanks a lot, Dinesh, for everything you have done for the project
> all
> > > > these years.
> > > >
> > > > Congratulations and welcome
> > > >
> > > > The Apache Cassandra PMC members
> > > >
> > >
> >
>


-- 
Dikang


Re: Welcome Yifan Cai as Cassandra committer

2020-12-22 Thread Dikang Gu
Congrats Yifan!

Dikang

On Mon, Dec 21, 2020 at 9:11 AM Benjamin Lerer 
wrote:

>  The PMC's members are pleased to announce that Yifan Cai has accepted the
> invitation to become committer last Friday.
>
> Thanks a lot, Yifan,  for everything you have done!
>
> Congratulations and welcome
>
> The Apache Cassandra PMC members
>


-- 
Dikang


Re: Apache Cassandra meetup @ Instagram HQ

2019-02-22 Thread Dikang Gu
Yeah, we have recorded the meetup and will publish it online in the next
couple of weeks.

Thanks
Dikang

On Fri, Feb 22, 2019 at 3:35 PM Sundaramoorthy, Natarajan <
natarajan_sundaramoor...@optum.com> wrote:

> Dinesh - Please let us know if they were able to record and post it
> online? Thanks
>
>
>
> -Original Message-
> From: Ahmed Eljami [mailto:ahmed.elj...@gmail.com]
> Sent: Friday, February 22, 2019 10:04 AM
> To: dev@cassandra.apache.org; dinesh.jo...@yahoo.com
> Subject: Re: Apache Cassandra meetup @ Instagram HQ
>
> Great !
> Thanks Dinesh
>
> Le mer. 20 févr. 2019 à 19:57, dinesh.jo...@yahoo.com.INVALID
>  a écrit :
>
> > Just heard back. Unfortunately they cannot live stream but they'll try to
> > record it and post it online.
> > Dinesh
> >
> > On Tuesday, February 19, 2019, 3:55:15 PM PST, Sundaramoorthy,
> > Natarajan  wrote:
> >
> >  Dinesh Thank you.
> >
> >
> >
> > -Original Message-
> > From: andre wilkinson [mailto:antonio...@me.com.INVALID]
> > Sent: Tuesday, February 19, 2019 5:14 PM
> > To: dev@cassandra.apache.org
> > Subject: Re: Apache Cassandra meetup @ Instagram HQ
> >
> > Thank you for reaching out
> >
> > Sent from my iPhone
> >
> > > On Feb 19, 2019, at 8:45 AM, Dinesh Joshi  .invalid>
> > wrote:
> > >
> > > I’ve emailed the organizers. They didn’t intend to live stream it but
> > will get back to us if they can.
> > >
> > > Cheers,
> > >
> > > Dinesh
> > >
> > >> On Feb 19, 2019, at 1:11 AM, andre wilkinson
> 
> > wrote:
> > >>
> > >> Yes is there anyway to attend remotely?
> > >>
> > >> Sent from my iPhone
> > >>
> > >>> On Feb 18, 2019, at 6:40 PM, Shaurya Gupta 
> > wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>> This looks very interesting to me. Can I attend this remotely?
> > >>>
> > >>> Thanks
> > >>> Shaurya
> > >>>
> > >>>
> > >>> On Tue, Feb 19, 2019 at 5:37 AM dinesh.jo...@yahoo.com.INVALID
> > >>>  wrote:
> > >>>
> >  Hi all,
> > 
> >  Apologies for the cross-post. In case you're in the SF Bay Area,
> > Instagram
> >  is hosting a meetup. Interesting talks on Cassandra Traffic
> > management,
> >  Cassandra on Kubernetes. See details in the attached link -
> > 
> > 
> > 
> >
> https://www.eventbrite.com/e/cassandra-traffic-management-at-instagram-cassandra-and-k8s-with-instaclustr-tickets-54986803008
> > 
> >  Thanks,
> > 
> >  Dinesh
> > 
> > >>>
> > >>>
> > >>> --
> > >>> Shaurya Gupta
> > >>
> > >>
> > >> -
> > >> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> > >> For additional commands, e-mail: dev-h...@cassandra.apache.org
> > >>
> > >
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> > > For additional commands, e-mail: dev-h...@cassandra.apache.org
> > >
> >
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: dev-h...@cassandra.apache.org
> >
> >
>
>
>
> --
> Cordialement;
>
> Ahmed ELJAMI
>
>


-- 
Dikang


Re: CASSANDRA-14482

2019-02-15 Thread Dikang Gu
+1

On Fri, Feb 15, 2019 at 10:27 AM Vinay Chella 
wrote:

> We have been using Zstd compressor across different products/services here
> and have seen significant improvements, getting this in 4.0 would be a big
> win.
>
> +1
>
> Thanks,
> Vinay Chella
>
>
> On Fri, Feb 15, 2019 at 10:19 AM Jeff Jirsa  wrote:
>
> > +1
> >
> > --
> > Jeff Jirsa
> >
> >
> > > On Feb 15, 2019, at 9:35 AM, Jonathan Ellis  wrote:
> > >
> > > IMO "add a new compression class that has demonstrable benefits to
> Sushma
> > > and Joseph" is sufficiently noninvasive that we should allow it into
> 4.0.
> > >
> > > On Fri, Feb 15, 2019 at 10:48 AM Dinesh Joshi
> > >  wrote:
> > >
> > >> Hey folks,
> > >>
> > >> Just wanted to get a pulse on whether we can proceed with ZStd
> support.
> > >> The consensus on the ticket was that it’s a very valuable addition
> > without
> > >> any risk of destabilizing 4.0. It’s ready to go if there aren’t any
> > >> objections.
> > >>
> > >> Dinesh
> > >>
> > >> -
> > >> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> > >> For additional commands, e-mail: dev-h...@cassandra.apache.org
> > >>
> > >>
> > >
> > > --
> > > Jonathan Ellis
> > > co-founder, http://www.datastax.com
> > > @spyced
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: dev-h...@cassandra.apache.org
> >
> >
>


-- 
Dikang


Re: [DISCUSS] changing default token behavior for 4.0

2018-09-21 Thread Dikang Gu
We are using 8 or 16 tokens internally, with the token allocation algorithm
enabled. The range distribution is good for us.

Dikang.

On Fri, Sep 21, 2018 at 9:30 PM Dinesh Joshi 
wrote:

> Jon, thanks for starting this thread!
>
> I have created CASSANDRA-14784 to track this.
>
> Dinesh
>
> > On Sep 21, 2018, at 9:18 PM, Sankalp Kohli 
> wrote:
> >
> > Putting it on JIRA is to make sure someone is assigned to it and it is
> tracked. Changes should be discussed over ML like you are saying.
> >
> > On Sep 21, 2018, at 21:02, Jonathan Haddad  wrote:
> >
> >>> We should create a JIRA to find what other defaults we need revisit.
> >>
> >> Changing a default is a pretty big deal, I think we should discuss any
> >> changes to defaults here on the ML before moving it into JIRA.  It's
> nice
> >> to get a bit more discussion around the change than what happens in
> JIRA.
> >>
> >> We (TLP) did some testing on 4 tokens and found it to work surprisingly
> >> well.   It wasn't particularly formal, but we verified the load stays
> >> pretty even with only 4 tokens as we added nodes to the cluster.  Higher
> >> token count hurts availability by increasing the number of nodes any
> given
> >> node is a neighbor with, meaning any 2 nodes that fail have an increased
> >> chance of downtime when using QUORUM.  In addition, with the recent
> >> streaming optimization it seems the token counts will give a greater
> chance
> >> of a node streaming entire sstables (with LCS), meaning we'll do a
> better
> >> job with node density out of the box.
> >>
> >> Next week I can try to put together something a little more convincing.
> >> Weekend time.
> >>
> >> Jon
> >>
> >>
> >> On Fri, Sep 21, 2018 at 8:45 PM sankalp kohli 
> >> wrote:
> >>
> >>> +1 to lowering it.
> >>> Thanks Jon for starting this.We should create a JIRA to find what other
> >>> defaults we need revisit. (Please keep this discussion for "default
> token"
> >>> only.  )
> >>>
>  On Fri, Sep 21, 2018 at 8:26 PM Jeff Jirsa  wrote:
> 
>  Also agree it should be lowered, but definitely not to 1, and probably
>  something closer to 32 than 4.
> 
>  --
>  Jeff Jirsa
> 
> 
> > On Sep 21, 2018, at 8:24 PM, Jeremy Hanna <
> jeremy.hanna1...@gmail.com>
>  wrote:
> >
> > I agree that it should be lowered. What I’ve seen debated a bit in
> the
>  past is the number but I don’t think anyone thinks that it should
> remain
>  256.
> >
> >> On Sep 21, 2018, at 7:05 PM, Jonathan Haddad 
> >>> wrote:
> >>
> >> One thing that's really, really bothered me for a while is how we
>  default
> >> to 256 tokens still.  There's no experienced operator that leaves it
> >>> as
>  is
> >> at this point, meaning the only people using 256 are the poor folks
> >>> that
> >> just got started using C*.  I've worked with over a hundred clusters
> >>> in
>  the
> >> last couple years, and I think I only worked with one that had
> lowered
>  it
> >> to something else.
> >>
> >> I think it's time we changed the default to 4 (or 8, up for debate).
> >>
> >> To improve the behavior, we need to change a couple other things.
> The
> >> allocate_tokens_for_keyspace setting is... odd.  It requires you
> have
> >>> a
> >> keyspace already created, which doesn't help on new clusters.  What
> >>> I'd
> >> like to do is add a new setting, allocate_tokens_for_rf, and set it
> to
>  3 by
> >> default.
> >>
> >> To handle clusters that are already using 256 tokens, we could
> prevent
>  the
> >> new node from joining unless a -D flag is set to explicitly allow
> >> imbalanced tokens.
> >>
> >> We've agreed to a trunk freeze, but I feel like this is important
> >>> enough
> >> (and pretty trivial) to do now.  I'd also personally characterize
> this
>  as a
> >> bug fix since 256 is horribly broken when the cluster gets to any
> >> reasonable size, but maybe I'm alone there.
> >>
> >> I honestly can't think of a use case where random tokens is a good
>  choice
> >> anymore, so I'd be fine / ecstatic with removing it completely and
> >> requiring either allocate_tokens_for_keyspace (for existing
> clusters)
> >> or allocate_tokens_for_rf
> >> to be set.
> >>
> >> Thoughts?  Objections?
> >> --
> >> Jon Haddad
> >> http://www.rustyrazorblade.com
> >> twitter: rustyrazorblade
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: dev-h...@cassandra.apache.org
> >
> 
>  -
>  To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>  For additional commands, e-mail: dev-h...@cassandra.apache.org
> 
> 
> >>>
> >>
> >>
> >> --
> >> Jon Haddad
> >

Re: Apache Cassandra Blog is now live

2018-08-07 Thread Dikang Gu
The fast streaming is very cool!

We are also interested in contributing to the blog; what's the process?

Thanks
Dikang.

On Tue, Aug 7, 2018 at 7:01 PM Nate McCall  wrote:

> You can tell how psyched we are about it because we cross posted!
>
> Seriously though - this is by the community for the community, so any
> ideas - please send them along.
>
> On Wed, Aug 8, 2018 at 1:53 PM, sankalp kohli 
> wrote:
> > Hi,
> >  Apache Cassandra Blog is now live. Check out the first blog post.
> >
> >
> http://cassandra.apache.org/blog/2018/08/07/faster_streaming_in_cassandra.html
> >
> > Thanks,
> > Sankalp
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>

-- 
Dikang


Re: NGCC 2018?

2018-07-27 Thread Dikang Gu
We are interested in a Bay Area C* developers event as well.

On Thu, Jul 26, 2018 at 10:42 PM Jeff Jirsa  wrote:

> Bay area event is interesting to me, in any format.
>
>
> On Thu, Jul 26, 2018 at 9:03 PM, Ben Bromhead  wrote:
>
> > It sounds like there may be an appetite for something, but the NGCC in
> its
> > current format is likely to not be that useful?
> >
> > Is a bay area event focused on C* developers something that is
> interesting
> > for the broader dev community? In whatever format that may be?
> >
> > On Tue, Jul 24, 2018 at 5:02 PM Nate McCall  wrote:
> >
> > > This was discussed amongst the PMC recently. We did not come to a
> > > conclusion and there were not terribly strong feelings either way.
> > >
> > > I don't feel like we need to hustle to get "NGCC" in place,
> > > particularly given our decided focus on 4.0. However, that should not
> > > stop us from doing an additional 'c* developer' event in sept. to
> > > coincide with distributed data summit.
> > >
> > > On Wed, Jul 25, 2018 at 5:03 AM, Patrick McFadin 
> > > wrote:
> > > > Ben,
> > > >
> > > > Lynn Bender had offered a space the day before Distributed Data
> Summit
> > in
> > > > September (http://distributeddatasummit.com/) since we are both
> > platinum
> > > > sponsors. I thought he and Nate had talked about that being a good
> > place
> > > > for NGCC since many of us will be in town already.
> > > >
> > > > Nate, now that I've spoken for you, you can clarify, :D
> > > >
> > > > Patrick
> > > >
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> > > For additional commands, e-mail: dev-h...@cassandra.apache.org
> > >
> > > --
> > Ben Bromhead
> > CTO | Instaclustr 
> > +1 650 284 9692
> > Reliability at Scale
> > Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer
> >
>


-- 
Dikang


Re: Improve the performance of CAS

2018-05-16 Thread Dikang Gu
@Jason, pinged Sylvain on the jira.

@Jeremiah,
In the contention case, if we combine the prepare and quorum read together, we
will retry the Prepare phase, which may trigger the read on different
replicas again; that is an overhead. We can improve it by avoiding the read
if the replica has already promised a ballot greater than the prepared one
(see the sketch below).
In the commit failure case, each replica should already have the
PartitionUpdate stored in the system table after the Propose phase. Then a
following readWithPaxos or cas operation can repair the in-progress paxos
state and commit the data.
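
To make the first point concrete, here is a rough sketch of the replica-side
handling I have in mind. All of the types below are hypothetical stand-ins,
not Cassandra's actual Paxos classes:

import java.util.UUID;

// Rough sketch only: the replica skips the local read when it has already
// promised a ballot >= the incoming one, since the coordinator will have to
// retry with a higher ballot anyway.
class CombinedPrepareHandler {
    private UUID promisedBallot;        // highest ballot promised so far (null if none)
    private final LocalStore store;     // hypothetical accessor for the replica's current value

    CombinedPrepareHandler(LocalStore store) { this.store = store; }

    synchronized PrepareReadResponse handle(UUID ballot, String key) {
        if (promisedBallot != null && promisedBallot.compareTo(ballot) >= 0) {
            // Already promised an equal or higher ballot: reject and skip the read.
            return PrepareReadResponse.rejected(promisedBallot);
        }
        promisedBallot = ballot;
        // Promise, and piggyback the current value so the coordinator avoids a separate read round.
        return PrepareReadResponse.promised(ballot, store.read(key));
    }
}

interface LocalStore { String read(String key); }

record PrepareReadResponse(boolean promised, UUID ballot, String currentValue) {
    static PrepareReadResponse promised(UUID b, String v) { return new PrepareReadResponse(true, b, v); }
    static PrepareReadResponse rejected(UUID b) { return new PrepareReadResponse(false, b, null); }
}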

Thanks
Dikang.

On Wed, May 16, 2018 at 3:17 PM, J. D. Jordan 
wrote:

> I have not reasoned through this completely, but something I would want to
> see before messing with this is how changing the number of rounds behaves
> under contention and failure scenarios. Also how ignoring commit success
> behaves in those scenarios especially under contention and with respect to
> obeying CL semantics.
>
> -Jeremiah
>
> > On May 16, 2018, at 6:05 PM, Jason Brown  wrote:
> >
> > Hey all,
> >
> > Before we go bananas, let's see if Sylvain, the primary author of the
> > original patch, has the opportunity to chime with some explanatory notes
> or
> > other guidance. There may be some subtle points or considerations that
> are
> > not obvious, and I'd hate to lose that context.
> >
> > Thanks,
> >
> > -Jason
> >
> >> On Wed, May 16, 2018 at 2:57 PM, Ariel Weisberg 
> wrote:
> >>
> >> Hi,
> >>
> >> I think you are looking at the right low hanging fruit.  Cassandra
> >> deserves a better consensus protocol, but it's a very big project.
> >>
> >> Regards,
> >> Ariel
> >>> On Wed, May 16, 2018, at 5:51 PM, Dikang Gu wrote:
> >>> Cool, create a jira for it,
> >>> https://issues.apache.org/jira/browse/CASSANDRA-14448. I have a draft
> >> patch
> >>> working internally, will clean it up.
> >>>
> >>> The EPaxos is more complicated, could be a long term effort.
> >>>
> >>> Thanks
> >>> Dikang.
> >>>
> >>> On Wed, May 16, 2018 at 2:20 PM, sankalp kohli  >
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>The idea of combining read with prepare sounds good. Regarding
> >> reducing
> >>>> the commit round trip, it is possible today by giving a lower
> >> consistency
> >>>> level for commit I think.
> >>>>
> >>>> Regarding EPaxos, it is a large change and will take longer to land. I
> >>>> think we should do this as it will help lower the latencies a lot.
> >>>>
> >>>> Thanks,
> >>>> Sankalp
> >>>>
> >>>> On Wed, May 16, 2018 at 2:15 PM, Jeremy Hanna <
> >> jeremy.hanna1...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi Dikang,
> >>>>>
> >>>>> Have you seen Blake’s work on implementing egalitarian paxos or
> >> epaxos*?
> >>>>> That might be helpful for the discussion.
> >>>>>
> >>>>> Jeremy
> >>>>>
> >>>>> * https://issues.apache.org/jira/browse/CASSANDRA-6246
> >>>>>
> >>>>>> On May 16, 2018, at 3:37 PM, Dikang Gu  wrote:
> >>>>>>
> >>>>>> Hello C* developers,
> >>>>>>
> >>>>>> I'm working on some performance improvements of the lightweight
> >>>>> transitions
> >>>>>> (compare and set), I'd like to hear your thoughts about it.
> >>>>>>
> >>>>>> As you know, current CAS requires 4 round trips to finish, which
> >> is not
> >>>>>> efficient, especially in cross DC case.
> >>>>>> 1) Prepare
> >>>>>> 2) Quorum read current value
> >>>>>> 3) Propose new value
> >>>>>> 4) Commit
> >>>>>>
> >>>>>> I'm proposing the following improvements to reduce it to 2 round
> >> trips,
> >>>>>> which is:
> >>>>>> 1) Combine prepare and quorum read together, use only one round
> >> trip to
> >>>>>> decide the ballot and also piggyback the current value in response.
> >>>>>> 2) Propose new value, and then send out the commit request
> >>>

Re: Improve the performance of CAS

2018-05-16 Thread Dikang Gu
Cool, created a jira for it:
https://issues.apache.org/jira/browse/CASSANDRA-14448. I have a draft patch
working internally; I will clean it up.

EPaxos is more complicated and could be a long-term effort.

Thanks
Dikang.

On Wed, May 16, 2018 at 2:20 PM, sankalp kohli 
wrote:

> Hi,
> The idea of combining read with prepare sounds good. Regarding reducing
> the commit round trip, it is possible today by giving a lower consistency
> level for commit I think.
>
> Regarding EPaxos, it is a large change and will take longer to land. I
> think we should do this as it will help lower the latencies a lot.
>
> Thanks,
> Sankalp
>
> On Wed, May 16, 2018 at 2:15 PM, Jeremy Hanna 
> wrote:
>
> > Hi Dikang,
> >
> > Have you seen Blake’s work on implementing egalitarian paxos or epaxos*?
> > That might be helpful for the discussion.
> >
> > Jeremy
> >
> > * https://issues.apache.org/jira/browse/CASSANDRA-6246
> >
> > > On May 16, 2018, at 3:37 PM, Dikang Gu  wrote:
> > >
> > > Hello C* developers,
> > >
> > > I'm working on some performance improvements of the lightweight
> > transitions
> > > (compare and set), I'd like to hear your thoughts about it.
> > >
> > > As you know, current CAS requires 4 round trips to finish, which is not
> > > efficient, especially in cross DC case.
> > > 1) Prepare
> > > 2) Quorum read current value
> > > 3) Propose new value
> > > 4) Commit
> > >
> > > I'm proposing the following improvements to reduce it to 2 round trips,
> > > which is:
> > > 1) Combine prepare and quorum read together, use only one round trip to
> > > decide the ballot and also piggyback the current value in response.
> > > 2) Propose new value, and then send out the commit request
> > asynchronously,
> > > so client will not wait for the ack of the commit. In case of commit
> > > failures, we should still have chance to retry/repair it through hints
> or
> > > following read/cas events.
> > >
> > > After the improvement, we should be able to finish the CAS operation
> > using
> > > 2 rounds trips. There can be following improvements as well, and this
> can
> > > be a start point.
> > >
> > > What do you think? Did I miss anything?
> > >
> > > Thanks
> > > Dikang
> >
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: dev-h...@cassandra.apache.org
> >
> >
>



-- 
Dikang


Improve the performance of CAS

2018-05-16 Thread Dikang Gu
Hello C* developers,

I'm working on some performance improvements for lightweight transactions
(compare and set), and I'd like to hear your thoughts about it.

As you know, CAS currently requires 4 round trips to finish, which is not
efficient, especially in the cross-DC case.
1) Prepare
2) Quorum read current value
3) Propose new value
4) Commit

I'm proposing the following improvements to reduce it to 2 round trips:
1) Combine the prepare and quorum read together, using only one round trip to
decide the ballot and also piggyback the current value in the response.
2) Propose the new value, and then send out the commit request asynchronously,
so the client will not wait for the ack of the commit. In case of commit
failures, we still have a chance to retry/repair it through hints or
following read/cas events.

After the improvement, we should be able to finish the CAS operation using
2 round trips. There can be further improvements as well, and this can be a
starting point; a rough sketch of the flow follows.
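
Here is a rough sketch of the coordinator-side flow. The types are
hypothetical, and the real Paxos details (completing in-progress proposals,
picking the value from the highest ballot, hints) are omitted; it only shows
where the two round trips go:

import java.util.List;
import java.util.Objects;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;

interface Replica {
    PrepareAndReadResponse prepareAndRead(UUID ballot, String key); // round trip 1
    boolean propose(UUID ballot, String key, String newValue);      // round trip 2
    void commit(UUID ballot, String key, String newValue);          // fire-and-forget
}

record PrepareAndReadResponse(boolean promised, String currentValue) {}

class TwoRoundTripCas {
    static boolean cas(List<Replica> replicas, String key, String expected, String newValue) {
        UUID ballot = UUID.randomUUID();
        int quorum = replicas.size() / 2 + 1;

        // Round trip 1: prepare + quorum read combined; the current value rides back on the promise.
        int promises = 0;
        String current = null;
        for (Replica r : replicas) {
            PrepareAndReadResponse resp = r.prepareAndRead(ballot, key);
            if (resp.promised()) { promises++; current = resp.currentValue(); }
        }
        if (promises < quorum || !Objects.equals(current, expected)) return false;

        // Round trip 2: propose; the commit is sent asynchronously, so the client does not wait for it.
        long accepts = replicas.stream().filter(r -> r.propose(ballot, key, newValue)).count();
        if (accepts < quorum) return false;
        replicas.forEach(r -> CompletableFuture.runAsync(() -> r.commit(ballot, key, newValue)));
        return true;
    }
}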

What do you think? Did I miss anything?

Thanks
Dikang


Re: IN restrictions are not supported on indexed columns

2018-03-24 Thread Dikang Gu
Thanks Benjamin! Created a jira for it:
https://issues.apache.org/jira/browse/CASSANDRA-14344

On Fri, Mar 23, 2018 at 10:27 AM, Benjamin Lerer <
benjamin.le...@datastax.com> wrote:

> There are not real blocker I believe. It is just that we never implemented
> it.
>
> On Fri, Mar 23, 2018 at 6:08 PM, Dikang Gu  wrote:
>
> > Hello C* developers:
> >
> > I have one question, does anyone know why we can not support the IN
> > restrictions on indexed columns? Is it just because no one is working it?
> > Or are there any other reasons?
> >
> > Below is an example query:
> > 
> >
> > cqlsh:ks1> describe keyspace;
> >
> > CREATE KEYSPACE ks1 WITH replication = {'class': 'SimpleStrategy',
> > 'replication_factor': '1'}  AND durable_writes = true;
> >
> > CREATE TABLE ks1.t1 (
> > key int,
> > col1 int,
> > col2 int,
> > value int,
> > PRIMARY KEY (key, col1, col2)
> > ) WITH CLUSTERING ORDER BY (col1 ASC, col2 ASC)
> > AND bloom_filter_fp_chance = 0.01
> > AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
> > AND comment = ''
> > AND compaction = {'class':
> > 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
> > 'max_threshold': '32', 'min_threshold': '4'}
> > AND compression = {'chunk_length_in_kb': '64', 'class': '
> > org.apache.cassandra.io.compress.LZ4Compressor'}
> > AND crc_check_chance = 1.0
> > AND dclocal_read_repair_chance = 0.1
> > AND default_time_to_live = 0
> > AND gc_grace_seconds = 864000
> > AND max_index_interval = 2048
> > AND memtable_flush_period_in_ms = 0
> > AND min_index_interval = 128
> > AND read_repair_chance = 0.0
> > AND speculative_retry = '99PERCENTILE';
> >
> > cqlsh:ks1> select * from t1 where key = 1 and col2 in (1) allow
> filtering;
> >
> >  key | col1 | col2 | value
> > -+--+--+---
> >1 |1 |1 | 1
> >1 |2 |1 | 3
> >
> > (2 rows)
> > cqlsh:ks1> select * from t1 where key = 1 and col2 in (1, 2) allow
> > filtering;
> > *InvalidRequest: Error from server: code=2200 [Invalid query] message="IN
> > restrictions are not supported on indexed columns"*
> > cqlsh:ks1>
> > =
> >
> > Thanks
> >
> >
> > --
> > Dikang
> >
>



-- 
Dikang


IN restrictions are not supported on indexed columns

2018-03-23 Thread Dikang Gu
Hello C* developers:

I have one question: does anyone know why we cannot support IN
restrictions on indexed columns? Is it just because no one is working on it?
Or are there any other reasons?

Below is an example query:


cqlsh:ks1> describe keyspace;

CREATE KEYSPACE ks1 WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '1'}  AND durable_writes = true;

CREATE TABLE ks1.t1 (
key int,
col1 int,
col2 int,
value int,
PRIMARY KEY (key, col1, col2)
) WITH CLUSTERING ORDER BY (col1 ASC, col2 ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class':
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': '
org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';

cqlsh:ks1> select * from t1 where key = 1 and col2 in (1) allow filtering;

 key | col1 | col2 | value
-+--+--+---
   1 |1 |1 | 1
   1 |2 |1 | 3

(2 rows)
cqlsh:ks1> select * from t1 where key = 1 and col2 in (1, 2) allow
filtering;
*InvalidRequest: Error from server: code=2200 [Invalid query] message="IN
restrictions are not supported on indexed columns"*
cqlsh:ks1>
=
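
In the meantime, one client-side workaround is to issue a single-value query
per IN element and merge the results. A rough sketch, assuming the DataStax
Java driver 3.x (adjust the names for your driver version):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.ArrayList;
import java.util.List;

// Workaround sketch only: run one single-value query per element that would
// have gone into IN (...) and merge the rows on the client.
public class InWorkaround {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("ks1")) {
            List<Row> merged = new ArrayList<>();
            for (int col2 : new int[] {1, 2}) { // values that would have been in the IN clause
                ResultSet rs = session.execute(
                        "SELECT * FROM t1 WHERE key = 1 AND col2 = ? ALLOW FILTERING", col2);
                rs.forEach(merged::add);
            }
            merged.forEach(System.out::println);
        }
    }
}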

Thanks


--
Dikang


Rocksandra blog post

2018-03-05 Thread Dikang Gu
As some of you already know, the Instagram Cassandra team is working on a
project to use RocksDB as Cassandra's storage engine.

Today, we just published a blog post about the work we have done, and more
excitingly, we published the benchmark metrics in an AWS environment.

Check it out here:
https://engineering.instagram.com/open-sourcing-a-10x-reduction-in-apache-cassandra-tail-latency-d64f86b43589

Thanks
Dikang


Re: Cassandra Wrapup: Feb 2018 Edition

2018-03-04 Thread Dikang Gu
Congratulations Jay!

On Sun, Mar 4, 2018 at 6:38 PM, Jeff Jirsa  wrote:

> I'm late. Mea culpa. I blame February for only having 28 days.
>
> The following contributors had their first ever commit into the project
> (since the last time I made this list, which was late 2017)!
>
> Johannes Grassler
> Michael Burman
> Nicolas GUYOMAR
> Alex Ott
> Samuel Roberts
> Dinesh Joshi
> Amichai Rothman
> Vince White
> Sumanth Pasupuleti
> Samuel Fink
> Alexander Dejanovski
> Dimitar Dimitrov
> Kevin Wern
> Yuji Ito
>
> Jay Zhuang was recently added as a committer. Congrats Jay!
>
> There are some notably active topics to which I'd like to draw your
> attention, in case you haven't been reading email or following JIRA:
>
> 1) There's been a lot of talk about docs. There are a lot of new JIRAs for
> filling in the doc sections. Some of these could use review and commit (
> https://issues.apache.org/jira/browse/CASSANDRA-14128?
> jql=project%20%3D%20CASSANDRA%20AND%20component%20%3D%20%
> 22Documentation%20and%20Website%22%20and%20status%
> 20%3D%20%22Patch%20Available%22
> ) , some of them still need content (
> https://issues.apache.org/jira/issues/?jql=project%20%
> 3D%20CASSANDRA%20AND%20component%20%3D%20%22Documentation%20and%
> 20Website%22%20and%20status%20%3D%20Open
> ). A friendly reminder for anyone writing docs: respect other peoples'
> copyrights. This hasn't been a problem (as far as I can tell), but while
> other people/companies have written docs that are probably relevant, please
> don't go copying them verbatim. Docs should be some of the lowest bar to
> entry for new contributors - there's a nice howto here if you don't know
> how to contribute docss
> http://cassandra.apache.org/doc/latest/development/documentation.html )
>
> 2) There's a lot of activity around audit logging (
> https://issues.apache.org/jira/browse/CASSANDRA-12151 ) . There's a few
> other related tickets (Stefan's internal auditing events
> https://issues.apache.org/jira/browse/CASSANDRA-13668 , and the
> full-query-log patch at
> https://issues.apache.org/jira/browse/CASSANDRA-13983 ), but there's also
> a
> few different goals (as Joseph Lynch pointed out, there's at least 4 -
> security, compliance / SOX / PCI, replayability, debugging). If you're in
> the class of user that cares about these features (any of the 4), you
> should probably consider visiting that thread and reading the discussion.
>
> 3) Lerh Chuan Low has done a LOT of work on
> https://issues.apache.org/jira/browse/CASSANDRA-8460 (tiered storage). If
> you have any sort of desire to mix spinning+SSD disks in a single server,
> you may want to weigh in on the design.
>
> 4) There was an interesting conversation about performance of latency
> metrics. Started here:
> https://lists.apache.org/thread.html/e7067b3a8048ee62d4932064220333
> 9dc0f466fec6b3fdf6db575ad6@%3Cdev.cassandra.apache.org%3E
>
>
> 5) There are two or three different JIRAs/fixes floating around for the 2
> stupid MV unit tests that keep timing out. At least two of them were opened
> by committers, and I think both were reviewed by committers - please settle
> on one and commit it.
>
> Finally: If you're interested in learning more about cassandra from other
> users, the mailing list and JIRA have both been pretty busy this month, and
> that's nice. There are also Cassandra meetup groups all over the world - if
> you haven't ever attended one, I encourage you to find one (I'm not going
> to link to any, because I don't want it to look like there are any
> "official" groups, but search your favorite sites, you'll likely find one
> near you).
>
> I'm Jeff Jirsa, and this was the February 2018 Cassandra Dev Wrapup.
>



-- 
Dikang


Re: Timeout unit tests in trunk

2018-02-27 Thread Dikang Gu
Cool, I think I fixed the ViewTest by changing from updateView("TRUNCATE
%s") to execute("TRUNCATE %s").

I also split it into several smaller unit tests. Patch is here:
https://issues.apache.org/jira/browse/CASSANDRA-14280

I haven't got time to look into BatchMetricsTest yet.

Thanks
Dikang.

On Tue, Feb 27, 2018 at 6:00 PM, Jason Brown  wrote:

> All,
>
> As @kjellman pointed out, the timeouts on ViewTest & ViewBuilderTaskTest
> are being addressed in CASSANDRA-14194
> <https://issues.apache.org/jira/browse/CASSANDRA-14194> (I have a patch,
> almost ready to release).
>
> @dikang if you want to refactor those tests for fun, go for it - but note
> that the timeouts are currently not due to size.
>
> I have no idea about the BatchMetricsTest, but I can see if it is related
> to the others. @dikang, Do you have any details to share about the
> failures?
>
> -Jason
>
> On Tue, Feb 27, 2018 at 5:16 PM, Dinesh Joshi <
> dinesh.jo...@yahoo.com.invalid> wrote:
>
> > Yes, give it a go. I am not 100% sure if it will help a whole lot but try
> > it out and let's see what happens!
> > Dinesh
> >
> > On Tuesday, February 27, 2018, 2:57:41 PM PST, Dikang Gu <
> > dikan...@gmail.com> wrote:
> >
> >  I took some look at the cql3.ViewTest, it seems too big and timeout very
> > often. Any objections if I split it into two or multiple tests?
> >
> > On Tue, Feb 27, 2018 at 1:32 PM, Michael Kjellman 
> > wrote:
> >
> > > well, turns out we already have a jira tracking the MV tests being
> broken
> > > on trunk. they are legit broken :) thanks jaso
> > >
> > > https://issues.apache.org/jira/browse/CASSANDRA-14194
> > >
> > > not sure about the batch test timeout there though.. did you debug it
> at
> > > all by chance?
> > >
> > >
> > > On Feb 27, 2018, at 1:27 PM, Michael Kjellman  >  > > kjell...@apple.com>> wrote:
> > >
> > > hey dikang: just chatted a little bit about this. proposal: let's add
> the
> > > equivalent of @resource_intensive to unit tests too.. and the first one
> > is
> > > to stop from running the MV unit tests in the free circleci containers.
> > > thoughts?
> > >
> > > also, might want to bug your management to see if you can get some paid
> > > circleci resources. it's game changing!
> > >
> > > best,
> > > kjellman
> > >
> > > On Feb 27, 2018, at 12:12 PM, Dinesh Joshi  > > INVALID<mailto:dinesh.jo...@yahoo.com.INVALID>> wrote:
> > >
> > > Some tests might require additional resources to spin up the required
> > > components. 2 CPU / 4GB might not be sufficient. You may need to bump
> up
> > > the resources to 8CPU / 16GB.
> > > Dinesh
> > >
> > >  On Tuesday, February 27, 2018, 11:24:34 AM PST, Dikang Gu <
> > > dikan...@gmail.com<mailto:dikan...@gmail.com>> wrote:
> > >
> > > Looks like there are a few flaky/timeout unit tests in trunk, wondering
> > is
> > > there anyone looking at them already?
> > >
> > > testBuildRange - org.apache.cassandra.db.view.ViewBuilderTaskTest
> > > testUnloggedPartitionsPerBatch -
> > > org.apache.cassandra.metrics.BatchMetricsTest
> > > testViewBuilderResume - org.apache.cassandra.cql3.ViewTest
> > >
> > > https://circleci.com/gh/DikangGu/cassandra/20
> > >
> > > --
> > > Dikang
> > >
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org dev-
> > > unsubscr...@cassandra.apache.org>
> > > For additional commands, e-mail: dev-h...@cassandra.apache.org > > dev-h...@cassandra.apache.org>
> > >
> > >
> > >
> >
> >
> > --
> > Dikang
> >
> >
>



-- 
Dikang


Re: Timeout unit tests in trunk

2018-02-27 Thread Dikang Gu
I took a look at the cql3.ViewTest; it seems too big and times out very
often. Any objections if I split it into two or more tests?

On Tue, Feb 27, 2018 at 1:32 PM, Michael Kjellman 
wrote:

> well, turns out we already have a jira tracking the MV tests being broken
> on trunk. they are legit broken :) thanks jaso
>
> https://issues.apache.org/jira/browse/CASSANDRA-14194
>
> not sure about the batch test timeout there though.. did you debug it at
> all by chance?
>
>
> On Feb 27, 2018, at 1:27 PM, Michael Kjellman  kjell...@apple.com>> wrote:
>
> hey dikang: just chatted a little bit about this. proposal: let's add the
> equivalent of @resource_intensive to unit tests too.. and the first one is
> to stop from running the MV unit tests in the free circleci containers.
> thoughts?
>
> also, might want to bug your management to see if you can get some paid
> circleci resources. it's game changing!
>
> best,
> kjellman
>
> On Feb 27, 2018, at 12:12 PM, Dinesh Joshi  INVALID<mailto:dinesh.jo...@yahoo.com.INVALID>> wrote:
>
> Some tests might require additional resources to spin up the required
> components. 2 CPU / 4GB might not be sufficient. You may need to bump up
> the resources to 8CPU / 16GB.
> Dinesh
>
>   On Tuesday, February 27, 2018, 11:24:34 AM PST, Dikang Gu <
> dikan...@gmail.com<mailto:dikan...@gmail.com>> wrote:
>
> Looks like there are a few flaky/timeout unit tests in trunk, wondering is
> there anyone looking at them already?
>
> testBuildRange - org.apache.cassandra.db.view.ViewBuilderTaskTest
> testUnloggedPartitionsPerBatch -
> org.apache.cassandra.metrics.BatchMetricsTest
> testViewBuilderResume - org.apache.cassandra.cql3.ViewTest
>
> https://circleci.com/gh/DikangGu/cassandra/20
>
> --
> Dikang
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org<mailto:dev-
> unsubscr...@cassandra.apache.org>
> For additional commands, e-mail: dev-h...@cassandra.apache.org dev-h...@cassandra.apache.org>
>
>
>


-- 
Dikang


Timeout unit tests in trunk

2018-02-27 Thread Dikang Gu
Looks like there are a few flaky/timing-out unit tests in trunk; I'm
wondering if anyone is looking at them already?

testBuildRange - org.apache.cassandra.db.view.ViewBuilderTaskTest
testUnloggedPartitionsPerBatch -
org.apache.cassandra.metrics.BatchMetricsTest
testViewBuilderResume - org.apache.cassandra.cql3.ViewTest

https://circleci.com/gh/DikangGu/cassandra/20

-- 
Dikang


Re: [Proposal] Pluggable storage engine design

2017-12-06 Thread Dikang Gu
Thanks a lot for all the comments and feedback on the design doc; they are
all very useful.

I will wait a couple of days for any new comments/feedback; then, if there
are no disagreements on the scope/guidelines/high-level plan of the project,
I will go ahead and create sub-jiras for the major steps in the plan.

Thanks again for your help!
Dikang.

On Tue, Nov 28, 2017 at 12:23 PM, Dikang Gu  wrote:

> I create a more formal design proposal for the pluggable storage engine
> project, please take a look and let me know your comments.
>
>
> https://docs.google.com/document/d/1suZlvhzgB6NIyBNpM9nxoHxz_
> Ri7qAm-UEO8v8AIFsc
>
>
> At this moment, I'm focus on the purpose, scope, guideline and high level
> plan for the project. I'd like to collect your feedback on these sections,
> and get your agreement on them.
>
> I also list some high level design for the refactor of major components in
> Cassandra. But I plan to create individual jira for each of them and have
> deeper design discussions on those sub jiras later, after we agree on the
> scope/guideline of the project.
>
> Look forward to your feedbacks.
>
> Thanks!
> Dikang
>
>


-- 
Dikang


RocksDB meetup bayarea

2017-12-01 Thread Dikang Gu
Hi Dev,

Sorry for the lame promo, but Pengchao Wang (@wpc), one of the main
contributors to our Cassandra on RocksDB hack, will be presenting at the
annual RocksDB meetup hosted at FB HQ. I just wanted to put this on your
radar in case some of you might be interested.

https://www.meetup.com/RocksDB/events/243955027/

I hope to see some of you there!

Cheers.

Dikang


Sorry for the repost; the previous email may not have gone through Gmail
successfully...


[Proposal] Pluggable storage engine design

2017-11-28 Thread Dikang Gu
I created a more formal design proposal for the pluggable storage engine
project; please take a look and let me know your comments.


https://docs.google.com/document/d/1suZlvhzgB6NIyBNpM9nxoHxz_Ri7qAm-UEO8v8AIFsc


At this moment, I'm focused on the purpose, scope, guidelines, and
high-level plan for the project. I'd like to collect your feedback on these
sections and get your agreement on them.

I also list some high-level designs for the refactoring of major components
in Cassandra. But I plan to create an individual jira for each of them and
have deeper design discussions on those sub-jiras later, after we agree on
the scope/guidelines of the project.

Looking forward to your feedback.

Thanks!
Dikang


Re: Pluggable storage engine discussion

2017-11-03 Thread Dikang Gu
Hi Stefan,

Yeah, my team has been working for 6+ months on the development and testing
of the RocksDB based storage engine, and we are in the process of rolling
it out to our production deployment inside Instagram! We see huge gains in
reducing the latency and footprint of our Cassandra clusters. I will
publish those when I have more metrics after the rollout.

I do hope someone in the community can try it out as well, in your shadow
cluster of course. We have pushed the code here:
https://github.com/Instagram/cassandra/commits/rocks_3.0 .

On Fri, Nov 3, 2017 at 1:48 PM, Stefan Podkowinski  wrote:

> Hi Dikang
>
> Have you been able to continue evaluating RocksDB? I'm afraid we might
> be a bit too much ahead in the discussion by already talking about a
> pluggable architecture, while we haven't fully evaluated yet if we can
> and want to support an alternative RocksDB engine implementation at all.
> Because if we don't, we also don't need a pluggable architecture at this
> point, do we? There's little to be gained from a major refactoring, just
> to find out that alternative engines we thought of didn't turn out to be
> a good fit for production for whatever reasons.
>
> On the other hand, if RocksDB is (by whatever standards) a better
> storage implementation, why not completely switch, instead of just
> making it an option? But if it's not, is a major refactoring still worth
> it?
>
>
> On 03.11.17 19:22, Dikang Gu wrote:
> > Hi,
> >
> > We are having discussions about the pluggable storage engine plan on the
> > jira: https://issues.apache.org/jira/browse/CASSANDRA-13475.
> >
> > We are trying to figure out a plan for the pluggable storage engine
> effort.
> > Right now, the discussion is mainly happening between couple C*
> committers,
> > like Blake and me. But I want to increase the visibility, and I'm very
> > welcome more developers to be involved in the discussion. It will help us
> > on moving forward on this effort.
> >
> > Also, I have a quip as a (very high level) design doc for this project.
> > https://quip.com/bhw5ABUCi3co
> >
> > Thanks
> > Dikang.
> >
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
>
>


-- 
Dikang


Pluggable storage engine discussion

2017-11-03 Thread Dikang Gu
Hi,

We are having discussions about the pluggable storage engine plan on the
jira: https://issues.apache.org/jira/browse/CASSANDRA-13475.

We are trying to figure out a plan for the pluggable storage engine effort.
Right now, the discussion is mainly happening between a couple of C*
committers, like Blake and me. But I want to increase the visibility, and I
very much welcome more developers to be involved in the discussion. It will
help us move forward on this effort.

Also, I have a quip as a (very high level) design doc for this project.
https://quip.com/bhw5ABUCi3co

Thanks
Dikang.


Re: Cassandra pluggable storage engine (update)

2017-10-04 Thread Dikang Gu
Hi DuyHai,

Good point! At this moment, I do not see anything that really prevents us
from having one storage engine type per table; we are using one RocksDB
instance per table anyway. However, we want to do simple things first, and
it's easier for us to have a storage engine per keyspace, for both
development and our internal deployment. We can revisit the choice if there
is a strong need for a storage engine per table.

Thanks
Dikang.

On Wed, Oct 4, 2017 at 1:54 PM, DuyHai Doan  wrote:

> Excellent docs, thanks for the update Dikang.
>
> A question about a design choice, is there any technical reason to specify
> the storage engine at keyspace level rather than table level ?
>
> It's not overly complicated to move all tables sharing the same storage
> engine into the same keyspace but then it makes tables organization
> strongly tied to technical storage engine choice rather than functional
> splitting
>
> Regards
>
> On Wed, Oct 4, 2017 at 10:47 PM, Dikang Gu  wrote:
>
> > Hi Blake,
> >
> > Great questions!
> >
> > 1. Yeah, we implement the encoding algorithms, which could encode C* data
> > types into byte array, and keep the same sorting order. Our
> implementation
> > is based on the orderly lib used in HBase,
> > https://github.com/ndimiduk/orderly .
> > 2. Repair is not supported yet, we are still working on figure out the
> work
> > need to be done to support repair or incremental repair.
> >
> > Thanks
> > Dikang.
> >
> > On Wed, Oct 4, 2017 at 1:39 PM, Blake Eggleston 
> > wrote:
> >
> > > Hi Dikang,
> > >
> > > Cool stuff. 2 questions. Based on your presentation at ngcc, it seems
> > like
> > > rocks db stores things in byte order. Does this mean that you have code
> > > that makes each of the existing types byte comparable, or is clustering
> > > order implementation dependent? Also, I don't see anything in the draft
> > api
> > > that seems to support splitting the data set into arbitrary categories
> > (ie
> > > repaired and unrepaired data living in the same token range). Is
> support
> > > for incremental repair planned for v1?
> > >
> > > Thanks,
> > >
> > > Blake
> > >
> > >
> > > On October 4, 2017 at 1:28:01 PM, Dikang Gu (dikan...@gmail.com)
> wrote:
> > >
> > > Hello C* developers:
> > >
> > > In my previous email (https://www.mail-archive.com/
> > > dev@cassandra.apache.org/msg11024.html), I presented that Instagram
> was
> > > kicking off a project to make C*'s storage engine to be pluggable, as
> > other
> > > modern databases, like mysql, mongoDB etc, so that users will be able
> to
> > > choose most suitable storage engine for different work load, or to use
> > > different features. In addition to that, a pluggable storage engine
> > > architecture will improve the modularity of the system, help to
> increase
> > > the testability and reliability of Cassandra.
> > >
> > > After months of development and testing, we'd like to share the work we
> > > have done, including the first(draft) version of the C* storage engine
> > API,
> > > and the first version of the RocksDB based storage engine.
> > >
> > >
> > >
> > >
> > > For the C* storage engine API, here is the draft version we proposed,
> > > https://docs.google.com/document/d/1PxYm9oXW2jJtSDiZ-
> > > SR9O20jud_0jnA-mW7ttp2dVmk/edit. It contains the APIs for read/write
> > > requests, streaming, and table management. The storage engine related
> > > functionalities, like data encoding/decoding format, on-disk data
> > > read/write, compaction, etc, will be taken care by the storage engine
> > > implementation.
> > >
> > > Each storage engine is a class with each instance of the class is
> stored
> > > in the Keyspace instance. So all the column families within a keyspace
> > will
> > > share one storage engine instance.
> > >
> > > Once a storage engine instance is created, Cassandra sever issues
> > commands
> > > to the engine instance to performance data storage and retrieval tasks
> > such
> > > as opening a column family, managing column families and streaming.
> > >
> > > How to config storage engine for different keyspaces? It's still open
> for
> > > discussion. One proposal is that we can add the storage engine option
> in
> > > the create keyspace cql command, and potentially we can overwrite the

Re: Cassandra pluggable storage engine (update)

2017-10-04 Thread Dikang Gu
Hi Blake,

Great questions!

1. Yeah, we implemented the encoding algorithms, which encode C* data types
into byte arrays while keeping the same sort order (a minimal illustration
follows below). Our implementation is based on the orderly lib used in HBase,
https://github.com/ndimiduk/orderly .
2. Repair is not supported yet; we are still working on figuring out the work
needed to support repair or incremental repair.
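
Here is a minimal, self-contained illustration of the idea in point 1 (this
is not the actual orderly-based code): flip the sign bit of a signed int and
write it big-endian, so that unsigned byte-wise key comparison, as done by a
key-ordered store like RocksDB, matches the numeric order.

import java.nio.ByteBuffer;
import java.util.Arrays;

public class OrderPreservingInt {
    // Flip the sign bit and write big-endian, so byte-wise comparison matches int order.
    static byte[] encode(int v) {
        return ByteBuffer.allocate(4).putInt(v ^ 0x80000000).array();
    }

    // Compare encodings the way a byte-ordered store compares keys: unsigned, lexicographic.
    static int compareKeys(byte[] a, byte[] b) {
        for (int i = 0; i < a.length && i < b.length; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        int[] values = {-10, -1, 0, 1, 42};
        for (int i = 0; i + 1 < values.length; i++) {
            if (compareKeys(encode(values[i]), encode(values[i + 1])) >= 0) {
                throw new AssertionError("order not preserved");
            }
        }
        System.out.println(Arrays.toString(encode(-1)) + " sorts before " + Arrays.toString(encode(0)));
    }
}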

Thanks
Dikang.

On Wed, Oct 4, 2017 at 1:39 PM, Blake Eggleston 
wrote:

> Hi Dikang,
>
> Cool stuff. 2 questions. Based on your presentation at ngcc, it seems like
> rocks db stores things in byte order. Does this mean that you have code
> that makes each of the existing types byte comparable, or is clustering
> order implementation dependent? Also, I don't see anything in the draft api
> that seems to support splitting the data set into arbitrary categories (ie
> repaired and unrepaired data living in the same token range). Is support
> for incremental repair planned for v1?
>
> Thanks,
>
> Blake
>
>
> On October 4, 2017 at 1:28:01 PM, Dikang Gu (dikan...@gmail.com) wrote:
>
> Hello C* developers:
>
> In my previous email (https://www.mail-archive.com/
> dev@cassandra.apache.org/msg11024.html), I presented that Instagram was
> kicking off a project to make C*'s storage engine to be pluggable, as other
> modern databases, like mysql, mongoDB etc, so that users will be able to
> choose most suitable storage engine for different work load, or to use
> different features. In addition to that, a pluggable storage engine
> architecture will improve the modularity of the system, help to increase
> the testability and reliability of Cassandra.
>
> After months of development and testing, we'd like to share the work we
> have done, including the first(draft) version of the C* storage engine API,
> and the first version of the RocksDB based storage engine.
>
>
>
>
> For the C* storage engine API, here is the draft version we proposed,
> https://docs.google.com/document/d/1PxYm9oXW2jJtSDiZ-
> SR9O20jud_0jnA-mW7ttp2dVmk/edit. It contains the APIs for read/write
> requests, streaming, and table management. The storage engine related
> functionalities, like data encoding/decoding format, on-disk data
> read/write, compaction, etc, will be taken care by the storage engine
> implementation.
>
> Each storage engine is a class with each instance of the class is stored
> in the Keyspace instance. So all the column families within a keyspace will
> share one storage engine instance.
>
> Once a storage engine instance is created, Cassandra sever issues commands
> to the engine instance to performance data storage and retrieval tasks such
> as opening a column family, managing column families and streaming.
>
> How to config storage engine for different keyspaces? It's still open for
> discussion. One proposal is that we can add the storage engine option in
> the create keyspace cql command, and potentially we can overwrite the
> option per C* node in its config file.
>
> Under that API, we implemented a new storage engine, based on RocksDB,
> called RocksEngine. In long term, we want to support most of C* existing
> features in RocksEngine, and we want to build it in a progressive manner.
> For the first version of the RocksDBEngine, we support following features:
> Most of non-nested data types
> Table schema
> Point query
> Range query
> Mutations
> Timestamp
> TTL
> Deletions/Cell tombstones
> Streaming
> We do not supported following features in first version yet:
> Multi-partition query
> Nested data types
> Counters
> Range tombstone
> Materialized views
> Secondary indexes
> SASI
> Repair
> At this moment, we've implemented the V1 features, and deployed it to our
> shadow cluster. Using shadowing traffic of our production use cases, we saw
> ~3X P99 read latency drop, compared to our C* 2.2 prod clusters. Here are
> some detailed metrics: https://docs.google.com/document/d/1DojHPteDPSphO0_
> N2meZ3zkmqlidRwwe_cJpsXLcp10.
>
> So if you need the features in existing storage engine, please keep using
> the existing storage engine. If you want to have a more predictable and
> lower read latency, also the features supported by RocksEngine are enough
> for your use cases, then RocksEngine could be a fit for you.
>
> The work is 1% finished, and we want to work together with community to
> make it happen. We presented the work in NGCC last week, and also pushed
> the beta version of the pluggable storage engine to Instagram github
> Cassandra repo, rocks_3.0 branch (https://github.com/Instagram/
> cassandra/tree/rocks_3.0), which is based on C* 3.0.12, please feel free
> to play with it! You can download it and follow the instructions (
> https://github.com/Instagram/cassandra/blob/rocks_3.0/StorageEngine.md)
> to try it out in your test environment, your feedback will be very valuable
> to us.
>
> Thanks
> Dikang.
>
>


-- 
Dikang


Cassandra pluggable storage engine (update)

2017-10-04 Thread Dikang Gu
Hello C* developers:

In my previous email (
https://www.mail-archive.com/dev@cassandra.apache.org/msg11024.html), I
presented that Instagram was kicking off a project to make C*'s storage
engine pluggable, as in other modern databases like MySQL, MongoDB, etc., so
that users will be able to choose the most suitable storage engine for
different workloads, or to use different features. In addition to that, a
pluggable storage engine architecture will improve the modularity of the
system and help to increase the testability and reliability of Cassandra.

After months of development and testing, we'd like to share the work we
have done, including the first (draft) version of the C* storage engine API
and the first version of the RocksDB-based storage engine.



For the C* storage engine API, here is the draft version we proposed,
https://docs.google.com/document/d/1PxYm9oXW2jJtSDiZ-SR9O20jud_0jnA-mW7ttp2dVmk/edit.
It contains the APIs for read/write requests, streaming, and table
management. The storage-engine-related functionality, like the data
encoding/decoding format, on-disk data reads/writes, compaction, etc., will
be taken care of by the storage engine implementation.

Each storage engine is a class, with each instance of the class stored in
the Keyspace instance. So all the column families within a keyspace will
share one storage engine instance.

Once a storage engine instance is created, the Cassandra server issues
commands to the engine instance to perform data storage and retrieval tasks
such as opening a column family, managing column families, and streaming.
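
As a rough, hypothetical illustration of that shape (these names are
stand-ins; the actual signatures are in the design doc linked above):

import java.util.Map;

// Hypothetical engine-per-keyspace shape; not the API from the design doc.
interface StorageEngine {
    void openColumnFamily(String table);                                   // table management
    void apply(String table, Map<String, Object> mutation);                // write path
    Iterable<Map<String, Object>> read(String table, Object partitionKey); // point/range reads
    void stream(String table, Iterable<String> tokenRanges);               // streaming hooks
    void close();
}

// One engine instance per keyspace: every column family in the keyspace goes through it.
final class Keyspace {
    final String name;
    final StorageEngine engine;

    Keyspace(String name, StorageEngine engine) {
        this.name = name;
        this.engine = engine;
    }
}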

How to configure the storage engine for different keyspaces is still open
for discussion. One proposal is that we can add a storage engine option to
the CREATE KEYSPACE CQL command, and potentially we can override the option
per C* node in its config file.

Under that API, we implemented a new storage engine, based on RocksDB,
called RocksEngine. In the long term, we want to support most of C*'s
existing features in RocksEngine, and we want to build it in a progressive
manner. For the first version of RocksEngine, we support the following
features:

   - Most of non-nested data types
   - Table schema
   - Point query
   - Range query
   - Mutations
   - Timestamp
   - TTL
   - Deletions/Cell tombstones
   - Streaming

We do not support the following features in the first version yet:

   - Multi-partition query
   - Nested data types
   - Counters
   - Range tombstone
   - Materialized views
   - Secondary indexes
   - SASI
   - Repair

At this moment, we've implemented the V1 features and deployed them to our
shadow cluster. Using shadow traffic from our production use cases, we saw a
~3X P99 read latency drop compared to our C* 2.2 prod clusters. Here are
some detailed metrics:
https://docs.google.com/document/d/1DojHPteDPSphO0_N2meZ3zkmqlidRwwe_cJpsXLcp10.


So if you need the features in the existing storage engine, please keep
using it. If you want more predictable and lower read latency, and the
features supported by RocksEngine are enough for your use cases, then
RocksEngine could be a fit for you.

The work is 1% finished, and we want to work together with the community to
make it happen. We presented the work at NGCC last week, and also pushed
the beta version of the pluggable storage engine to the Instagram GitHub
Cassandra repo, rocks_3.0 branch (
https://github.com/Instagram/cassandra/tree/rocks_3.0), which is based on
C* 3.0.12. Please feel free to play with it! You can download it and follow
the instructions (
https://github.com/Instagram/cassandra/blob/rocks_3.0/StorageEngine.md) to
try it out in your test environment; your feedback will be very valuable to
us.

Thanks
Dikang.


Re: Commitlog without header

2017-09-22 Thread Dikang Gu
I will try the fixes, thanks Benjamin & Jeff.

On Thu, Sep 21, 2017 at 8:55 PM, Jeff Jirsa  wrote:

> https://issues.apache.org/jira/plugins/servlet/mobile#
> issue/CASSANDRA-11995
>
>
>
> --
> Jeff Jirsa
>
>
> On Sep 19, 2017, at 4:36 PM, Dikang Gu  wrote:
>
> Hello,
>
> In our production cluster, we had multiple times that after a *unclean*
> shutdown, cassandra sever can not start due to commit log exceptions:
>
> 2017-09-17_06:06:32.49830 ERROR 06:06:32 [main]: Exiting due to error while
> processing commit log during initialization.
> 2017-09-17_06:06:32.49831
> org.apache.cassandra.db.commitlog.CommitLogReplayer$
> CommitLogReplayException:
> Could not read commit log descriptor in file
> /data/cassandra/commitlog/CommitLog-5-1503088780367.log
> 2017-09-17_06:06:32.49831 at
> org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(
> CommitLogReplayer.java:634)
> [apache-cassandra-2.2.5+git20170612.e1857fa.jar:2.2.5+git20170612.e1857fa]
> 2017-09-17_06:06:32.49831 at
> org.apache.cassandra.db.commitlog.CommitLogReplayer.
> recover(CommitLogReplayer.java:303)
> [apache-cassandra-2.2.5+git20170612.e1857fa.jar:2.2.5+git20170612.e1857fa]
> 2017-09-17_06:06:32.49831 at
> org.apache.cassandra.db.commitlog.CommitLogReplayer.
> recover(CommitLogReplayer.java:147)
> [apache-cassandra-2.2.5+git20170612.e1857fa.jar:2.2.5+git20170612.e1857fa]
> 2017-09-17_06:06:32.49832 at
> org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:189)
> [apache-cassandra-2.2.5+git20170612.e1857fa.jar:2.2.5+git20170612.e1857fa]
> 2017-09-17_06:06:32.49832 at
> org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:169)
> [apache-cassandra-2.2.5+git20170612.e1857fa.jar:2.2.5+git20170612.e1857fa]
> 2017-09-17_06:06:32.49832 at
> org.apache.cassandra.service.CassandraDaemon.setup(
> CassandraDaemon.java:302)
> [apache-cassandra-2.2.5+git20170612.e1857fa.jar:2.2.5+git20170612.e1857fa]
> 2017-09-17_06:06:32.49832 at
> org.apache.cassandra.service.CassandraDaemon.activate(
> CassandraDaemon.java:544)
> [apache-cassandra-2.2.5+git20170612.e1857fa.jar:2.2.5+git20170612.e1857fa]
> 2017-09-17_06:06:32.49832 at
> org.apache.cassandra.service.CassandraDaemon.main(
> CassandraDaemon.java:607)
> [apache-cassandra-2.2.5+git20170612.e1857fa.jar:2.2.5+git20170612.e1857fa]
>
> I add some logging to the CommitLogDescriptor.readHeader(), and find the
> header is empty in the failure case. By empty, I mean all the fields in the
> header are 0:
>
> 2017-09-19_22:43:02.22112 INFO  22:43:02 [main]: Dikang: crc: 0, checkcrc:
> 2077607535
> 2017-09-19_22:43:02.22130 INFO  22:43:02 [main]: Dikang: version: 0, id: 0,
> parametersLength: 0
>
> As a result, it did not pass the crc check, and failed the commit log
> replay.
>
> My question is: is it a known issue that some race condition can cause
> empty header in commit log? If so, it should be safe just skip last commit
> log with empty header, right?
>
> As you can see, we are using Cassandra 2.2.5.
>
> Thanks
> Dikang.
>
>


-- 
Dikang


Commitlog without header

2017-09-19 Thread Dikang Gu
Hello,

In our production cluster, it has happened multiple times that after an *unclean*
shutdown, the Cassandra server cannot start due to commit log exceptions:

2017-09-17_06:06:32.49830 ERROR 06:06:32 [main]: Exiting due to error while
processing commit log during initialization.
2017-09-17_06:06:32.49831
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException:
Could not read commit log descriptor in file
/data/cassandra/commitlog/CommitLog-5-1503088780367.log
2017-09-17_06:06:32.49831 at
org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:634)
[apache-cassandra-2.2.5+git20170612.e1857fa.jar:2.2.5+git20170612.e1857fa]
2017-09-17_06:06:32.49831 at
org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:303)
[apache-cassandra-2.2.5+git20170612.e1857fa.jar:2.2.5+git20170612.e1857fa]
2017-09-17_06:06:32.49831 at
org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:147)
[apache-cassandra-2.2.5+git20170612.e1857fa.jar:2.2.5+git20170612.e1857fa]
2017-09-17_06:06:32.49832 at
org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:189)
[apache-cassandra-2.2.5+git20170612.e1857fa.jar:2.2.5+git20170612.e1857fa]
2017-09-17_06:06:32.49832 at
org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:169)
[apache-cassandra-2.2.5+git20170612.e1857fa.jar:2.2.5+git20170612.e1857fa]
2017-09-17_06:06:32.49832 at
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:302)
[apache-cassandra-2.2.5+git20170612.e1857fa.jar:2.2.5+git20170612.e1857fa]
2017-09-17_06:06:32.49832 at
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:544)
[apache-cassandra-2.2.5+git20170612.e1857fa.jar:2.2.5+git20170612.e1857fa]
2017-09-17_06:06:32.49832 at
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:607)
[apache-cassandra-2.2.5+git20170612.e1857fa.jar:2.2.5+git20170612.e1857fa]

I added some logging to CommitLogDescriptor.readHeader(), and found that the
header is empty in the failure case. By empty, I mean all the fields in the
header are 0:

2017-09-19_22:43:02.22112 INFO  22:43:02 [main]: Dikang: crc: 0, checkcrc:
2077607535
2017-09-19_22:43:02.22130 INFO  22:43:02 [main]: Dikang: version: 0, id: 0,
parametersLength: 0

As a result, it did not pass the crc check, and failed the commit log
replay.

My question is: is it a known issue that some race condition can cause an
empty header in the commit log? If so, it should be safe to just skip the
last commit log with an empty header, right?
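
To illustrate what I mean by skipping it, here is a rough, standalone sketch
(written for this mail; it is not the real CommitLogDescriptor/CommitLogReplayer
code, and the 32-byte probe size is just an assumption for illustration): if the
descriptor region at the head of the last segment is still all zeros, the segment
was allocated but its header was never written before the crash, so replay could
skip it instead of aborting startup.

import java.io.IOException;
import java.io.RandomAccessFile;

public final class EmptyCommitLogHeaderCheck
{
    // Assumption: probe the first 32 bytes, where the descriptor fields
    // (version, id, parametersLength, crc) would live.
    private static final int HEADER_PROBE_SIZE = 32;

    // Returns true if the probed header region is all zeros, i.e. the segment
    // was pre-allocated but its descriptor was never flushed before the
    // unclean shutdown.
    public static boolean hasEmptyHeader(String commitLogPath) throws IOException
    {
        try (RandomAccessFile file = new RandomAccessFile(commitLogPath, "r"))
        {
            byte[] probe = new byte[(int) Math.min(HEADER_PROBE_SIZE, file.length())];
            file.readFully(probe);
            for (byte b : probe)
                if (b != 0)
                    return false;
            return true;
        }
    }

    public static void main(String[] args) throws IOException
    {
        String path = args[0];
        if (hasEmptyHeader(path))
            System.out.println("Header never written, candidate to skip: " + path);
        else
            System.out.println("Header present, segment should be replayed: " + path);
    }
}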

As you can see, we are using Cassandra 2.2.5.

Thanks
Dikang.


Re: Definition of QUORUM consistency level

2017-06-28 Thread Dikang Gu
https://issues.apache.org/jira/browse/CASSANDRA-13645

On Wed, Jun 28, 2017 at 4:59 PM, Dikang Gu  wrote:

> We implement the patch internally, and deploy to our production clusters,
> we see 2X drop of the P99 quorum read latency, because we can reduce one
> unnecessary cross region read. This is a huge improvement since performance
> is very critical to our customers.
>
> Again, I'm not trying to change the definition of the QUORUM consistency
> level, instead, we want to improve the quorum read latency, by removing
> unnecessary replica requests, which I think can benefit Cassandra users in
> general.
>
> I will create a JIRA, and we can move discussions there.
>
>
> Thanks!
> ​
>
> On Thu, Jun 8, 2017 at 10:12 PM, Jeff Jirsa  wrote:
>
>> Short of actually making ConsistencyLevel pluggable or adding/changing
>> one of the existing levels, an alternative approach would be to divide up
>> the cluster into either real or pseudo-datacenters (with RF=2 in each DC),
>> and then write with QUORUM (which would be 3 nodes, across any combination
>> of datacenters), and read with LOCAL_QUORUM (which would be 2 nodes in the
>> datacenter of the coordinator). You don't have to have distinct physical
>> DCs for this, but you'd need tooling to guarantee an even number of
>> replicas in each virtual datacenter.
>>
>> It's an ugly workaround, but it'd work.
>>
>> Pluggable CL would be nicer, though.
>>
>>
>> On Thu, Jun 8, 2017 at 9:51 PM, Justin Cameron 
>> wrote:
>>
>>> Firstly, this situation only occurs if you need strong consistency and
>>> are
>>> using an even replication factor (RF4, RF6, etc).
>>> Secondly, either the read or write still need to be performed at a
>>> minimum
>>> level of QUORUM. This means there are no extra availability benefits from
>>> your proposal (i.e. a minimum of QUORUM replicas still need to be online
>>> and available)
>>>
>>> So the only potential benefit I can think of is a theoretical performance
>>> boost. If you write with QUORUM, then you'll need to read with
>>> QUORUM-1/HALF (e.g. RF4, write with QUORUM, read with TWO, RF6 write with
>>> QUORUM, read with THREE, RF8 write with QUORUM, read with FOUR, ...). At
>>> most you'd only reduce the number of replicas that the client needs to
>>> block on by 1.
>>>
>>> I'd guess that the performance benefits that you'd gain will probably be
>>> quite small - but I'd happily be proven wrong if you feel like running
>>> some
>>> benchmarks :)
>>>
>>> Cheers,
>>> Justin
>>>
>>> On Fri, 9 Jun 2017 at 14:26 Brandon Williams  wrote:
>>>
>>> > I don't disagree with you there and have never liked TWO/THREE.  This
>>> is
>>> > somewhat relevant: https://issues.apache.org/jira
>>> /browse/CASSANDRA-2338
>>> >
>>> > I don't think going to CL.FOUR, etc, is a good long-term solution, but
>>> I'm
>>> > also not sure what is.
>>> >
>>> >
>>> > On Thu, Jun 8, 2017 at 11:20 PM, Dikang Gu  wrote:
>>> >
>>> >> To me, CL.TWO and CL.THREE are more like work around of the problem,
>>> for
>>> >> example, they do not work if the number of replicas go to 8, which
>>> does
>>> >> possible in our environment (2 replicas in each of 4 DCs).
>>> >>
>>> >> What people want from quorum is strong consistency guarantee, as long
>>> as
>>> >> R+W > N, there are three options: a) R=W=(n/2+1); b) R=(n/2),
>>> W=(n/2+1); c)
>>> >> R=(n/2+1), W=(n/2). What Cassandra doing right now, is the option a),
>>> which
>>> >> is the most expensive option.
>>> >>
>>> >> I can not think of a reason, that people want the quorum read, not for
>>> >> strong consistency reason, but just to read from (n/2+1) nodes. If
>>> they
>>> >> want strong consistency, then the read just needs (n/2) nodes, we are
>>> >> purely waste the one extra request, and hurts read latency as well.
>>> >>
>>> >> Thanks
>>> >> Dikang.
>>> >>
>>> >> On Thu, Jun 8, 2017 at 8:20 PM, Nate McCall 
>>> >> wrote:
>>> >>
>>> >>>
>>> >>> We have CL.TWO.
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>> This was actually the original motivation for CL.TWO and CL.THREE if
>>> >>> memory serves:
>>> >>> https://issues.apache.org/jira/browse/CASSANDRA-2013
>>> >>>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Dikang
>>> >>
>>> >>
>>> > --
>>>
>>>
>>> *Justin Cameron*Senior Software Engineer
>>>
>>>
>>> <https://www.instaclustr.com/>
>>>
>>>
>>> This email has been sent on behalf of Instaclustr Pty. Limited
>>> (Australia)
>>> and Instaclustr Inc (USA).
>>>
>>> This email and any attachments may contain confidential and legally
>>> privileged information.  If you are not the intended recipient, do not
>>> copy
>>> or disclose its content, but please reply to this email immediately and
>>> highlight the error to the sender and then immediately delete the
>>> message.
>>>
>>
>>
>
>
> --
> Dikang
>
>


-- 
Dikang


Re: Definition of QUORUM consistency level

2017-06-28 Thread Dikang Gu
We implemented the patch internally and deployed it to our production
clusters; we see a 2X drop in the P99 quorum read latency, because we remove
one unnecessary cross-region read. This is a huge improvement, since
performance is very critical to our customers.

Again, I'm not trying to change the definition of the QUORUM consistency
level; instead, we want to improve the quorum read latency by removing
unnecessary replica requests, which I think can benefit Cassandra users in
general.

I will create a JIRA, and we can move discussions there.


Thanks!
​

On Thu, Jun 8, 2017 at 10:12 PM, Jeff Jirsa  wrote:

> Short of actually making ConsistencyLevel pluggable or adding/changing one
> of the existing levels, an alternative approach would be to divide up the
> cluster into either real or pseudo-datacenters (with RF=2 in each DC), and
> then write with QUORUM (which would be 3 nodes, across any combination of
> datacenters), and read with LOCAL_QUORUM (which would be 2 nodes in the
> datacenter of the coordinator). You don't have to have distinct physical
> DCs for this, but you'd need tooling to guarantee an even number of
> replicas in each virtual datacenter.
>
> It's an ugly workaround, but it'd work.
>
> Pluggable CL would be nicer, though.
>
>
> On Thu, Jun 8, 2017 at 9:51 PM, Justin Cameron 
> wrote:
>
>> Firstly, this situation only occurs if you need strong consistency and are
>> using an even replication factor (RF4, RF6, etc).
>> Secondly, either the read or write still need to be performed at a minimum
>> level of QUORUM. This means there are no extra availability benefits from
>> your proposal (i.e. a minimum of QUORUM replicas still need to be online
>> and available)
>>
>> So the only potential benefit I can think of is a theoretical performance
>> boost. If you write with QUORUM, then you'll need to read with
>> QUORUM-1/HALF (e.g. RF4, write with QUORUM, read with TWO, RF6 write with
>> QUORUM, read with THREE, RF8 write with QUORUM, read with FOUR, ...). At
>> most you'd only reduce the number of replicas that the client needs to
>> block on by 1.
>>
>> I'd guess that the performance benefits that you'd gain will probably be
>> quite small - but I'd happily be proven wrong if you feel like running
>> some
>> benchmarks :)
>>
>> Cheers,
>> Justin
>>
>> On Fri, 9 Jun 2017 at 14:26 Brandon Williams  wrote:
>>
>> > I don't disagree with you there and have never liked TWO/THREE.  This is
>> > somewhat relevant: https://issues.apache.org/jira/browse/CASSANDRA-2338
>> >
>> > I don't think going to CL.FOUR, etc, is a good long-term solution, but
>> I'm
>> > also not sure what is.
>> >
>> >
>> > On Thu, Jun 8, 2017 at 11:20 PM, Dikang Gu  wrote:
>> >
>> >> To me, CL.TWO and CL.THREE are more like work around of the problem,
>> for
>> >> example, they do not work if the number of replicas go to 8, which does
>> >> possible in our environment (2 replicas in each of 4 DCs).
>> >>
>> >> What people want from quorum is strong consistency guarantee, as long
>> as
>> >> R+W > N, there are three options: a) R=W=(n/2+1); b) R=(n/2),
>> W=(n/2+1); c)
>> >> R=(n/2+1), W=(n/2). What Cassandra doing right now, is the option a),
>> which
>> >> is the most expensive option.
>> >>
>> >> I can not think of a reason, that people want the quorum read, not for
>> >> strong consistency reason, but just to read from (n/2+1) nodes. If they
>> >> want strong consistency, then the read just needs (n/2) nodes, we are
>> >> purely waste the one extra request, and hurts read latency as well.
>> >>
>> >> Thanks
>> >> Dikang.
>> >>
>> >> On Thu, Jun 8, 2017 at 8:20 PM, Nate McCall 
>> >> wrote:
>> >>
>> >>>
>> >>> We have CL.TWO.
>> >>>>
>> >>>>
>> >>>>
>> >>> This was actually the original motivation for CL.TWO and CL.THREE if
>> >>> memory serves:
>> >>> https://issues.apache.org/jira/browse/CASSANDRA-2013
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Dikang
>> >>
>> >>
>> > --
>>
>>
>> *Justin Cameron*Senior Software Engineer
>>
>>
>> <https://www.instaclustr.com/>
>>
>>
>> This email has been sent on behalf of Instaclustr Pty. Limited (Australia)
>> and Instaclustr Inc (USA).
>>
>> This email and any attachments may contain confidential and legally
>> privileged information.  If you are not the intended recipient, do not
>> copy
>> or disclose its content, but please reply to this email immediately and
>> highlight the error to the sender and then immediately delete the message.
>>
>
>


-- 
Dikang


Re: Definition of QUORUM consistency level

2017-06-08 Thread Dikang Gu
To me, CL.TWO and CL.THREE are more like workarounds for the problem; for
example, they do not work if the number of replicas goes to 8, which is
possible in our environment (2 replicas in each of 4 DCs).

What people want from quorum is a strong consistency guarantee. As long as
R+W > N, there are three options: a) R=W=(n/2+1); b) R=(n/2), W=(n/2+1); c)
R=(n/2+1), W=(n/2). What Cassandra is doing right now is option a), which is
the most expensive one.

I cannot think of a reason that people would want a quorum read not for
strong consistency, but just to read from (n/2+1) nodes. If they want strong
consistency, then the read only needs (n/2) nodes; we are purely wasting the
one extra request, and it hurts read latency as well.

Thanks
Dikang.

On Thu, Jun 8, 2017 at 8:20 PM, Nate McCall  wrote:

>
> We have CL.TWO.
>>
>>
>>
> This was actually the original motivation for CL.TWO and CL.THREE if
> memory serves:
> https://issues.apache.org/jira/browse/CASSANDRA-2013
>



-- 
Dikang


Re: Definition of QUORUM consistency level

2017-06-08 Thread Dikang Gu
So, for quorum, what we really want is at least one overlapping node between
the write path and the read path. It was actually my assumption for a long
time that we need (N/2 + 1) for writes and just (N/2) for reads, because that
is enough to provide strong consistency.

On Thu, Jun 8, 2017 at 7:47 PM, Jonathan Haddad  wrote:

> It would be a little weird to change the definition of QUORUM, which means
> majority, to mean something other than majority for a single use case.
> Sounds like you want to introduce a new CL, HALF.
> On Thu, Jun 8, 2017 at 7:43 PM Dikang Gu  wrote:
>
>> Justin, what I suggest is that for QUORUM consistent level, the block for
>> write should be (num_replica/2)+1, this is same as today, but for read
>> request, we just need to access (num_replica/2) nodes, which should provide
>> enough strong consistency.
>>
>> Dikang.
>>
>> On Thu, Jun 8, 2017 at 7:38 PM, Justin Cameron 
>> wrote:
>>
>>> 2/4 for write and 2/4 for read would not be sufficient to achieve strong
>>> consistency, as there is no overlap.
>>>
>>> In your particular case you could potentially use QUORUM for write and
>>> TWO for read (or vice-versa) and still achieve strong consistency. If you
>>> add additional nodes in the future this would obviously no longer work.
>>> Also the benefit of this is dubious, since 3/4 nodes still need to be
>>> accessible to perform writes. I'd also guess that it's unlikely to provide
>>> any significant performance increase.
>>>
>>> Justin
>>>
>>> On Fri, 9 Jun 2017 at 12:29 Dikang Gu  wrote:
>>>
>>>> Hello there,
>>>>
>>>> We have some use cases are doing consistent read/write requests, and we
>>>> have 4 replicas in that cluster, according to our setup.
>>>>
>>>> What's interesting to me is that, for both read and write quorum
>>>> requests, they are blocked for 4/2+1 = 3 replicas, so we are accessing 3
>>>> (for write) + 3 (for reads) = 6 replicas in quorum requests, which is 2
>>>> replicas more than 4.
>>>>
>>>> I think it's not necessary to have 2 overlap nodes in even replication
>>>> factor case.
>>>>
>>>> I suggest to change the `quorumFor(keyspace)` code, separate the case
>>>> for read and write requests, so that we can reduce one replica request in
>>>> read path.
>>>>
>>>> Any concerns?
>>>>
>>>> Thanks!
>>>>
>>>>
>>>> --
>>>> Dikang
>>>>
>>>> --
>>>
>>>
>>> *Justin Cameron*Senior Software Engineer
>>>
>>>
>>> <https://www.instaclustr.com/>
>>>
>>>
>>> This email has been sent on behalf of Instaclustr Pty. Limited
>>> (Australia) and Instaclustr Inc (USA).
>>>
>>> This email and any attachments may contain confidential and legally
>>> privileged information.  If you are not the intended recipient, do not copy
>>> or disclose its content, but please reply to this email immediately and
>>> highlight the error to the sender and then immediately delete the message.
>>>
>>
>>
>>
>> --
>> Dikang
>>
>>


-- 
Dikang


Re: Definition of QUORUM consistency level

2017-06-08 Thread Dikang Gu
Justin, what I suggest is that for the QUORUM consistency level, the block
for writes should be (num_replica/2)+1, the same as today, but for read
requests we only need to access (num_replica/2) nodes, which should still
provide strong consistency.

Dikang.

On Thu, Jun 8, 2017 at 7:38 PM, Justin Cameron 
wrote:

> 2/4 for write and 2/4 for read would not be sufficient to achieve strong
> consistency, as there is no overlap.
>
> In your particular case you could potentially use QUORUM for write and TWO
> for read (or vice-versa) and still achieve strong consistency. If you add
> additional nodes in the future this would obviously no longer work. Also
> the benefit of this is dubious, since 3/4 nodes still need to be accessible
> to perform writes. I'd also guess that it's unlikely to provide any
> significant performance increase.
>
> Justin
>
> On Fri, 9 Jun 2017 at 12:29 Dikang Gu  wrote:
>
>> Hello there,
>>
>> We have some use cases are doing consistent read/write requests, and we
>> have 4 replicas in that cluster, according to our setup.
>>
>> What's interesting to me is that, for both read and write quorum
>> requests, they are blocked for 4/2+1 = 3 replicas, so we are accessing 3
>> (for write) + 3 (for reads) = 6 replicas in quorum requests, which is 2
>> replicas more than 4.
>>
>> I think it's not necessary to have 2 overlap nodes in even replication
>> factor case.
>>
>> I suggest to change the `quorumFor(keyspace)` code, separate the case for
>> read and write requests, so that we can reduce one replica request in read
>> path.
>>
>> Any concerns?
>>
>> Thanks!
>>
>>
>> --
>> Dikang
>>
>> --
>
>
> *Justin Cameron*Senior Software Engineer
>
>
> <https://www.instaclustr.com/>
>
>
> This email has been sent on behalf of Instaclustr Pty. Limited (Australia)
> and Instaclustr Inc (USA).
>
> This email and any attachments may contain confidential and legally
> privileged information.  If you are not the intended recipient, do not copy
> or disclose its content, but please reply to this email immediately and
> highlight the error to the sender and then immediately delete the message.
>



-- 
Dikang


Definition of QUORUM consistency level

2017-06-08 Thread Dikang Gu
Hello there,

We have some use cases doing consistent read/write requests, and we have 4
replicas in that cluster, according to our setup.

What's interesting to me is that both read and write quorum requests block
for 4/2+1 = 3 replicas, so we are accessing 3 (for writes) + 3 (for reads) =
6 replicas per quorum read/write pair, which is 2 replicas more than the 4
we have.

I think it's not necessary to have 2 overlapping nodes in the even
replication factor case.

I suggest changing the `quorumFor(keyspace)` code to separate the read and
write cases, so that we can remove one replica request from the read path.
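
To make the suggestion concrete, here is a rough sketch of the arithmetic
(written from scratch for this mail, not a patch against the actual
ConsistencyLevel/quorumFor code): keep the write block count at (rf/2)+1 as
today, and let the read block count be the smallest R that still satisfies
R + W > RF.

public final class QuorumMath
{
    // Write quorum stays as today: floor(rf / 2) + 1.
    static int writeQuorum(int rf)
    {
        return rf / 2 + 1;
    }

    // Smallest read quorum that still overlaps the write quorum, i.e. the
    // smallest R with R + W > RF. For even RF this is rf/2 (one node fewer
    // than today); for odd RF it is unchanged.
    static int readQuorum(int rf)
    {
        return rf - writeQuorum(rf) + 1;
    }

    public static void main(String[] args)
    {
        for (int rf = 2; rf <= 8; rf++)
            System.out.printf("RF=%d  write=%d  read=%d  (today both are %d)%n",
                              rf, writeQuorum(rf), readQuorum(rf), rf / 2 + 1);
    }
}

For RF=4 this gives write=3 and read=2, so a quorum read would only block on
2 replicas instead of 3.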

Any concerns?

Thanks!


-- 
Dikang


Re: Cassandra on RocksDB experiment result

2017-04-26 Thread Dikang Gu
@Samba, that's a very good point. I definitely do not expect all storage
engines to provide exactly the same features, and each storage engine should
have its own strengths and sweet spots as well. For features not supported by
a certain storage engine, I think it should throw exceptions and fail the
requests; that is better than swallowing the errors silently.
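
As a toy illustration of what I mean by failing the request (the interface and
class names below are made up for this mail, not an actual Cassandra API), an
engine that lacks a feature would throw instead of silently ignoring the call:

// Hypothetical illustration only.
interface HypotheticalStorageEngine
{
    void put(String key, String value);

    // Engines that do not support secondary indexes inherit this and fail fast.
    default void createSecondaryIndex(String column)
    {
        throw new UnsupportedOperationException(
            "Secondary indexes are not supported by " + getClass().getSimpleName());
    }
}

class KeyValueOnlyEngine implements HypotheticalStorageEngine
{
    private final java.util.Map<String, String> data = new java.util.HashMap<>();

    public void put(String key, String value)
    {
        data.put(key, value);
    }
    // createSecondaryIndex is not overridden: calling it surfaces a clear
    // error to the query layer instead of being swallowed.
}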

On Wed, Apr 26, 2017 at 7:40 AM, Samba  wrote:

> some features may work with some storage engine but with others; for
> example, storing large blobs may be efficient in one storage engine while
> quite worse in another. perhaps some storage engines may want to SKIP some
> features or add more.
>
> if a storage engine skips a feature, how should the query executor handle
> the response or lack of it?
> if a storage engine provides a new feature, how should that be enabled for
> that particular storage engine alone?
>
> On Wed, Apr 26, 2017 at 5:07 AM, Dikang Gu  wrote:
>
> > I created several tickets to start the discussion, please free feel to
> > comment on the JIRAs. I'm also open for suggestions about other efficient
> > ways to discuss it.
> >
> > https://issues.apache.org/jira/browse/CASSANDRA-13474
> > https://issues.apache.org/jira/browse/CASSANDRA-13475
> > https://issues.apache.org/jira/browse/CASSANDRA-13476
> >
> > Thanks
> > Dikang.
> >
> > On Mon, Apr 24, 2017 at 9:53 PM, Dikang Gu  wrote:
> >
> > > Thanks everyone for the feedback and suggestions! They are all very
> > > helpful. I'm looking forward to having more discussions about the
> > > implementation details.
> > >
> > > As the next step, we will be focus on three areas:
> > > 1. Pluggable storage engine interface.
> > > 2. Wide column support on RocksDB.
> > > 3. Streaming support on RocksDB.
> > >
> > > I will go ahead and create some JIRAs, to start the discussion about
> > > pluggable storage interface, and how to plug RocksDB into Cassandra.
> > >
> > > Please let me know your thoughts.
> > >
> > > Thanks!
> > > Dikang.
> > >
> > > On Mon, Apr 24, 2017 at 12:42 PM, Patrick McFadin 
> > > wrote:
> > >
> > >> Dikang,
> > >>
> > >> First I want to thank you and everyone else at Instragram for the
> > >> engineering talent you have devoted to the Cassandra project. Here's
> yet
> > >> another great example.
> > >>
> > >> He's going to hate me for dragging him into this, but Vijay
> > Parthasarathy
> > >> has done some exploratory work before on integrating non-java storage
> to
> > >> Cassandra. Might be helpful person to consult.
> > >>
> > >> Patrick
> > >>
> > >>
> > >>
> > >> On Sun, Apr 23, 2017 at 4:25 PM, Nate McCall 
> > wrote:
> > >>
> > >> > > Please take a look and let me know your thoughts. I think the
> > biggest
> > >> > > latency win comes from we get rid of most Java garbages created by
> > >> > current
> > >> > > read/write path and compactions, which reduces the JVM overhead
> and
> > >> makes
> > >> > > the latency to be more predictable.
> > >> > >
> > >> >
> > >> > I want to put this here for the record:
> > >> > https://issues.apache.org/jira/browse/CASSANDRA-2995
> > >> >
> > >> > There are some valid points in the above about increased surface
> area
> > >> > and end-user confusion. That said, just under six years is a long
> > >> > time. I think we are a more mature project now and I completely
> agree
> > >> > with others about the positive impacts of testability this would
> > >> > inherently provide.
> > >> >
> > >> > +1 from me.
> > >> >
> > >> > Dikang, thank you for opening this discussion and sharing your
> efforts
> > >> so
> > >> > far.
> > >> >
> > >>
> > >
> > >
> > >
> > > --
> > > Dikang
> > >
> > >
> >
> >
> > --
> > Dikang
> >
>



-- 
Dikang


Re: Cassandra on RocksDB experiment result

2017-04-25 Thread Dikang Gu
I created several tickets to start the discussion; please feel free to
comment on the JIRAs. I'm also open to suggestions about other efficient
ways to discuss it.

https://issues.apache.org/jira/browse/CASSANDRA-13474
https://issues.apache.org/jira/browse/CASSANDRA-13475
https://issues.apache.org/jira/browse/CASSANDRA-13476

Thanks
Dikang.

On Mon, Apr 24, 2017 at 9:53 PM, Dikang Gu  wrote:

> Thanks everyone for the feedback and suggestions! They are all very
> helpful. I'm looking forward to having more discussions about the
> implementation details.
>
> As the next step, we will be focus on three areas:
> 1. Pluggable storage engine interface.
> 2. Wide column support on RocksDB.
> 3. Streaming support on RocksDB.
>
> I will go ahead and create some JIRAs, to start the discussion about
> pluggable storage interface, and how to plug RocksDB into Cassandra.
>
> Please let me know your thoughts.
>
> Thanks!
> Dikang.
>
> On Mon, Apr 24, 2017 at 12:42 PM, Patrick McFadin 
> wrote:
>
>> Dikang,
>>
>> First I want to thank you and everyone else at Instragram for the
>> engineering talent you have devoted to the Cassandra project. Here's yet
>> another great example.
>>
>> He's going to hate me for dragging him into this, but Vijay Parthasarathy
>> has done some exploratory work before on integrating non-java storage to
>> Cassandra. Might be helpful person to consult.
>>
>> Patrick
>>
>>
>>
>> On Sun, Apr 23, 2017 at 4:25 PM, Nate McCall  wrote:
>>
>> > > Please take a look and let me know your thoughts. I think the biggest
>> > > latency win comes from we get rid of most Java garbages created by
>> > current
>> > > read/write path and compactions, which reduces the JVM overhead and
>> makes
>> > > the latency to be more predictable.
>> > >
>> >
>> > I want to put this here for the record:
>> > https://issues.apache.org/jira/browse/CASSANDRA-2995
>> >
>> > There are some valid points in the above about increased surface area
>> > and end-user confusion. That said, just under six years is a long
>> > time. I think we are a more mature project now and I completely agree
>> > with others about the positive impacts of testability this would
>> > inherently provide.
>> >
>> > +1 from me.
>> >
>> > Dikang, thank you for opening this discussion and sharing your efforts
>> so
>> > far.
>> >
>>
>
>
>
> --
> Dikang
>
>


-- 
Dikang


Re: Cassandra on RocksDB experiment result

2017-04-24 Thread Dikang Gu
Thanks everyone for the feedback and suggestions! They are all very
helpful. I'm looking forward to having more discussions about the
implementation details.

As the next step, we will focus on three areas:
1. Pluggable storage engine interface.
2. Wide column support on RocksDB.
3. Streaming support on RocksDB.

I will go ahead and create some JIRAs to start the discussion about the
pluggable storage interface and how to plug RocksDB into Cassandra.

Please let me know your thoughts.

Thanks!
Dikang.

On Mon, Apr 24, 2017 at 12:42 PM, Patrick McFadin 
wrote:

> Dikang,
>
> First I want to thank you and everyone else at Instragram for the
> engineering talent you have devoted to the Cassandra project. Here's yet
> another great example.
>
> He's going to hate me for dragging him into this, but Vijay Parthasarathy
> has done some exploratory work before on integrating non-java storage to
> Cassandra. Might be helpful person to consult.
>
> Patrick
>
>
>
> On Sun, Apr 23, 2017 at 4:25 PM, Nate McCall  wrote:
>
> > > Please take a look and let me know your thoughts. I think the biggest
> > > latency win comes from we get rid of most Java garbages created by
> > current
> > > read/write path and compactions, which reduces the JVM overhead and
> makes
> > > the latency to be more predictable.
> > >
> >
> > I want to put this here for the record:
> > https://issues.apache.org/jira/browse/CASSANDRA-2995
> >
> > There are some valid points in the above about increased surface area
> > and end-user confusion. That said, just under six years is a long
> > time. I think we are a more mature project now and I completely agree
> > with others about the positive impacts of testability this would
> > inherently provide.
> >
> > +1 from me.
> >
> > Dikang, thank you for opening this discussion and sharing your efforts so
> > far.
> >
>



-- 
Dikang


Cassandra on RocksDB experiment result

2017-04-19 Thread Dikang Gu
Hi Cassandra developers,

This is Dikang from Instagram. I'd like to share with you some experiment
results from our recent work on using RocksDB as Cassandra's storage engine.
In the experiment, I built a prototype that integrates Cassandra 3.0.12 and
RocksDB for a single-column (key-value) use case, shadowed one of our
production use cases, and saw about a 4-6X P99 read latency drop during peak
time, compared to 3.0.12. The P99 latency also became more predictable.

Here is a detailed note with more metrics:

https://docs.google.com/document/d/1Ztqcu8Jzh4USKoWBgDJQw82DBurQmsV-PmfiJYvu_Dc/edit?usp=sharing

Please take a look and let me know your thoughts. I think the biggest
latency win comes from getting rid of most of the Java garbage created by
the current read/write path and compactions, which reduces the JVM overhead
and makes the latency more predictable.

We are very excited about the potential performance gain. As the next step,
I propose making the Cassandra storage engine pluggable (like MySQL and
MongoDB), and we are very interested in providing RocksDB as one storage
option with more predictable performance, together with the community.
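
To give a rough idea of what "pluggable" could mean, here is a deliberately
tiny, made-up sketch of an engine boundary (for discussion only; the real
interface would have to cover partitions, clustering, tombstones, streaming,
and much more):

import java.util.Map;
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Made-up minimal per-table engine boundary, just to anchor the discussion.
interface StorageEngineSketch
{
    void apply(String partitionKey, Map<String, String> columns); // write path
    Map<String, String> read(String partitionKey);                // read path
    void flush();                                                 // memtable -> disk
}

// Trivial in-memory stand-in for either the existing SSTable-based engine or
// a RocksDB-backed one.
class InMemoryEngine implements StorageEngineSketch
{
    private final NavigableMap<String, Map<String, String>> data = new ConcurrentSkipListMap<>();

    public void apply(String partitionKey, Map<String, String> columns)
    {
        data.computeIfAbsent(partitionKey, k -> new ConcurrentHashMap<>()).putAll(columns);
    }

    public Map<String, String> read(String partitionKey)
    {
        return data.get(partitionKey);
    }

    public void flush()
    {
        // nothing to persist for the in-memory stand-in
    }
}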

Thanks.

-- 
Dikang


Re: Weekly Cassandra Status

2017-03-27 Thread Dikang Gu
Jeff, thanks for the summary!

I will take a look at the token JIRA,
https://issues.apache.org/jira/browse/CASSANDRA-13348, since I was working
on that recently.

--Dikang.

On Sun, Mar 26, 2017 at 3:35 PM, Jeff Jirsa  wrote:

> Email stuff:
> - We've moved github pull requests notifications from dev@ to pr@ - if you
> want to see when github issues are opened or updated, subscribe to the new
> list (send email to pr-subscr...@cassandra.apache.org ), or check out the
> archives at  https://lists.apache.org/list.html?p...@cassandra.apache.org
>
> We have some new JIRAs from new' contributors. Would be great if someone
> found the time to review and commit:
>
> - https://issues.apache.org/jira/browse/CASSANDRA-13358
> - https://issues.apache.org/jira/browse/CASSANDRA-13357
> - https://issues.apache.org/jira/browse/CASSANDRA-13356
>
> There are also two trivial typo changes in GH PRs:
> - https://github.com/apache/cassandra/pull/101
> - https://github.com/apache/cassandra/pull/102
>
> These are trivial to commit, and I'll likely do it "soon", but there's an
> open question about process: we've historically avoided using GH PRs, but I
> think ultimately it's better if we can find a way to accept these patches,
> especially with new contributors. I'm not aware of the what the standard is
> for whether or not we need to make a JIRA for tracking and update CHANGES -
> if any of the members of the PMC or committers feel like chiming in on
> procedure here, that'd be great. In the recent past, I've made a trivial
> JIRA just to follow the process, but it feels pretty silly ( GH PR #99 /
> https://github.com/apache/cassandra/commit/091e5fbe418004fd04390a0b1a
> 3486167360
> + https://issues.apache.org/jira/browse/CASSANDRA-13349 ).
>
> A pretty ugly duplicate-token blocker got registered this week: committers
> with free cycles may want to investigate (likely related to the new vnode
> allocation code in 3.0+):
>
> https://issues.apache.org/jira/browse/CASSANDRA-13348
>
> Here's a spattering of a few other patches (some new contributors, some
> old, but only patches from non-committers) who have patch available and no
> reviewer:
>
> - https://issues.apache.org/jira/browse/CASSANDRA-13369 - cql grammar
> issue
> - https://issues.apache.org/jira/browse/CASSANDRA-13374 - non-ascii dashes
> in docs
> - https://issues.apache.org/jira/browse/CASSANDRA-13354 - LCS estimated
> tasks accuracy
> - https://issues.apache.org/jira/browse/CASSANDRA-12748 - GREP_COLOR
> breaks
> startup
>
> We had 3 tickets opened to update embedded libraries to modern versions
> - JNA - https://issues.apache.org/jira/browse/CASSANDRA-13300 (committed)
> - Snappy - https://issues.apache.org/jira/browse/CASSANDRA-13336 (stilll
> open until I check the dtests a bit more closely)
> - JUnit - https://issues.apache.org/jira/browse/CASSANDRA-13360
> (committed)
>
> Since we had 3 of those in a short period of time, I've created an umbrella
> ticket for any other libraries we want to upgrade with 4.0 -
> https://issues.apache.org/jira/browse/CASSANDRA-13361 . We don't typically
> upgrade libraries until we have a reasons, so if you've been unable to use
> a feature of a bundled library because we were stuck on an old version, the
> time to update it in trunk is now (before 4.0 ships).
>
> Finally, we still have a TON of open "missing unit tests" tickets -
> https://issues.apache.org/jira/browse/CASSANDRA-9012 - new contributors
> may
> find those to be good places to have an immediate impact on the project.
>
> - Jeff
>



-- 
Dikang


Re: Hopefully Weekly Apache Cassandra JIRA Topics of Interest

2017-03-14 Thread Dikang Gu
This is awesome, Jeff!

On Sun, Mar 12, 2017 at 2:52 PM, DuyHai Doan  wrote:

> Static columns can already be indexed by custom 2nd index impl, for example
> SASI : https://issues.apache.org/jira/browse/CASSANDRA-11183
>
>
> On Sun, Mar 12, 2017 at 10:40 PM, Jeff Jirsa  wrote:
>
> > Hi folks,
> >
> > Cassandra JIRA is huge, active, and ever-changing. It's easy to miss
> > tickets you may care about, or if you want to start contributing,
> sometimes
> > it's hard to know where to start. I'm going to make an attempt to pick a
> > few dozen JIRAs each week (or month?) that would benefit from some more
> > eyeballs - if you have any tickets that you feel haven't gotten the
> > attention they deserve, feel free to add them to the list.
> >
> >
> > Topics for discussion (not necessarily a "go write a patch" type ticket,
> > but things that could use interested parties discussing options):
> >
> > https://issues.apache.org/jira/browse/CASSANDRA-12944 - Internal
> > diagnostic
> > events. Pretty involved proposal attached.
> > https://issues.apache.org/jira/browse/CASSANDRA-12912 - Warnings when
> the
> > superuser logs in
> > https://issues.apache.org/jira/browse/CASSANDRA-9608 - Get Cassandra
> ready
> > for java9 - if anyone has time to help find issues with jdk9, the oracle
> > team asked for feedback some time ago, I'm not sure if they're still
> > interested, but it'd be good for us to identify issues anyway.
> > https://issues.apache.org/jira/browse/CASSANDRA-13315 - talks about
> using
> > more meaningful names for consistency levels (as aliases)
> > https://issues.apache.org/jira/browse/CASSANDRA-13241 - talks about
> > dropping default chunk size for better IO performance (may have a
> blocking
> > subtask of making compression info more efficient before we do it, but
> you
> > may still be interested in the discussion)
> >
> > Patches that could use a reviewer:
> >
> > LHF (MAY be suitable for someone who is a solid programmer, even if
> they're
> > not super familiar with Cassandra internals - even if you're not a
> regular
> > contributor, you can review and +1 if you're confident that the fix is
> > correct and has been reasonably tested - see the 'How to review'
> checklist
> > http://cassandra.apache.org/doc/latest/development/how_to_review.html ):
> > https://issues.apache.org/jira/browse/CASSANDRA-13307 - Making CQLSH
> > downgrade protocol version automatically
> > https://issues.apache.org/jira/browse/CASSANDRA-13067 - AWS EFS fixes
> > (presents as huge filesystem, overflows internal variables)
> > https://issues.apache.org/jira/browse/CASSANDRA-13263 - Incorrect
> > ComplexColumnData hash code implementation
> > https://issues.apache.org/jira/browse/CASSANDRA-12719 - Typo in docs
> > examples
> >
> > Proposed New features without a reviewer:
> > https://issues.apache.org/jira/browse/CASSANDRA-13269 - Snapshot support
> > for secondary index
> > https://issues.apache.org/jira/browse/CASSANDRA-13270 - Add function
> hooks
> > to deliver Elasticsearch as a Cassandra plugin
> > https://issues.apache.org/jira/browse/CASSANDRA-13304 - Add checksumming
> > to
> > the native protocol
> > https://issues.apache.org/jira/browse/CASSANDRA-13268 - Add 2i to static
> > columns
> >
> > Enhancements without a reviewer:
> > https://issues.apache.org/jira/browse/CASSANDRA-9989 - Optimize BTree
> > builder
> > https://issues.apache.org/jira/browse/CASSANDRA-9988 - BTree Leaf-only
> > iterator
> > https://issues.apache.org/jira/browse/CASSANDRA-12837 - Multithreaded
> > support for rebuild_index
> >
> > Bugs without a reviewer:
> > https://issues.apache.org/jira/browse/CASSANDRA-13323 -
> > IncomingTcpConnection closed on single bad message
> > https://issues.apache.org/jira/browse/CASSANDRA-13020 - Hint delivery
> > fails
> > when prefer_local enabled
> >
> > The new "official" docs sections that could use some contributors! This
> may
> > be the lowest barrier to entry if you want to get involved - the docs are
> > kept in source control ( https://github.com/apache/
> > cassandra/tree/trunk/doc
> > ) , so we can change them with patches against those files submitted
> > through JIRA. No github pull requests, please, just submit a patch on
> JIRA
> > (see https://issues.apache.org/jira/browse/CASSANDRA-13160 or
> > https://issues.apache.org/jira/browse/CASSANDRA-12449 as examples):
> >
> > Some examples that are just stubbed out:
> >
> > http://cassandra.apache.org/doc/latest/architecture/index.html
> > http://cassandra.apache.org/doc/latest/architecture/dynamo.html
> > http://cassandra.apache.org/doc/latest/architecture/storage_engine.html
> > http://cassandra.apache.org/doc/latest/data_modeling/index.html
> > http://cassandra.apache.org/doc/latest/operating/read_repair.html
> > http://cassandra.apache.org/doc/latest/operating/repair.html
> > http://cassandra.apache.org/doc/latest/operating/hints.html
> > http://cassandra.apache.org/doc/latest/operating/backups.html
> > http://cassandra.apache.org/doc/latest/o

Re: Have a CDC commitLog process option in Cassandra

2017-02-09 Thread Dikang Gu
Is it for testing purposes?

On Thu, Feb 9, 2017 at 3:54 PM, Jay Zhuang 
wrote:

> Hi,
>
> To process the CDC commitLogs, it requires a separate Daemon process, Carl
> has a Daemon example here: CASSANDRA-11575.
>
> Does it make sense to integrate it into Cassandra? So the user doesn't
> have to manage another JVM on the same box. Then provide an ITrigger like
> interface (https://github.com/apache/cassandra/blob/trunk/src/java/org
> /apache/cassandra/triggers/ITrigger.java#L49) to process the data.
>
> Or maybe provide an interface option to handle the CDC commitLog in
> SegmentManager(https://github.com/apache/cassandra/blob/trun
> k/src/java/org/apache/cassandra/db/commitlog/CommitLogSegmen
> tManagerCDC.java#L68).
>
> Any comments? If it make sense, I could create a JIRA for that.
>
> Thanks,
> Jay
>



-- 
Dikang


Re: Perf regression between 2.2.5 and 3.11

2017-01-25 Thread Dikang Gu
We still see the perf regression with otc_coalescing_strategy disabled, and
on the 3.0.10 branch as well. We can share our stress test results here.

@Sankalp, are you seeing any throughput regressions on the 3.0.10 branch?

Thanks.

On Thu, Jan 19, 2017 at 2:33 PM, Jeremiah D Jordan <
jeremiah.jor...@gmail.com> wrote:

> You may be getting perf issues from message coalescing depending on what
> CL you are testing with:
> https://issues.apache.org/jira/browse/CASSANDRA-12676
>
> Try you tests with:
> otc_coalescing_strategy: DISABLED
>
> > On Jan 19, 2017, at 4:28 PM, Andrew Whang 
> wrote:
> >
> > Hi,
> >
> > I'm seeing perf regressions (using cassandra-stress) between 2.2.5 and
> > 3.11. I understand these versions are quite far apart, but just wondering
> > if there are stress results publicly available that compare 2.x to 3.x?
> > Thanks.
>
>


-- 
Dikang


Re: Dropped messages on random nodes.

2017-01-24 Thread Dikang Gu
Thanks guys! Jeff Jirsa helped me take a look, and I found a ~10-second young
GC pause in the GC log.

3071128K->282000K(3495296K), 0.1144648 secs]
25943529K->23186623K(66409856K), 9.8971781 secs] [Times: user=2.33
sys=0.00, real=9.89 secs]

I'm trying to get a histogram or heap dump.


Thanks!



On Mon, Jan 23, 2017 at 7:08 PM, Brandon Williams  wrote:

> The lion's share of your drops are from cross-node timeouts, which require
> clock synchronization, so check that first.  If your clocks are synced,
> that means not only are you showing eager dropping based on time, but
> despite the eager dropping you are still facing overload.
>
> That local, non-gc pause is also troubling. (I assume non-gc since there
> wasn't anything logged by the GC inspector.)
>
> On Mon, Jan 23, 2017 at 12:36 AM, Dikang Gu  wrote:
>
> > Hello there,
> >
> > We have a 100 nodes ish cluster, I find that there are dropped messages
> on
> > random nodes in the cluster, which caused error spikes and P99 latency
> > spikes as well.
> >
> > I tried to figure out the cause. I do not see any obvious bottleneck in
> > the cluster, the C* nodes still have plenty of cpu idle/disk io. But I do
> > see some suspicious gossip events around that time, not sure if it's
> > related.
> >
> > 2017-01-21_16:43:56.71033 WARN  16:43:56 [GossipTasks:1]: Not marking
> > nodes down due to local pause of 13079498815 > 50
> > 2017-01-21_16:43:56.85532 INFO  16:43:56 [ScheduledTasks:1]: MUTATION
> > messages were dropped in last 5000 ms: 65 for internal timeout and 10895
> > for cross node timeout
> > 2017-01-21_16:43:56.85533 INFO  16:43:56 [ScheduledTasks:1]: READ
> messages
> > were dropped in last 5000 ms: 33 for internal timeout and 7867 for cross
> > node timeout
> > 2017-01-21_16:43:56.85534 INFO  16:43:56 [ScheduledTasks:1]: Pool Name
> >Active   Pending  Completed   Blocked  All Time
> Blocked
> > 2017-01-21_16:43:56.85534 INFO  16:43:56 [ScheduledTasks:1]:
> MutationStage
> >   128 47794 1015525068 0
>  0
> > 2017-01-21_16:43:56.85535
> > 2017-01-21_16:43:56.85535 INFO  16:43:56 [ScheduledTasks:1]: ReadStage
> >64 20202  450508940 0
>  0
> >
> > Any suggestions?
> >
> > Thanks!
> >
> > --
> > Dikang
> >
> >
>



-- 
Dikang


Re: Dropped messages on random nodes.

2017-01-22 Thread Dikang Gu
Btw, the C* version is 2.2.5, with several backported patches.

On Sun, Jan 22, 2017 at 10:36 PM, Dikang Gu  wrote:

> Hello there,
>
> We have a 100 nodes ish cluster, I find that there are dropped messages on
> random nodes in the cluster, which caused error spikes and P99 latency
> spikes as well.
>
> I tried to figure out the cause. I do not see any obvious bottleneck in
> the cluster, the C* nodes still have plenty of cpu idle/disk io. But I do
> see some suspicious gossip events around that time, not sure if it's
> related.
>
> 2017-01-21_16:43:56.71033 WARN  16:43:56 [GossipTasks:1]: Not marking
> nodes down due to local pause of 13079498815 > 50
> 2017-01-21_16:43:56.85532 INFO  16:43:56 [ScheduledTasks:1]: MUTATION
> messages were dropped in last 5000 ms: 65 for internal timeout and 10895
> for cross node timeout
> 2017-01-21_16:43:56.85533 INFO  16:43:56 [ScheduledTasks:1]: READ messages
> were dropped in last 5000 ms: 33 for internal timeout and 7867 for cross
> node timeout
> 2017-01-21_16:43:56.85534 INFO  16:43:56 [ScheduledTasks:1]: Pool Name
>Active   Pending  Completed   Blocked  All Time Blocked
> 2017-01-21_16:43:56.85534 INFO  16:43:56 [ScheduledTasks:1]: MutationStage
>   128 47794 1015525068 0 0
> 2017-01-21_16:43:56.85535
> 2017-01-21_16:43:56.85535 INFO  16:43:56 [ScheduledTasks:1]: ReadStage
>64 20202  450508940 0 0
>
> Any suggestions?
>
> Thanks!
>
> --
> Dikang
>
>


-- 
Dikang


Dropped messages on random nodes.

2017-01-22 Thread Dikang Gu
Hello there,

We have a cluster of roughly 100 nodes, and I find that there are dropped
messages on random nodes in the cluster, which cause error spikes and P99
latency spikes as well.

I tried to figure out the cause. I do not see any obvious bottleneck in the
cluster; the C* nodes still have plenty of idle CPU and disk I/O headroom.
But I do see some suspicious gossip events around that time, not sure if
they are related.

2017-01-21_16:43:56.71033 WARN  16:43:56 [GossipTasks:1]: Not marking nodes
down due to local pause of 13079498815 > 50
2017-01-21_16:43:56.85532 INFO  16:43:56 [ScheduledTasks:1]: MUTATION
messages were dropped in last 5000 ms: 65 for internal timeout and 10895
for cross node timeout
2017-01-21_16:43:56.85533 INFO  16:43:56 [ScheduledTasks:1]: READ messages
were dropped in last 5000 ms: 33 for internal timeout and 7867 for cross
node timeout
2017-01-21_16:43:56.85534 INFO  16:43:56 [ScheduledTasks:1]: Pool Name
   Active   Pending  Completed   Blocked  All Time Blocked
2017-01-21_16:43:56.85534 INFO  16:43:56 [ScheduledTasks:1]: MutationStage
  128 47794 1015525068 0 0
2017-01-21_16:43:56.85535
2017-01-21_16:43:56.85535 INFO  16:43:56 [ScheduledTasks:1]: ReadStage
   64 20202  450508940 0 0

Any suggestions?

Thanks!

-- 
Dikang


Re: Wrapping up tick-tock

2017-01-10 Thread Dikang Gu
+1 to a 6-month *major* release cadence.

I think we still need *minor* releases containing bug fixes (or maybe small
features?), which would make sense to release more frequently, like monthly,
so that we won't need to wait 6 months for bug fixes or have to maintain a
lot of patches internally.

On Tue, Jan 10, 2017 at 1:56 PM, sankalp kohli 
wrote:

> +1 to 6 month release and ending tick/tock
>
> On Tue, Jan 10, 2017 at 9:44 AM, Nate McCall  wrote:
>
> > >
> > > If this question is to outside the topic and more appropriate for a
> > > different thread I'm happy to put a hold on it until the release
> cadence
> > is
> > > agreed.
> > >
> >
> > Let's please do put this on another thread. Thanks for bringing it up
> > though as it is important and needs discussion.
> >
>



-- 
Dikang


Re: STCS in L0 behaviour

2016-11-26 Thread Dikang Gu
Hi Marcus,

Do you have a stack trace showing which function in
`getNextBackgroundTask` is the most expensive?

Yeah, I think having 15-20K sstables in L0 is very bad; in our heavy-write
cluster, I try my best to reduce the impact of repair and keep the number
of sstables in L0 < 100.

Thanks
Dikang.

On Thu, Nov 24, 2016 at 12:53 PM, Nate McCall  wrote:

> > The reason is described here:
> https://issues.apache.org/jira/browse/CASSANDRA-5371?
> focusedCommentId=13621679&page=com.atlassian.jira.
> plugin.system.issuetabpanels:comment-tabpanel#comment-13621679
> >
> > /Marcus
>
> "...a lot of the work you've done you will redo when you compact your now
> bigger L0 sstable against L1."
>
> ^ Sylvain's hypothesis (next comment down) is actually something we see
> occasionally in practice: having to re-write the contents of L1 too often
> when large L0 SSTables are pulled in. Here is an example we took on a
> system with pending compaction spikes that was seeing this specific issue
> with four LCS-based tables:
>
> https://gist.github.com/zznate/d22812551fa7a527d4c0d931f107c950
>
> The significant part of this particular workload is a burst of heavy writes
> from long-duration scheduled jobs.
>



-- 
Dikang


Node replacement failed in 2.2

2016-11-18 Thread Dikang Gu
Hi, a couple of times I have been unable to replace a down node due to this
error:

2016-11-17_19:33:58.70075 Exception (java.lang.RuntimeException)
encountered during startup: Could not find tokens for
/2401:db00:2130:4091:face:0:13:0 to replace
2016-11-17_19:33:58.70489 ERROR 19:33:58 [main]: Exception encountered
during startup
2016-11-17_19:33:58.70491 java.lang.RuntimeException: Could not find tokens
for /2401:db00:2130:4091:face:0:13:0 to replace
2016-11-17_19:33:58.70491   at
org.apache.cassandra.service.StorageService.prepareReplacementInfo(StorageService.java:525)
~[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
2016-11-17_19:33:58.70492   at
org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:760)
~[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
2016-11-17_19:33:58.70492   at
org.apache.cassandra.service.StorageService.initServer(StorageService.java:693)
~[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
2016-11-17_19:33:58.70492   at
org.apache.cassandra.service.StorageService.initServer(StorageService.java:585)
~[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
2016-11-17_19:33:58.70492   at
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:300)
[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
2016-11-17_19:33:58.70493   at
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:516)
[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
2016-11-17_19:33:58.70493   at
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:625)
[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
2016-11-17_19:33:58.70649 INFO  19:33:58 [StorageServiceShutdownHook]:
Announcing shutdown
2016-11-17_19:34:00.70967 INFO  19:34:00 [StorageServiceShutdownHook]:
Waiting for messaging service to quiesce
2016-11-17_19:34:00.71066 INFO  19:34:00
[ACCEPT-/2401:db00:2130:4091:face:0:13:0]: MessagingService has terminated
the accept() thread

I did not find a relevant ticket for this; is anyone aware of this issue?
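
For context, my understanding is that the error comes from the replacement
bootstrap path: the joining node runs a gossip shadow round and then looks up
the tokens of the address being replaced, and if the seeds no longer have
gossip state for the dead node, the lookup comes back empty and startup
aborts. A heavily simplified, self-contained sketch of that check (names are
illustrative, not Cassandra's actual classes):

import java.net.InetAddress;
import java.util.Collection;
import java.util.Map;

final class ReplacementCheckSketch
{
    // gossipStateFromShadowRound: the token state the joining node learned
    // about the cluster during the shadow round, keyed by endpoint.
    static Collection<String> tokensToReplace(Map<InetAddress, Collection<String>> gossipStateFromShadowRound,
                                              InetAddress replaceAddress)
    {
        Collection<String> tokens = gossipStateFromShadowRound.get(replaceAddress);
        if (tokens == null || tokens.isEmpty())
            // This is the situation behind "Could not find tokens for ... to replace":
            // the seeds never had, or have already expired, gossip state for the dead node.
            throw new RuntimeException("Could not find tokens for " + replaceAddress + " to replace");
        return tokens;
    }
}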

Thanks!

-- 
Dikang


Re: Slow performance after upgrading from 2.0.9 to 2.1.11

2016-11-08 Thread Dikang Gu
Michael, thanks for the info. It sounds to me like a very serious performance
regression. :(

On Tue, Nov 8, 2016 at 11:39 AM, Michael Kjellman <
mkjell...@internalcircle.com> wrote:

> Yes, We hit this as well. We have a internal patch that I wrote to mostly
> revert the behavior back to ByteBuffers with as small amount of code change
> as possible. Performance of our build is now even with 2.0.x and we've also
> forward ported it to 3.x (although the 3.x patch was even more complicated
> due to Bounds, RangeTombstoneBound, ClusteringPrefix which actually
> increases the number of allocations to somewhere between 11 and 13
> depending on how I count it per indexed block -- making it even worse than
> what you're observing in 2.1.
>
> We haven't upstreamed it as 2.1 is obviously not taking any changes at
> this point and the longer term solution is https://issues.apache.org/
> jira/browse/CASSANDRA-9754 (which also includes the changes to go back to
> ByteBuffers and remove as much of the Composites from the storage engine as
> possible.) Also, the solution is a bit of a hack -- although it was a
> blocker from us deploying 2.1 -- so i'm not sure how "hacky" it is if it
> works..
>
> best,
> kjellman
>
>
> On Nov 8, 2016, at 11:31 AM, Dikang Gu mailto:dik
> an...@gmail.com>> wrote:
>
> This is very expensive:
>
> "MessagingService-Incoming-/2401:db00:21:1029:face:0:9:0" prio=10
> tid=0x7f2fd57e1800 nid=0x1cc510 runnable [0x7f2b971b]
>java.lang.Thread.State: RUNNABLE
> at org.apache.cassandra.db.marshal.IntegerType.compare(
> IntegerType.java:29)
> at org.apache.cassandra.db.composites.AbstractSimpleCellNameType.
> compare(AbstractSimpleCellNameType.java:98)
> at org.apache.cassandra.db.composites.AbstractSimpleCellNameType.
> compare(AbstractSimpleCellNameType.java:31)
> at java.util.TreeMap.put(TreeMap.java:545)
> at java.util.TreeSet.add(TreeSet.java:255)
> at org.apache.cassandra.db.filter.NamesQueryFilter$
> Serializer.deserialize(NamesQueryFilter.java:254)
> at org.apache.cassandra.db.filter.NamesQueryFilter$
> Serializer.deserialize(NamesQueryFilter.java:228)
> at org.apache.cassandra.db.SliceByNamesReadCommandSeriali
> zer.deserialize(SliceByNamesReadCommand.java:104)
> at org.apache.cassandra.db.ReadCommandSerializer.
> deserialize(ReadCommand.java:156)
> at org.apache.cassandra.db.ReadCommandSerializer.
> deserialize(ReadCommand.java:132)
> at org.apache.cassandra.net.MessageIn.read(MessageIn.java:99)
> at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(
> IncomingTcpConnection.java:195)
> at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(
> IncomingTcpConnection.java:172)
> at org.apache.cassandra.net.IncomingTcpConnection.run(
> IncomingTcpConnection.java:88)
>
>
> Checked the git history, it comes from this jira:
> https://issues.apache.org/jira/browse/CASSANDRA-5417
>
> Any thoughts?
> ​
>
> On Fri, Oct 28, 2016 at 10:32 AM, Paulo Motta  mailto:pauloricard...@gmail.com>> wrote:
> Haven't seen this before, but perhaps it's related to CASSANDRA-10433?
> This is just a wild guess as it's in a related codepath, but maybe worth
> trying out the patch available to see if it helps anything...
>
> 2016-10-28 15:03 GMT-02:00 Dikang Gu mailto:dik
> an...@gmail.com>>:
> We are seeing huge cpu regression when upgrading one of our 2.0.16 cluster
> to 2.1.14 as well. The 2.1.14 node is not able to handle the same amount of
> read traffic as the 2.0.16 node, actually, it's less than 50%.
>
> And in the perf results, the first line could go as high as 50%, as we
> turn up the read traffic, which never appeared in 2.0.16.
>
> Any thoughts?
> Thanks
>
>
> Samples: 952K of event 'cycles', Event count (approx.): 229681774560
> Overhead  Shared Object  Symbol
>6.52%  perf-196410.map[.]
> Lorg/apache/cassandra/db/marshal/IntegerType;.compare in
> Lorg/apache/cassandra/db/composites/AbstractSimpleCellNameType;.compare
>4.84%  libzip.so  [.] adler32
>2.88%  perf-196410.map[.]
> Ljava/nio/HeapByteBuffer;.get in Lorg/apache/cassandra/db/
> marshal/IntegerType;.compare
>2.39%  perf-196410.map[.]
> Ljava/nio/Buffer;.checkIndex in Lorg/apache/cassandra/db/
> marshal/IntegerType;.findMostSignificantByte
>2.03%  perf-196410.map[.]
> Ljava/math/BigInteger;.compareTo in Lorg/apache/cassandra/db/
> DecoratedKey;.compareTo
>1.6

Re: Slow performance after upgrading from 2.0.9 to 2.1.11

2016-11-08 Thread Dikang Gu
This is very expensive:

"MessagingService-Incoming-/2401:db00:21:1029:face:0:9:0" prio=10
tid=0x7f2fd57e1800 nid=0x1cc510 runnable [0x7f2b971b]
   java.lang.Thread.State: RUNNABLE
at
org.apache.cassandra.db.marshal.IntegerType.compare(IntegerType.java:29)
at
org.apache.cassandra.db.composites.AbstractSimpleCellNameType.compare(AbstractSimpleCellNameType.java:98)
at
org.apache.cassandra.db.composites.AbstractSimpleCellNameType.compare(AbstractSimpleCellNameType.java:31)
at java.util.TreeMap.put(TreeMap.java:545)
at java.util.TreeSet.add(TreeSet.java:255)
at
org.apache.cassandra.db.filter.NamesQueryFilter$Serializer.deserialize(NamesQueryFilter.java:254)
at
org.apache.cassandra.db.filter.NamesQueryFilter$Serializer.deserialize(NamesQueryFilter.java:228)
at
org.apache.cassandra.db.SliceByNamesReadCommandSerializer.deserialize(SliceByNamesReadCommand.java:104)
at
org.apache.cassandra.db.ReadCommandSerializer.deserialize(ReadCommand.java:156)
at
org.apache.cassandra.db.ReadCommandSerializer.deserialize(ReadCommand.java:132)
at org.apache.cassandra.net.MessageIn.read(MessageIn.java:99)
at
org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:195)
at
org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:172)
at
org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:88)


I checked the git history; it comes from this JIRA:
https://issues.apache.org/jira/browse/CASSANDRA-5417

Any thoughts?
​

On Fri, Oct 28, 2016 at 10:32 AM, Paulo Motta 
wrote:

> Haven't seen this before, but perhaps it's related to CASSANDRA-10433?
> This is just a wild guess as it's in a related codepath, but maybe worth
> trying out the patch available to see if it helps anything...
>
> 2016-10-28 15:03 GMT-02:00 Dikang Gu :
>
>> We are seeing huge cpu regression when upgrading one of our 2.0.16
>> cluster to 2.1.14 as well. The 2.1.14 node is not able to handle the same
>> amount of read traffic as the 2.0.16 node, actually, it's less than 50%.
>>
>> And in the perf results, the first line could go as high as 50%, as we
>> turn up the read traffic, which never appeared in 2.0.16.
>>
>> Any thoughts?
>> Thanks
>>
>>
>> Samples: 952K of event 'cycles', Event count (approx.): 229681774560
>> Overhead  Shared Object  Symbol
>>6.52%  perf-196410.map[.]
>> Lorg/apache/cassandra/db/marshal/IntegerType;.compare in
>> Lorg/apache/cassandra/db/composites/AbstractSimpleCellNameType;.compare
>>4.84%  libzip.so  [.] adler32
>>2.88%  perf-196410.map[.]
>> Ljava/nio/HeapByteBuffer;.get in Lorg/apache/cassandra/db/marsh
>> al/IntegerType;.compare
>>2.39%  perf-196410.map[.]
>> Ljava/nio/Buffer;.checkIndex in Lorg/apache/cassandra/db/marsh
>> al/IntegerType;.findMostSignificantByte
>>2.03%  perf-196410.map[.]
>> Ljava/math/BigInteger;.compareTo in Lorg/apache/cassandra/db/Decor
>> atedKey;.compareTo
>>1.65%  perf-196410.map[.] vtable chunks
>>1.44%  perf-196410.map[.]
>> Lorg/apache/cassandra/db/DecoratedKey;.compareTo in
>> Ljava/util/concurrent/ConcurrentSkipListMap;.findNode
>>1.02%  perf-196410.map[.]
>> Lorg/apache/cassandra/db/composites/AbstractSimpleCellNameType;.compare
>>1.00%  snappy-1.0.5.2-libsnappyjava.so[.] 0x3804
>>0.87%  perf-196410.map[.]
>> Ljava/io/DataInputStream;.readFully in Lorg/apache/cassandra/db/Abstr
>> actCell$1;.computeNext
>>0.82%  snappy-1.0.5.2-libsnappyjava.so[.] 0x36dc
>>0.79%  [kernel]   [k]
>> copy_user_generic_string
>>0.73%  perf-196410.map[.] vtable chunks
>>0.71%  perf-196410.map[.]
>> Lorg/apache/cassandra/db/OnDiskAtom$Serializer;.deserializeFromSSTable
>> in Lorg/apache/cassandra/db/AbstractCell$1;.computeNext
>>0.70%  [kernel]   [k] find_busiest_group
>>0.69%  perf-196410.map[.] <80>H3^?
>>0.68%  perf-196410.map[.]
>> Lorg/apache/cassandra/db/DecoratedKey;.compareTo
>>0.65%  perf-196410.map[.]
>> jbyte_disjoint_arraycopy
>>0.64%  [kernel]   [k] _raw_spin_lock
>>0.63%  [kernel]

Re: Broader community involvement in 4.0 (WAS Re: Rough roadmap for 4.0)

2016-11-07 Thread Dikang Gu
My 2 cents: would it be a good idea to have some high-level goals for each
major release? For example, the goals could be something like:
1. Improve scalability/reliability/performance by X%.
2. Add Y new features (feature A, B, C, D...).
3. Fix Z known issues (issue A, B, C, D...).

I feel that if we had such high-level goals, it would be easier to pick the
JIRAs to include in the release.

Does that make sense?

Thanks
Dikang.

On Mon, Nov 7, 2016 at 11:22 AM, Oleksandr Petrov <
oleksandr.pet...@gmail.com> wrote:

> Recently there was another discussion on documentation and comments [1]
>
> On one hand, documentation and comments will help newcomers to familiarise
> themselves with the codebase. On the other - one may get up to speed by
> reading the code and adding some docs. Such things may require less
> oversight and can play some role in improving diversity / increasing an
> amount of involved people.
>
> Same thing with tests. There are some areas where tests need some
> refactoring / improvements, or even just splitting them from one file to
> multiple. It's a good way to experience the process and get involved into
> discussion.
>
> For that, we could add some issues with subtasks (just a few for starters)
> or even just a wiki page with a doc/test wishlist where everyone could add
> a couple of points.
>
> Docs and tests could be used in addition to lhf issues, helping people,
> having comprehensive and quick process and everything else that was
> mentioned in this thread.
>
> Thank you.
>
> [1]
> http://mail-archives.apache.org/mod_mbox/cassandra-dev/201605.mbox/%
> 3ccakkz8q088ojbvhycyz2_2eotqk4y-svwiwksinpt6rr9pop...@mail.gmail.com%3E
>
> On Mon, Nov 7, 2016 at 5:38 PM Aleksey Yeschenko 
> wrote:
>
> > Agreed.
> >
> > --
> > AY
> >
> > On 7 November 2016 at 16:38:07, Jeff Jirsa (jeff.ji...@crowdstrike.com)
> > wrote:
> >
> > ‘Accepted’ JIRA status seems useful, but would encourage something more
> > explicit like ‘Concept Accepted’ or similar to denote that the concept is
> > agreed upon, but the actual patch itself may not be accepted yet.
> >
> > /bikeshed.
> >
> > On 11/7/16, 2:56 AM, "Ben Slater"  wrote:
> >
> > >Thanks Dave. The shepherd concept sounds a lot like I had in mind (and a
> > >better name).
> > >
> > >One other thing I noted from the Mesos process - they have an “Accepted”
> > >jira status that comes after open and means “at least one Mesos
> developer
> > >thought that the ideas proposed in the issue are worth pursuing
> further”.
> > >Might also be something to consider as part of a process like this?
> > >
> > >Cheers
> > >Ben
> > >
> > >On Mon, 7 Nov 2016 at 09:37 Dave Lester  wrote:
> > >
> > >> Hi Ben,
> > >>
> > >> A few ideas to add to your suggestions [inline]:
> > >>
> > >> On 2016-11-06 13:51 (-0800), Ben Slater <
> > https://urldefense.proofpoint.com/v2/url?u=http-3A__ben.
> slater-40instaclustr.com&d=DgIFaQ&c=08AGY6txKsvMOP6lYkHQpPMRA1U6kq
> hAwGa8-0QCg3M&r=yfYEBHVkX6l0zImlOIBID0gmhluYPD5Jje-3CtaT3ow&m=
> 0Ynrto5MaNdgc2fOUtxv50ouikBU_P7VEv6KNub9Bhk&s=
> MAZTdq4wfrTiqh7nImEMcFWtTrsixRFOX7Pi0SKqQv0&e=
> > >
> > >> wrote:
> > >> > Hi All,
> > >> >
> > >> > I thought I would add a couple of observations and suggestions as
> > someone
> > >> > who has both personally made my first contributions to the project
> in
> > the
> > >> > last few months and someone in a leadership role in an organisation
> > >> > (Instaclustr) that is feeling it’s way through increasing our
> > >> contributions
> > >> > as an organisation.
> > >> >
> > >> > Firstly - an observation on contribution experience and what I think
> > is
> > >> > likely to make people want to contribute again:
> > >> > 1) The worst thing that can happen is for your contribution to be
> > >> > completely ignored.
> > >> > 2) The second worst thing is for it to be rejected without a good
> > >> > explanation (that you can learn from) or with hostility.
> > >> > 3) Having it rejected with a good reason is not a bad thing (you
> > learn)
> > >> > 4) Having it accepted is, of course, the best!
> > >> >
> > >> > With this as a background I would suggest a couple of thing that
> help
> > >> make
> > >> > sure (3) and (4) are always more common that (1) and (2) (good
> > outcomes
> > >> are
> > >> > probably more common than bad at the moment but we’ve experienced
> all
> > >> four
> > >> > scenarios in the last few months):
> > >> > 1) I think some process of assigning a committer of a “sponsor” of a
> > >> change
> > >> > (which would probably mean committers volunteering) before it
> > commences
> > >> > would be useful. You can kind of do this at the moment by creating a
> > JIRA
> > >> > and asking for comment but I think the process is a bit unclear and
> a
> > bit
> > >> > intimidating for people starting off and it would be nice to know
> who
> > was
> > >> > your primary reviewer for a piece of work. (Or maybe this process
> does
> > >> > exist and I don’t know about.)
> > >>
> > >> I've seen this appr

Re: Batch read requests to one physical host?

2016-10-24 Thread Dikang Gu
I'd like to move forward on this idea and want to get more advice on it.

Currently I already have a hacked version of the batched read requests. It
implements a MultiReadCommand: the coordinator groups the read commands whose
first replica endpoint is the same and wraps them into a single
MultiReadCommand. On the data node, a new MultiReadVerbHandler receives the
MultiReadCommand, executes the read commands sequentially, and returns the
results.
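To make the idea concrete, here is a minimal, illustrative sketch of the
coordinator-side grouping step. It is not the actual patch: MultiReadCommand
below is just a placeholder class, and the endpoint selection is simplified to
a plain function argument.

import java.net.InetAddress;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Placeholder for the real command; in the hacked version this wraps the list of
// read commands that all target the same first endpoint.
class MultiReadCommand<C>
{
    final InetAddress endpoint;
    final List<C> commands;

    MultiReadCommand(InetAddress endpoint, List<C> commands)
    {
        this.endpoint = endpoint;
        this.commands = commands;
    }
}

final class ReadBatcher
{
    // Group single-partition read commands by their first replica endpoint, so each
    // data node receives one message instead of one message per key.
    static <C> List<MultiReadCommand<C>> groupByFirstEndpoint(List<C> reads,
                                                              Function<C, InetAddress> firstEndpoint)
    {
        Map<InetAddress, List<C>> byEndpoint = new HashMap<>();
        for (C read : reads)
            byEndpoint.computeIfAbsent(firstEndpoint.apply(read), e -> new ArrayList<>()).add(read);

        List<MultiReadCommand<C>> batched = new ArrayList<>();
        for (Map.Entry<InetAddress, List<C>> entry : byEndpoint.entrySet())
            batched.add(new MultiReadCommand<>(entry.getKey(), entry.getValue()));
        return batched;
    }
}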

There is still a lot of work to be done, such as:
1. Add speculative retry support for MultiReadCommand.
2. Use a thread pool to execute the read commands instead of running them sequentially.
3. Group by partition instead of by first endpoint.

Before I move forward, I'd like to collect some advice from you. Does it make
sense to you? Any thoughts?

Thanks
Dikang.

On Wed, Oct 19, 2016 at 3:29 PM, Dikang Gu  wrote:

> I create a new jira to track this: CASSANDRA-12814, which is linked to
> CASSANDRA-10414.
>
> @Nate, agree. And it would be great if we can batch the reads from
> different partitions but still on the same physical host as well, which
> will be valuable for our existing and potential use cases.
>
> Thanks
> Dikang.
>
> On Wed, Oct 19, 2016 at 3:01 PM, Nate McCall  wrote:
>
>> I see a few slightly different things here (equally valuable) in
>> conjunction with CASSANDRA-10414:
>> - Wanting a small number of specific, non-sequential rows out of the
>> same partition (this is common, IME) and grouping those
>> - Extending batch semantics to reads with the same understanding with
>> mutate that if you put different partitions in the same batch it will
>> be slow
>>
>> (I think Eric's IN(..) sorta fits with either of those).
>>
>> Interesting!
>>
>> On Thu, Oct 20, 2016 at 4:26 AM, Tyler Hobbs  wrote:
>> > There's a similar ticket focusing on range reads and secondary index
>> > queries, but the work for these could be done together:
>> > https://issues.apache.org/jira/browse/CASSANDRA-10414
>> >
>> > On Tue, Oct 18, 2016 at 5:59 PM, Dikang Gu  wrote:
>> >
>> >> Hi there,
>> >>
>> >> We have couple use cases that are doing fanout read for their data,
>> means
>> >> one single read request from client contains multiple keys which live
>> on
>> >> different physical hosts. (I know it's not recommended way to access
>> C*).
>> >>
>> >> Right now, on the coordinator, it will issue separate read commands
>> even
>> >> though they will go to the same physical host, which I think is
>> causing a
>> >> lot of overheads.
>> >>
>> >> I'm wondering is it valuable to provide a new read command, that
>> >> coordinator can batch the reads to one datanode, and send to it in one
>> >> message, and datanode will return the results for all keys belong to
>> it?
>> >>
>> >> Any similar ideas before?
>> >>
>> >>
>> >> --
>> >> Dikang
>> >>
>> >
>> >
>> >
>> > --
>> > Tyler Hobbs
>> > DataStax <http://datastax.com/>
>>
>
>
>
> --
> Dikang
>
>


-- 
Dikang


Re: Batch read requests to one physical host?

2016-10-19 Thread Dikang Gu
I created a new JIRA to track this: CASSANDRA-12814, which is linked to
CASSANDRA-10414.

@Nate, agreed. And it would be great if we could also batch reads from
different partitions that live on the same physical host; that would be
valuable for our existing and potential use cases.

Thanks
Dikang.

On Wed, Oct 19, 2016 at 3:01 PM, Nate McCall  wrote:

> I see a few slightly different things here (equally valuable) in
> conjunction with CASSANDRA-10414:
> - Wanting a small number of specific, non-sequential rows out of the
> same partition (this is common, IME) and grouping those
> - Extending batch semantics to reads with the same understanding with
> mutate that if you put different partitions in the same batch it will
> be slow
>
> (I think Eric's IN(..) sorta fits with either of those).
>
> Interesting!
>
> On Thu, Oct 20, 2016 at 4:26 AM, Tyler Hobbs  wrote:
> > There's a similar ticket focusing on range reads and secondary index
> > queries, but the work for these could be done together:
> > https://issues.apache.org/jira/browse/CASSANDRA-10414
> >
> > On Tue, Oct 18, 2016 at 5:59 PM, Dikang Gu  wrote:
> >
> >> Hi there,
> >>
> >> We have couple use cases that are doing fanout read for their data,
> means
> >> one single read request from client contains multiple keys which live on
> >> different physical hosts. (I know it's not recommended way to access
> C*).
> >>
> >> Right now, on the coordinator, it will issue separate read commands even
> >> though they will go to the same physical host, which I think is causing
> a
> >> lot of overheads.
> >>
> >> I'm wondering is it valuable to provide a new read command, that
> >> coordinator can batch the reads to one datanode, and send to it in one
> >> message, and datanode will return the results for all keys belong to it?
> >>
> >> Any similar ideas before?
> >>
> >>
> >> --
> >> Dikang
> >>
> >
> >
> >
> > --
> > Tyler Hobbs
> > DataStax <http://datastax.com/>
>



-- 
Dikang


Batch read requests to one physical host?

2016-10-18 Thread Dikang Gu
Hi there,

We have a couple of use cases that do fan-out reads, meaning a single read
request from the client contains multiple keys that live on different physical
hosts. (I know this is not the recommended way to access C*.)

Right now the coordinator issues separate read commands even when they go to
the same physical host, which I think causes a lot of overhead.

I'm wondering whether it would be valuable to provide a new read command, so
that the coordinator can batch the reads destined for one data node, send them
in one message, and have the data node return the results for all keys that
belong to it.

Have any similar ideas come up before?


-- 
Dikang


Hints handoff is memory intensive

2016-09-19 Thread Dikang Gu
In our 2.1 cluster, I find that hinted handoff uses a lot of memory on our
proxy nodes when delivering hints to a data node that was down for 3+ hours
(our hint window is 3 hours). It pushes young-gen GC pauses as high as 2
seconds.

I'm using a 64G max heap and a 4G young gen. I'm considering increasing the
young-gen size; are there any improvements to hinted handoff in newer versions?
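As a side note, a minimal way to keep an eye on how the young-gen pauses
accumulate, using only the plain JDK management API (nothing Cassandra-specific;
the collector names depend on the JVM flags, and this would need to run
in-process or be adapted to a remote JMX connection), could look like:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Polls cumulative GC counts and times every 10 seconds. With CMS the beans are
// typically named "ParNew" and "ConcurrentMarkSweep"; other collectors differ.
public class GcWatcher
{
    public static void main(String[] args) throws InterruptedException
    {
        while (true)
        {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans())
                System.out.printf("%s: collections=%d totalTimeMs=%d%n",
                                  gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            Thread.sleep(10_000);
        }
    }
}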

some logs:
2016-09-20_05:04:30.67928 INFO  05:04:30 [Service Thread]: ParNew GC in
2200ms.  CMS Old Gen: 49112623080 -> 50052787792; Par Eden Space:
2863398912 -> 0;

mem usage:
2016-09-20T01:25:23.522+ Process summary
  process cpu=0.00%
  application cpu=140.36% (user=132.28% sys=8.08%)
  other: cpu=-140.36%
  heap allocation rate 1500mb/s
[001343] user=90.42% sys= 0.01% alloc= 1439mb/s - HintedHandoff:3
[290234] user=18.85% sys= 5.14% alloc=   40mb/s - CompactionExecutor:176226
[064445] user=17.57% sys= 1.09% alloc=   10mb/s - RMI TCP
Connection(148376)-127.0.0.1
[290439] user= 3.20% sys= 1.10% alloc= 4607kb/s - RMI TCP
Connection(148380)-127.0.0.1
[16] user= 0.75% sys=-0.03% alloc= 2321kb/s - ScheduledTasks:1
[000149] user= 0.32% sys=-0.04% alloc= 1516kb/s - GossipStage:1
[001370] user= 0.11% sys= 0.04% alloc=  420kb/s - GossipTasks:1

Thanks
Dikang.


Re: Can we change the number of vnodes on existing nodes?

2016-09-11 Thread Dikang Gu
Thanks for the suggestions! I may consider the new-DC solution, although it
would involve significant operational work at our scale.

As for the gossip issues, examples include:
https://issues.apache.org/jira/browse/CASSANDRA-11709,
https://issues.apache.org/jira/browse/CASSANDRA-11740

On Fri, Sep 9, 2016 at 9:55 PM, Brandon Williams  wrote:

> I am curious exactly what gossip issues you are encountering.
>
> On Fri, Sep 9, 2016 at 7:45 PM, Dikang Gu  wrote:
>
> > Hi,
> >
> > We have some big cluster (500+ nodes), they have 256 vnodes on physical
> > host, which is causing a lot of problems to us, especially make the
> gossip
> > to be in-efficient.
> >
> > There seems no way to change the number of vnodes on existing nodes, is
> > there any reason that we can not support it? It should not be too
> different
> > from moving the node with one single token, right?
> >
> > Thanks
> > Dikang
> >
>



-- 
Dikang


Can we change the number of vnodes on existing nodes?

2016-09-09 Thread Dikang Gu
Hi,

We have some big clusters (500+ nodes) that use 256 vnodes per physical host,
which is causing a lot of problems for us, especially by making gossip
inefficient.

There seems to be no way to change the number of vnodes on existing nodes; is
there any reason we cannot support it? It should not be too different from
moving a node with one single token, right?

Thanks
Dikang


Re: Counter values become under-counted when running repair.

2016-03-28 Thread Dikang Gu
Hi Aleksey, did you get a chance to take a look?

Thanks
Dikang.

On Thu, Mar 24, 2016 at 10:30 PM, Dikang Gu  wrote:

> @Aleksey, sure, here is the jira:
> https://issues.apache.org/jira/browse/CASSANDRA-11432
>
> Thanks!
>
> On Thu, Mar 24, 2016 at 5:32 PM, Aleksey Yeschenko 
> wrote:
>
>> Best open a JIRA ticket and I’ll have a look at what could be the reason.
>>
>> --
>> AY
>>
>> On 24 March 2016 at 23:20:55, Dikang Gu (dikan...@gmail.com) wrote:
>>
>> @Aleksey, we are writing to cluster with CL = 2, and reading with CL = 1.
>> And overall we have 6 copies across 3 different regions. Do you have
>> comments about our setup?
>>
>> During the repair, the counter value become inaccurate, we are still
>> playing with the repair, will keep you update with more experiments. But
>> do
>> you have any theory around that?
>>
>> Thanks a lot!
>> Dikang.
>>
>> On Thu, Mar 24, 2016 at 11:02 AM, Aleksey Yeschenko 
>> wrote:
>>
>> > After repair is over, does the value settle? What CLs do you write to
>> your
>> > counters with? What CLs are you reading with?
>> >
>> > --
>> > AY
>> >
>> > On 24 March 2016 at 06:17:27, Dikang Gu (dikan...@gmail.com) wrote:
>> >
>> > Hello there,
>> >
>> > We are experimenting Counters in Cassandra 2.2.5. Our setup is that we
>> > have
>> > 6 nodes, across three different regions, and in each region, the
>> > replication factor is 2. Basically, each nodes holds a full copy of the
>> > data.
>> >
>> > When are doing 30k/s counter increment/decrement per node, and at the
>> > meanwhile, we are double writing to our mysql tier, so that we can
>> measure
>> > the accuracy of C* counter, compared to mysql.
>> >
>> > The experiment result was great at the beginning, the counter value in
>> C*
>> > and mysql are very close. The difference is less than 0.1%.
>> >
>> > But when we start to run the repair on one node, the counter value in
>> C*
>> > become much less than the value in mysql, the difference becomes larger
>> > than 1%.
>> >
>> > My question is that is it a known problem that the counter value will
>> > become under-counted if repair is running? Should we avoid running
>> repair
>> > for counter tables?
>> >
>> > Thanks.
>> >
>> > --
>> > Dikang
>> >
>> >
>>
>>
>> --
>> Dikang
>>
>>
>
>
> --
> Dikang
>
>


-- 
Dikang


Re: Counter values become under-counted when running repair.

2016-03-24 Thread Dikang Gu
@Aleksey, sure, here is the jira:
https://issues.apache.org/jira/browse/CASSANDRA-11432

Thanks!

On Thu, Mar 24, 2016 at 5:32 PM, Aleksey Yeschenko 
wrote:

> Best open a JIRA ticket and I’ll have a look at what could be the reason.
>
> --
> AY
>
> On 24 March 2016 at 23:20:55, Dikang Gu (dikan...@gmail.com) wrote:
>
> @Aleksey, we are writing to cluster with CL = 2, and reading with CL = 1.
> And overall we have 6 copies across 3 different regions. Do you have
> comments about our setup?
>
> During the repair, the counter value become inaccurate, we are still
> playing with the repair, will keep you update with more experiments. But
> do
> you have any theory around that?
>
> Thanks a lot!
> Dikang.
>
> On Thu, Mar 24, 2016 at 11:02 AM, Aleksey Yeschenko 
> wrote:
>
> > After repair is over, does the value settle? What CLs do you write to
> your
> > counters with? What CLs are you reading with?
> >
> > --
> > AY
> >
> > On 24 March 2016 at 06:17:27, Dikang Gu (dikan...@gmail.com) wrote:
> >
> > Hello there,
> >
> > We are experimenting Counters in Cassandra 2.2.5. Our setup is that we
> > have
> > 6 nodes, across three different regions, and in each region, the
> > replication factor is 2. Basically, each nodes holds a full copy of the
> > data.
> >
> > When are doing 30k/s counter increment/decrement per node, and at the
> > meanwhile, we are double writing to our mysql tier, so that we can
> measure
> > the accuracy of C* counter, compared to mysql.
> >
> > The experiment result was great at the beginning, the counter value in
> C*
> > and mysql are very close. The difference is less than 0.1%.
> >
> > But when we start to run the repair on one node, the counter value in C*
> > become much less than the value in mysql, the difference becomes larger
> > than 1%.
> >
> > My question is that is it a known problem that the counter value will
> > become under-counted if repair is running? Should we avoid running
> repair
> > for counter tables?
> >
> > Thanks.
> >
> > --
> > Dikang
> >
> >
>
>
> --
> Dikang
>
>


-- 
Dikang


Re: Counter values become under-counted when running repair.

2016-03-24 Thread Dikang Gu
@Aleksey, we are writing to the cluster with CL = 2 and reading with CL = 1,
and overall we have 6 copies across 3 different regions. Do you have any
comments about our setup?
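For clarity, a minimal sketch of what that setup looks like from the client
side, assuming the DataStax Java driver 3.x; the keyspace, table, and
data-center names are made up, and the consistency levels match what is
described in this thread:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class CounterExperiment
{
    public static void main(String[] args)
    {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect())
        {
            // RF=2 in each of the three regions, i.e. six copies in total.
            session.execute("CREATE KEYSPACE IF NOT EXISTS counters WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'region1': 2, 'region2': 2, 'region3': 2}");
            session.execute("CREATE TABLE IF NOT EXISTS counters.hits (id text PRIMARY KEY, value counter)");

            // Increments go out at CL = TWO, reads come back at CL = ONE.
            Statement inc = new SimpleStatement(
                "UPDATE counters.hits SET value = value + 1 WHERE id = 'k1'")
                .setConsistencyLevel(ConsistencyLevel.TWO);
            session.execute(inc);

            Statement read = new SimpleStatement(
                "SELECT value FROM counters.hits WHERE id = 'k1'")
                .setConsistencyLevel(ConsistencyLevel.ONE);
            long value = session.execute(read).one().getLong("value");
            System.out.println("counter value = " + value);
        }
    }
}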

During the repair, the counter values become inaccurate. We are still playing
with the repair and will keep you updated with more experiments. But do you
have any theory about what could cause that?

Thanks a lot!
Dikang.

On Thu, Mar 24, 2016 at 11:02 AM, Aleksey Yeschenko 
wrote:

> After repair is over, does the value settle? What CLs do you write to your
> counters with? What CLs are you reading with?
>
> --
> AY
>
> On 24 March 2016 at 06:17:27, Dikang Gu (dikan...@gmail.com) wrote:
>
> Hello there,
>
> We are experimenting Counters in Cassandra 2.2.5. Our setup is that we
> have
> 6 nodes, across three different regions, and in each region, the
> replication factor is 2. Basically, each nodes holds a full copy of the
> data.
>
> When are doing 30k/s counter increment/decrement per node, and at the
> meanwhile, we are double writing to our mysql tier, so that we can measure
> the accuracy of C* counter, compared to mysql.
>
> The experiment result was great at the beginning, the counter value in C*
> and mysql are very close. The difference is less than 0.1%.
>
> But when we start to run the repair on one node, the counter value in C*
> become much less than the value in mysql, the difference becomes larger
> than 1%.
>
> My question is that is it a known problem that the counter value will
> become under-counted if repair is running? Should we avoid running repair
> for counter tables?
>
> Thanks.
>
> --
> Dikang
>
>


-- 
Dikang


Counter values become under-counted when running repair.

2016-03-23 Thread Dikang Gu
Hello there,

We are experimenting with counters in Cassandra 2.2.5. Our setup is 6 nodes
across three different regions, with a replication factor of 2 in each region;
basically, each node holds a full copy of the data.

While doing 30k/s counter increments/decrements per node, we are also
double-writing to our MySQL tier so that we can measure the accuracy of the C*
counters against MySQL.

The experiment results were great at the beginning: the counter values in C*
and MySQL were very close, with a difference of less than 0.1%.

But once we start running repair on one node, the counter values in C* become
much lower than the values in MySQL, and the difference grows to more than 1%.

My question: is it a known problem that counter values become under-counted
while repair is running? Should we avoid running repair on counter tables?

Thanks.

-- 
Dikang


Re: How to measure the write amplification of C*?

2016-03-23 Thread Dikang Gu
As a follow-up, I'm going to write a simple patch that exposes the number of
bytes flushed from memtables via JMX, so that we can easily monitor it.

Here is the jira: https://issues.apache.org/jira/browse/CASSANDRA-11420
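To illustrate the idea, here is a sketch only, not the actual CASSANDRA-11420
patch. It uses the Dropwizard metrics library that Cassandra already depends
on; the metric name and the wiring into the flush path are assumptions.

import com.codahale.metrics.Counter;
import com.codahale.metrics.MetricRegistry;

// Sketch: a per-table counter that the flush path would bump with the on-disk size
// of each flushed memtable, so the flushed-bytes total can be reported (e.g. through
// a JMX reporter) alongside the other table metrics.
public class FlushedBytesMetric
{
    private static final MetricRegistry registry = new MetricRegistry();
    private final Counter bytesFlushed;

    public FlushedBytesMetric(String keyspace, String table)
    {
        // Hypothetical metric name; the real patch would hang this off TableMetrics.
        this.bytesFlushed = registry.counter(MetricRegistry.name("Table", keyspace, table, "BytesFlushed"));
    }

    // Called after a memtable flush completes, with the size of the written SSTable(s).
    public void onFlush(long onDiskBytes)
    {
        bytesFlushed.inc(onDiskBytes);
    }
}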

On Thu, Mar 10, 2016 at 12:55 PM, Jack Krupansky 
wrote:

> The doc does say this:
>
> "A log-structured engine that avoids overwrites and uses sequential IO to
> update data is essential for writing to solid-state disks (SSD) and hard
> disks (HDD) On HDD, writing randomly involves a higher number of seek
> operations than sequential writing. The seek penalty incurred can be
> substantial. Using sequential IO (thereby avoiding write amplification
> <http://en.wikipedia.org/wiki/Write_amplification> and disk failure),
> Cassandra accommodates inexpensive, consumer SSDs extremely well."
>
> I presume that write amplification argues for placing the commit log on a
> separate SSD device. That should probably be mentioned.
>
> -- Jack Krupansky
>
> On Thu, Mar 10, 2016 at 12:52 PM, Matt Kennedy 
> wrote:
>
>> It isn't really the data written by the host that you're concerned with,
>> it's the data written by your application. I'd start by instrumenting your
>> application tier to tally up the size of the values that it writes to C*.
>>
>> However, it may not be extremely useful to have this value. You can't do
>> much with the information it provides. It is probably a better idea to
>> track the bytes written to flash for each drive so that you know the
>> physical endurance of that type of drive given your workload. Unfortunately
>> the TBW endurance rated for the drive may not be extremely useful given the
>> difference between the synthetic workload used to create those ratings and
>> the workload that Cassandra is producing for your particular case. You can
>> find out more about those here:
>> https://www.jedec.org/standards-documents/docs/jesd219a
>>
>>
>> Matt Kennedy
>>
>> Sr. Product Manager, DSE Core
>>
>> matt.kenn...@datastax.com | Public Calendar <http://goo.gl/4Ui04Z>
>>
>> *DataStax Enterprise - the database for cloud applications.*
>>
>> On Thu, Mar 10, 2016 at 11:44 AM, Dikang Gu  wrote:
>>
>>> Hi Matt,
>>>
>>> Thanks for the detailed explanation! Yes, this is exactly what I'm
>>> looking for, "write amplification = data written to flash/data written
>>> by the host".
>>>
>>> We are heavily using the LCS in production, so I'd like to figure out
>>> the amplification caused by that and see what we can do to optimize it. I
>>> have the metrics of "data written to flash", and I'm wondering is there
>>> an easy way to get the "data written by the host" on each C* node?
>>>
>>> Thanks
>>>
>>> On Thu, Mar 10, 2016 at 8:48 AM, Matt Kennedy 
>>> wrote:
>>>
>>>> TL;DR - Cassandra actually causes a ton of write amplification but it
>>>> doesn't freaking matter any more. Read on for details...
>>>>
>>>> That slide deck does have a lot of very good information on it, but
>>>> unfortunately I think it has led to a fundamental misunderstanding about
>>>> Cassandra and write amplification. In particular, slide 51 vastly
>>>> oversimplifies the situation.
>>>>
>>>> The wikipedia definition of write amplification looks at this from the
>>>> perspective of the SSD controller:
>>>> https://en.wikipedia.org/wiki/Write_amplification#Calculating_the_value
>>>>
>>>> In short, write amplification = data written to flash/data written by
>>>> the host
>>>>
>>>> So, if I write 1MB in my application, but the SSD has to write my 1MB,
>>>> plus rearrange another 1MB of data in order to make room for it, then I've
>>>> written a total of 2MB and my write amplification is 2x.
>>>>
>>>> In other words, it is measuring how much extra the SSD controller has
>>>> to write in order to do its own housekeeping.
>>>>
>>>> However, the wikipedia definition is a bit more constrained than how
>>>> the term is used in the storage industry. The whole point of looking at
>>>> write amplification is to understand the impact that a particular workload
>>>> is going to have on the underlying NAND by virtue of the data written. So a
>>>> definition of write amplification that is a little more relevant to the
>>>> context of Cassandra is t

Re: Compaction Filter in Cassandra

2016-03-19 Thread Dikang Gu
FYI, this is the JIRA: https://issues.apache.org/jira/browse/CASSANDRA-11348

We can move the discussion to the JIRA if you want.

On Thu, Mar 17, 2016 at 11:46 AM, Dikang Gu  wrote:

> Hi Eric,
>
> Thanks for sharing the information!
>
> We also mainly want to use it for trimming data, either by the time or the
> number of columns in a row. We haven't started the work yet, do you mind to
> share some patches? We'd love to try it and test it in our environment.
>
> Thanks.
>
> On Tue, Mar 15, 2016 at 9:36 PM, Eric Stevens  wrote:
>
>> We have been working on filtering compaction for a month or so (though we
>> call it deleting compaction, its implementation is as a filtering
>> compaction strategy).  The feature is nearing completion, and we have used
>> it successfully in a limited production capacity against DSE 4.8 series.
>>
>> Our use case is that our records are written anywhere between a month, up
>> to several years before they are scheduled for deletion.  Tombstones are
>> too expensive, as we have tables with hundreds of billions of rows.  In
>> addition, traditional TTLs don't work for us because our customers are
>> permitted to change their retention policy such that already-written
>> records should not be deleted if they increase their retention after the
>> record was written (or vice versa).
>>
>> We can clean up data more cheaply and more quickly with filtered
>> compaction than with tombstones and traditional compaction.  Our
>> implementation is a wrapper compaction strategy for another underlying
>> strategy, so that you can have the characteristics of whichever strategy
>> makes sense in terms of managing your SSTables, while interceding and
>> removing records during compaction (including cleaning up secondary
>> indexes) that otherwise would have survived into the new SSTable.
>>
>> We are hoping to contribute it back to the community, so if you'd be
>> interested in helping test it out, I'd love to hear from you.
>>
>> On Sat, Mar 12, 2016 at 5:12 AM Marcus Eriksson 
>> wrote:
>>
>>> We don't have anything like that, do you have a specific use case in
>>> mind?
>>>
>>> Could you create a JIRA ticket and we can discuss there?
>>>
>>> /Marcus
>>>
>>> On Sat, Mar 12, 2016 at 7:05 AM, Dikang Gu  wrote:
>>>
>>>> Hello there,
>>>>
>>>> RocksDB has the feature called "Compaction Filter" to allow application
>>>> to modify/delete a key-value during the background compaction.
>>>> https://github.com/facebook/rocksdb/blob/v4.1/include/rocksdb/options.h#L201-L226
>>>>
>>>> I'm wondering is there a plan/value to add this into C* as well? Or is
>>>> there already a similar thing in C*?
>>>>
>>>> Thanks
>>>>
>>>> --
>>>> Dikang
>>>>
>>>>
>>>
>
>
> --
> Dikang
>
>


-- 
Dikang


Re: Compaction Filter in Cassandra

2016-03-19 Thread Dikang Gu
Hi Eric,

Thanks for sharing the information!

We also mainly want to use it for trimming data, either by time or by the
number of columns in a row. We haven't started the work yet; would you mind
sharing some patches? We'd love to try them out and test them in our environment.

Thanks.

On Tue, Mar 15, 2016 at 9:36 PM, Eric Stevens  wrote:

> We have been working on filtering compaction for a month or so (though we
> call it deleting compaction, its implementation is as a filtering
> compaction strategy).  The feature is nearing completion, and we have used
> it successfully in a limited production capacity against DSE 4.8 series.
>
> Our use case is that our records are written anywhere between a month, up
> to several years before they are scheduled for deletion.  Tombstones are
> too expensive, as we have tables with hundreds of billions of rows.  In
> addition, traditional TTLs don't work for us because our customers are
> permitted to change their retention policy such that already-written
> records should not be deleted if they increase their retention after the
> record was written (or vice versa).
>
> We can clean up data more cheaply and more quickly with filtered
> compaction than with tombstones and traditional compaction.  Our
> implementation is a wrapper compaction strategy for another underlying
> strategy, so that you can have the characteristics of whichever strategy
> makes sense in terms of managing your SSTables, while interceding and
> removing records during compaction (including cleaning up secondary
> indexes) that otherwise would have survived into the new SSTable.
>
> We are hoping to contribute it back to the community, so if you'd be
> interested in helping test it out, I'd love to hear from you.
>
> On Sat, Mar 12, 2016 at 5:12 AM Marcus Eriksson  wrote:
>
>> We don't have anything like that, do you have a specific use case in mind?
>>
>> Could you create a JIRA ticket and we can discuss there?
>>
>> /Marcus
>>
>> On Sat, Mar 12, 2016 at 7:05 AM, Dikang Gu  wrote:
>>
>>> Hello there,
>>>
>>> RocksDB has the feature called "Compaction Filter" to allow application
>>> to modify/delete a key-value during the background compaction.
>>> https://github.com/facebook/rocksdb/blob/v4.1/include/rocksdb/options.h#L201-L226
>>>
>>> I'm wondering is there a plan/value to add this into C* as well? Or is
>>> there already a similar thing in C*?
>>>
>>> Thanks
>>>
>>> --
>>> Dikang
>>>
>>>
>>


-- 
Dikang


Compaction Filter in Cassandra

2016-03-11 Thread Dikang Gu
Hello there,

RocksDB has a feature called "Compaction Filter" that allows the application to
modify or delete a key-value pair during background compaction:
https://github.com/facebook/rocksdb/blob/v4.1/include/rocksdb/options.h#L201-L226

I'm wondering whether there is a plan (or value) to add something like this to
C* as well. Or is there already a similar mechanism in C*?
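Purely to make the question concrete, here is a hypothetical sketch of the kind
of hook I have in mind, modeled on RocksDB's CompactionFilter. None of this
exists in C*; the interface, names, and types are illustrative only.

import java.nio.ByteBuffer;

// Hypothetical only -- not an existing Cassandra API.
// A per-table hook invoked for each cell while SSTables are merged during compaction,
// letting the application drop data without first writing tombstones.
public interface CompactionFilter
{
    enum Decision { KEEP, DROP }

    // The buffers are stand-ins for whatever key/cell representation a real integration
    // would expose; timestampMicros is the cell's write timestamp.
    Decision filter(ByteBuffer partitionKey, ByteBuffer cellName, ByteBuffer cellValue, long timestampMicros);
}

// Example: drop anything older than a retention horizon (a compaction-time "trim").
class RetentionFilter implements CompactionFilter
{
    private final long horizonMicros;

    RetentionFilter(long horizonMicros)
    {
        this.horizonMicros = horizonMicros;
    }

    public Decision filter(ByteBuffer partitionKey, ByteBuffer cellName, ByteBuffer cellValue, long timestampMicros)
    {
        return timestampMicros < horizonMicros ? Decision.DROP : Decision.KEEP;
    }
}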

Thanks

-- 
Dikang


Re: NGCC 2016

2016-03-10 Thread Dikang Gu
Awesome! Just registered. I haven't attended before; should I expect to get a
response some time later?

Thanks
Dikang.

On Thu, Mar 10, 2016 at 2:09 PM, Jonathan Ellis  wrote:

> And, it's actually June 9-10.  Correct on the form.  Sorry!
>
> On Thu, Mar 10, 2016 at 3:48 PM, Jonathan Ellis  wrote:
>
> > ... and here's an actual working link: http://goo.gl/forms/Ec6DdNFD6h
> >
> > On Thu, Mar 10, 2016 at 3:47 PM, Jonathan Ellis 
> wrote:
> >
> >> Hi all,
> >>
> >> This year's Next Generation Cassandra Conference will be June 8-9 (yes,
> >> two days!) in Austin, Texas.  The first day will be a single track of
> >> prepared presentations, similar to previous years'.  The second day
> will be
> >> reserved for followup discussions in an "unconference" format.
> >>
> >> Details and registration:
> >>
> https://docs.google.com/forms/d/11CMRuTlUeyPd_X9Oo11X7_xx8RAJEzZ-HOVc5p079c8/edit?usp=forms_home&ths=true
> >>
> >> See you there!
> >>
> >> --
> >> Jonathan Ellis
> >> Project Chair, Apache Cassandra
> >> co-founder, http://www.datastax.com
> >> @spyced
> >>
> >
> >
> >
> > --
> > Jonathan Ellis
> > Project Chair, Apache Cassandra
> > co-founder, http://www.datastax.com
> > @spyced
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder, http://www.datastax.com
> @spyced
>



-- 
Dikang


How to measure the write amplification of C*?

2016-03-09 Thread Dikang Gu
Hello there,

I'm wondering whether there is a good way to measure the write amplification of
Cassandra.

I'm thinking it could be calculated as (number of bytes written to the disk) /
(size of mutations written to the node).
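A trivial sketch of that calculation; the inputs are assumed to come from
whatever disk and mutation metrics are available, and the example numbers are
made up:

// Write amplification = bytes physically written to disk (flushes + compactions)
// divided by the logical mutation bytes the node received.
public final class WriteAmplification
{
    static double ratio(long bytesWrittenToDisk, long mutationBytesReceived)
    {
        return (double) bytesWrittenToDisk / mutationBytesReceived;
    }

    public static void main(String[] args)
    {
        // Made-up example: 3 GB hit the disk for 1 GB of mutations -> amplification of 3.0.
        System.out.println(ratio(3_000_000_000L, 1_000_000_000L));
    }
}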

Do we already have a metric for the "size of mutations written to the node"? I
did not find one in the JMX metrics.

Thanks

-- 
Dikang