On Thu, Mar 15, 2018 at 8:32 PM, 张晓宁 <zhangxiaon...@jd.com> wrote:
> Thank you Dan! My follow-up comments are marked with "XiaoNing:".
>
> *From:* Dan Burkert [mailto:danburk...@apache.org]
> *Sent:* March 16, 2018 1:06
> *To:* user@kudu.apache.org
> *Subject:* Re: A few questions for using Kudu
>
> Hi, answers inline:
>
> On Thu, Mar 15, 2018 at 3:12 AM, 张晓宁 <zhangxiaon...@jd.com> wrote:
>
> I have a few questions for using kudu:
>
> 1. As more and more data is inserted into kudu, the performance
> decreases. After continuous data insertion for about 30 minutes, the TPS
> performance decreased by 20%, and after 1 hour of data insertion, the
> performance decreased by 40%. Is this a known issue?
>
> This is expected if you are inserting data in random order. If you try
> another benchmark where you insert data in primary key sorted order, you'll
> see that the performance will be much higher, and more consistent. If you
> have a heavy insert workload, this kind of optimization is critical. The
> table's partitioning and primary key can often be designed to make this
> happen naturally, but it's a dataset-dependent thing, so without more
> specifics about your data it's difficult to give more precise advice.
>
> XiaoNing: Our table has 2 levels of partitioning. The first level is by
> date range (using the column timestamp), one partition for each single day,
> and the second level is by a hash on 2 columns (key + host). These 3
> columns (timestamp, key, host) are the primary key of the table. For your
> comment "insert data in primary key sorted order", do you mean we need to
> sort the data on the 3 primary-key columns before insertion?
>

If timestamp is the first column then it should probably be somewhat
naturally sorted by the primary key, right? It doesn't need to be perfectly
sorted, but if the inserts are in roughly PK order, we will avoid
unnecessary compaction.

> 2. When setting the replica number to be 1, in total I will have 2
> copies of data (1 master data + 1 replica data), is this true?
>
> That's incorrect. The master node does not hold any table data. If you
> set the number of replicas to be 1, you will lose data if you lose the
> tablet server which holds the replica. We always recommend production
> workloads set number of replicas to 3 in order to have fault tolerance.
>
> XiaoNing: So if we want to have fault tolerance, we should at least set
> the replica number to be 3, right?
>

That's right.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera
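
For illustration only, "roughly PK order" as discussed above could be approximated on the client by sorting each buffered batch on the three primary-key columns before writing it. This is a minimal sketch in plain Python, not Kudu client code: the row layout (timestamp, key, host, value), the sample values, and the helper name sorted_batches are all assumptions, and the actual insert call is left out.

```python
from operator import itemgetter

# Assumed row layout: (timestamp, key, host, value).
# The first three fields mirror the primary key described in the thread.
PRIMARY_KEY = itemgetter(0, 1, 2)


def sorted_batches(rows, batch_size):
    """Yield batches of rows, each batch sorted by primary key.

    Only rows inside one batch are ordered; batches themselves arrive in
    input order, so the overall stream is roughly (not perfectly) sorted.
    """
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield sorted(batch, key=PRIMARY_KEY)
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield sorted(batch, key=PRIMARY_KEY)


# Made-up sample rows arriving slightly out of order.
rows = [
    (1521072003, "cpu", "host-b", 0.7),
    (1521072001, "mem", "host-a", 0.4),
    (1521072001, "cpu", "host-a", 0.9),
    (1521072002, "cpu", "host-a", 0.8),
]
batches = list(sorted_batches(rows, batch_size=3))
```

Each yielded batch would then be handed to whatever insert path the application already uses. If the data already arrives in approximately timestamp order, this extra sort may make little difference.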