In addition to what Zhen suggests, I'm also curious how you are sizing your
batches in manual-flush mode. With 128 hash partitions, each batch fans out
into 128 RPCs, so if, for example, you are only batching 1000 rows at a
time, you'll end up with a lot of fixed per-RPC overhead to insert just
1000/128 = ~8 rows per RPC.
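
To put rough numbers on that overhead, here is a minimal sketch (plain arithmetic, not Kudu client API code; the 50,000-row batch is a hypothetical comparison point, and it assumes rows are spread evenly across partitions):

```java
// Sketch: how batch size interacts with hash partitioning. Assuming an
// even spread of rows, each manual flush sends roughly one RPC per
// partition, so the per-RPC payload is batchRows / partitions.
public class BatchSizing {
    // Approximate rows carried by each per-partition RPC, rounded.
    static long rowsPerRpc(long batchRows, int partitions) {
        return Math.round((double) batchRows / partitions);
    }

    public static void main(String[] args) {
        System.out.println(rowsPerRpc(1000, 128));   // prints 8: mostly fixed overhead
        System.out.println(rowsPerRpc(50000, 128));  // prints 391: far better amortization
    }
}
```

The takeaway: with a fixed partition count, growing the batch by 50x grows the useful payload of each RPC by 50x while the fixed cost per RPC stays constant.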

Generally I would expect an 8 node cluster (even with HDDs) to be able to
sustain several hundred thousand rows/second insert rate. Of course, it
depends on the size of the rows and also the primary key you've chosen. If
your primary key is generally increasing (such as the Kafka offset),
then you should see very little compaction and good performance.

-Todd

On Tue, Oct 31, 2017 at 6:20 PM, Zhen Zhang <zhqu...@gmail.com> wrote:

> Maybe you can increase your number of consumers? In my opinion, more
> insert threads can give better throughput.
>
> 2017-10-31 15:07 GMT+08:00 Chao Sun <sunc...@uber.com>:
>
>> OK. Thanks! I changed to manual flush mode and it increased to ~15K /
>> sec. :)
>>
>> Is there any other tuning I can do to further improve this? Also, how
>> much would SSDs help in this case (upsert-only workload)?
>>
>> Thanks again,
>> Chao
>>
>> On Mon, Oct 30, 2017 at 11:42 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>
>>> If you want to manage batching yourself, you can use manual flush
>>> mode. The easiest option would be the auto-flush background mode.
>>>
>>> Todd
>>>
>>> On Oct 30, 2017 11:10 PM, "Chao Sun" <sunc...@uber.com> wrote:
>>>
>>>> Hi Todd,
>>>>
>>>> Thanks for the reply! I used a single Kafka consumer to pull the data.
>>>> For Kudu, I was doing something very simple that basically just follows
>>>> the example here
>>>> <https://github.com/cloudera/kudu-examples/blob/master/java/java-sample/src/main/java/org/kududb/examples/sample/Sample.java>.
>>>> Specifically:
>>>>
>>>> loop {
>>>>   Insert insert = kuduTable.newInsert();
>>>>   PartialRow row = insert.getRow();
>>>>   // fill the columns
>>>>   kuduSession.apply(insert);
>>>> }
>>>>
>>>> I didn't specify the flush mode, so it will pick up AUTO_FLUSH_SYNC
>>>> as the default?
>>>> Should I use MANUAL_FLUSH instead?
>>>>
>>>> Thanks,
>>>> Chao
>>>>
>>>> On Mon, Oct 30, 2017 at 10:39 PM, Todd Lipcon <t...@cloudera.com>
>>>> wrote:
>>>>
>>>>> Hey Chao,
>>>>>
>>>>> Nice to hear you are checking out Kudu.
>>>>>
>>>>> What are you using to consume from Kafka and write to Kudu? Is it
>>>>> possible that it is Java code and you are using the SYNC flush mode? That
>>>>> would result in a separate round trip for each record and thus very low
>>>>> throughput.
>>>>>
>>>>> Todd
>>>>>
>>>>> On Oct 30, 2017 10:23 PM, "Chao Sun" <sunc...@uber.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> We are evaluating Kudu (version kudu 1.3.0-cdh5.11.1, revision
>>>>> af02f3ea6d9a1807dcac0ec75bfbca79a01a5cab) on an 8-node cluster.
>>>>> The data are coming from Kafka at a rate of around 30K rows/sec, and
>>>>> are hash partitioned into 128 buckets. However, with default settings,
>>>>> Kudu can only consume the topics at a rate of around 1.5K rows/sec.
>>>>> This is a direct ingest with no transformation on the data.
>>>>>
>>>>> Could this be because I was using the default configurations? Also, we
>>>>> are using Kudu on HDDs - could that be related?
>>>>>
>>>>> Any help would be appreciated. Thanks.
>>>>>
>>>>> Best,
>>>>> Chao
>>>>>
>>>>>
>>>>>
>>>>
>>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
