Re: Proposal: remove default partitioning for new tables

2016-05-19 Thread Abhi Basu
I think this is a very reasonable feature request. I have recently started
working with Kudu, and the "default" behavior has already tripped me up a
couple of times.

Thanks,

Abhi

On Thu, May 19, 2016 at 4:03 PM, Dan Burkert <danburk...@apache.org> wrote:

> Hi all,
>
> One of the issues that trips up new Kudu users is the uncertainty about
> how partitioning works, and how to use partitioning effectively.  Much of
> this can be addressed with better documentation and explanatory materials,
> and that should be an area of focus leading up to our 1.0 release. However,
> the default partitioning behavior is suboptimal, and changing the default
> could lead to significantly less user confusion and frustration. Currently,
> when creating a new table, Kudu defaults to using only a single tablet,
> which is a known anti-pattern.  This can be painful for users who create a
> table assuming Kudu will have good defaults, and begin loading data only to
> find out later that they will need to recreate the table with partitioning
> to achieve good results.
>
> A better default partitioning strategy might be hash partitioning over the
> primary key columns, with a number of hash buckets based on the number of
> tablet servers (perhaps something like 3x the number of tablet servers).
> This would alleviate the worst scalability issues with the current
> default; however, it has a few downsides of its own. Hash partitioning is
> not appropriate for every use case, and any rule-of-thumb number of
> tablets we could come up with will not always be optimal.
>
> Given that there is no bullet-proof default, that changing the
> partitioning strategy after table creation is impossible, and that
> changing the default partitioning strategy would be a backwards-incompatible
> change, I propose we remove the default altogether.  Users would be
> required to explicitly specify the table partitioning during creation,
> and failing to do so would result in an illegal argument error.  Users
> who really do want only a single tablet will still be able to get one by
> explicitly configuring range partitioning with no split rows.
>
> I'd like to get community feedback on whether this seems like a good
> direction to take.  I have put together a patch; you can check out the
> changes to the test files to see what it looks like to add partitioning
> explicitly in cases where the default was being relied on.
> http://gerrit.cloudera.org:8080/#/c/3131/
>
> - Dan
>



-- 
Abhi Basu
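
For readers new to Kudu partitioning, here is a minimal sketch of what
explicitly specifying partitioning at create-table time could look like with
the Kudu Java client. The master address, table name, column names, and
bucket count are hypothetical, and the exact API may differ slightly between
releases (older clients lived under org.kududb rather than org.apache.kudu):

import java.util.Arrays;
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;

public class ExplicitPartitioningExample {
  public static void main(String[] args) throws Exception {
    KuduClient client =
        new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    try {
      Schema schema = new Schema(Arrays.asList(
          new ColumnSchema.ColumnSchemaBuilder("id", Type.INT64).key(true).build(),
          new ColumnSchema.ColumnSchemaBuilder("value", Type.DOUBLE).build()));

      // Explicit hash partitioning over the primary key column, using the
      // rule of thumb from the proposal: ~3x the number of tablet servers
      // (here assumed to be 4 servers, hence 12 buckets).
      CreateTableOptions opts = new CreateTableOptions()
          .addHashPartitions(Arrays.asList("id"), 12);

      // A user who really does want a single tablet could instead ask for
      // range partitioning on the key with no split rows:
      //   new CreateTableOptions().setRangePartitionColumns(Arrays.asList("id"));

      client.createTable("my_table", schema, opts);
    } finally {
      client.close();
    }
  }
}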


Re: CDH 5.5 - Kudu error not enough space remaining in buffer for op

2016-05-18 Thread Abhi Basu
I have tried with batch_size=500 and still get the same error. For your
reference, attached is some info that may help diagnose the issue.

Error: Error while applying Kudu session.: Incomplete: not enough space
remaining in buffer for op (required 46.7K, 7.00M already used


Config settings:

Kudu Tablet Server Block Cache Capacity   1 GB
Kudu Tablet Server Hard Memory Limit  16 GB
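
Taking the error message at face value, a quick back-of-the-envelope check
(sketched below as a small Java snippet) suggests why even batch_size=500 is
not enough: at roughly 46.7K per op, only about 150 ops fit into the 7M
client-side buffer discussed further down this thread.

public class BufferMath {
  public static void main(String[] args) {
    // Figures taken from the error message and the discussion below.
    double opKiB = 46.7;          // "required 46.7K" per write op
    double bufferKiB = 7 * 1024;  // ~7M client-side mutation buffer
    // Roughly how many ops of this size fit before the buffer is full.
    System.out.printf("ops per flush: ~%d%n", (long) (bufferKiB / opKiB));
    // Prints ~153, so batches of 500 rows this wide still overflow.
  }
}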


On Wed, May 18, 2016 at 8:26 AM, William Berkeley <wdberke...@cloudera.com>
wrote:

> Both options are more or less the same idea: the point is that you need
> fewer rows going in per batch so you don't go over the batch size limit.
> Follow what Todd said, as he explained it more clearly and suggested a
> better way.
>
> -Will
>
> On Wed, May 18, 2016 at 10:45 AM, Abhi Basu <9000r...@gmail.com> wrote:
>
>> Thanks for the updates. I will give both options a try and report back.
>>
>> If you are interested in testing with such datasets, I can help.
>>
>> Thanks,
>>
>> Abhi
>>
>> On Wed, May 18, 2016 at 6:25 AM, Todd Lipcon <t...@cloudera.com> wrote:
>>
>>> Hi Abhi,
>>>
>>> Will is right that the error is client-side, and probably happening
>>> because your rows are so wide. Impala typically will batch 1000 rows at
>>> a time when inserting into Kudu, so if each of your rows is 7-8KB, that
>>> will overflow the max buffer size that Will mentioned. This seems quite
>>> probable if your data is 1000 columns of doubles or int64s (which are 8
>>> bytes each).
>>>
>>> I don't think his suggested workaround will help, but you can try
>>> running 'set batch_size=500' before running the create table or insert
>>> query.
>>>
>>> In terms of max supported columns, most of the workloads we are focusing
>>> on are more like typical data-warehouse tables, on the order of a couple
>>> hundred columns. Crossing into the 1000+ range enters "uncharted territory"
>>> where it's much more likely you'll hit problems like this and quite
>>> possibly others as well. Will be interested to hear your experiences,
>>> though you should probably be prepared for some rough edges.
>>>
>>> -Todd
>>>
>>> On Tue, May 17, 2016 at 8:32 PM, William Berkeley <
>>> wdberke...@cloudera.com> wrote:
>>>
>>>> Hi Abhi.
>>>>
>>>> I believe that error is actually coming from the client, not the
>>>> server. See e.g.
>>>> https://github.com/apache/incubator-kudu/blob/master/src/kudu/client/batcher.cc#L787
>>>> (NB: that link is to the master branch, not the exact release you are
>>>> using).
>>>>
>>>> If you look around there, you'll see that the max is set by something
>>>> called max_buffer_size_, which appears to be hardcoded to 7 * 1024 * 1024
>>>> bytes = 7MiB (and this is consistent with 6.96 + 0.0467 > 7).
>>>>
>>>> I think the simple workaround would be to do the CTAS as a CTAS plus an
>>>> insert-as-select. Pick a condition that bipartitions the table, so you
>>>> don't get errors from trying to double-insert rows.
>>>>
>>>> -Will
>>>>
>>>> On Tue, May 17, 2016 at 4:45 PM, Abhi Basu <9000r...@gmail.com> wrote:
>>>>
>>>>> What is the limit of columns in Kudu?
>>>>>
>>>>> I am using the 1000 gen dataset, specifically the chr22 table, which
>>>>> has 500,000 rows x 1101 columns. This table has been built in
>>>>> Impala/HDFS. I am trying to create a new Kudu table as a select from
>>>>> that table. I get the following error:
>>>>>
>>>>> Error while applying Kudu session.: Incomplete: not enough space
>>>>> remaining in buffer for op (required 46.7K, 6.96M already used
>>>>>
>>>>> When looking at http://pcsd-cdh2.local.com:8051/mem-trackers, I see
>>>>> the following. What configuration needs to be tweaked?
>>>>>
>>>>>
>>>>> Memory usage by subsystem:
>>>>>
>>>>> Id                             Parent   Limit   Current consumption  Peak consumption
>>>>> root                           none     50.12G  4.97M                6.08M
>>>>> block_cache-sharded_lru_cache  root     none    937.9K               937.9K
>>>>> code_cache-sharded_lru_cache   root     none    1B                   1B
>>>>> server                         root     none    2.3K                 201.4K
>>>>> tablet-                        server   none    530B                 200.1K
>>>>> MemRowSet-6                    tablet-  none    265B                 265B
>>>>> txn_tracker                    tablet-  64.00M  0
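
For completeness: the discussion above is about Impala's embedded client,
where the buffer size is hard-coded and batch_size is the suggested
workaround. Applications that write through the Kudu Java client directly
can instead keep batches small themselves. A rough sketch, assuming the
current Java client API and hypothetical table and column names:

import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.SessionConfiguration;

public class SmallBatchWriter {
  public static void main(String[] args) throws Exception {
    KuduClient client =
        new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    try {
      KuduTable table = client.openTable("chr22_kudu");  // hypothetical name
      KuduSession session = client.newSession();
      // Flush manually so we control how many ops accumulate per batch.
      session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH);
      session.setMutationBufferSpace(150);  // max buffered ops (count, not bytes)

      for (int i = 0; i < 1000; i++) {
        Insert insert = table.newInsert();
        insert.getRow().addLong("id", i);   // hypothetical key column
        session.apply(insert);
        if ((i + 1) % 100 == 0) {
          session.flush();                  // stay well under the buffer limit
        }
      }
      session.flush();
      session.close();
    } finally {
      client.close();
    }
  }
}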

Re: CDH 5.5 - Kudu error not enough space remaining in buffer for op

2016-05-18 Thread Abhi Basu
Thanks for the updates. I will give both options a try and report back.

If you are interested in testing with such datasets, I can help.

Thanks,

Abhi

On Wed, May 18, 2016 at 6:25 AM, Todd Lipcon <t...@cloudera.com> wrote:

> Hi Abhi,
>
> Will is right that the error is client-side, and probably happening
> because your rows are so wide. Impala typically will batch 1000 rows at a
> time when inserting into Kudu, so if each of your rows is 7-8KB, that will
> overflow the max buffer size that Will mentioned. This seems quite
> probable if your data is 1000 columns of doubles or int64s (which are 8
> bytes each).
>
> I don't think his suggested workaround will help, but you can try running
> 'set batch_size=500' before running the create table or insert query.
>
> In terms of max supported columns, most of the workloads we are focusing
> on are more like typical data-warehouse tables, on the order of a couple
> hundred columns. Crossing into the 1000+ range enters "uncharted territory"
> where it's much more likely you'll hit problems like this and quite
> possibly others as well. Will be interested to hear your experiences,
> though you should probably be prepared for some rough edges.
>
> -Todd
>
> On Tue, May 17, 2016 at 8:32 PM, William Berkeley <wdberke...@cloudera.com
> > wrote:
>
>> Hi Abhi.
>>
>> I believe that error is actually coming from the client, not the server.
>> See e.g.
>> https://github.com/apache/incubator-kudu/blob/master/src/kudu/client/batcher.cc#L787
>> (NB: that link is to the master branch, not the exact release you are
>> using).
>>
>> If you look around there, you'll see that the max is set by something
>> called max_buffer_size_, which appears to be hardcoded to 7 * 1024 * 1024
>> bytes = 7MiB (and this is consistent with 6.96 + 0.0467 > 7).
>>
>> I think the simple workaround would be to do the CTAS as a CTAS plus an
>> insert-as-select. Pick a condition that bipartitions the table, so you
>> don't get errors from trying to double-insert rows.
>>
>> -Will
>>
>> On Tue, May 17, 2016 at 4:45 PM, Abhi Basu <9000r...@gmail.com> wrote:
>>
>>> What is the limit of columns in Kudu?
>>>
>>> I am using the 1000 gen dataset, specifically the chr22 table, which has
>>> 500,000 rows x 1101 columns. This table has been built in Impala/HDFS. I
>>> am trying to create a new Kudu table as a select from that table. I get
>>> the following error:
>>>
>>> Error while applying Kudu session.: Incomplete: not enough space
>>> remaining in buffer for op (required 46.7K, 6.96M already used
>>>
>>> When looking at http://pcsd-cdh2.local.com:8051/mem-trackers, I see the
>>> following. What configuration needs to be tweaked?
>>>
>>>
>>> Memory usage by subsystem:
>>>
>>> Id                             Parent   Limit   Current consumption  Peak consumption
>>> root                           none     50.12G  4.97M                6.08M
>>> block_cache-sharded_lru_cache  root     none    937.9K               937.9K
>>> code_cache-sharded_lru_cache   root     none    1B                   1B
>>> server                         root     none    2.3K                 201.4K
>>> tablet-0000                    server   none    530B                 200.1K
>>> MemRowSet-6                    tablet-  none    265B                 265B
>>> txn_tracker                    tablet-  64.00M  0B                   28.5K
>>> DeltaMemStores                 tablet-  none    265B                 87.8K
>>> log_block_manager              server   none    1.8K                 2.7K
>>>
>>> Thanks,
>>> --
>>> Abhi Basu
>>>
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Abhi Basu