Re: Proposal: remove default partitioning for new tables
I think this is a very reasonable feature request. I have recently started working with Kudu and the "default" behavior has already tripped me up a couple of times.

Thanks,
Abhi

On Thu, May 19, 2016 at 4:03 PM, Dan Burkert <danburk...@apache.org> wrote:
> Hi all,
>
> One of the issues that trips up new Kudu users is the uncertainty about how partitioning works, and how to use partitioning effectively. Much of this can be addressed with better documentation and explanatory materials, and that should be an area of focus leading up to our 1.0 release. However, the default partitioning behavior is suboptimal, and changing the default could lead to significantly less user confusion and frustration. Currently, when creating a new table, Kudu defaults to using only a single tablet, which is a known anti-pattern. This can be painful for users who create a table assuming Kudu will have good defaults, and begin loading data only to find out later that they will need to recreate the table with partitioning to achieve good results.
>
> A better default partitioning strategy might be hash partitioning over the primary key columns, with the number of hash buckets based on the number of tablet servers (perhaps something like 3x the number of tablet servers). This would alleviate the worst scalability issues with the current default; however, it has a few downsides of its own. Hash partitioning is not appropriate for every use case, and any rule-of-thumb number of tablets we could come up with will not always be optimal.
>
> Given that there is no bullet-proof default, that changing the partitioning strategy after table creation is impossible, and that changing the default partitioning strategy is a backwards-incompatible change, I propose we remove the default altogether. Users would be required to explicitly specify the table partitioning during creation, and failing to do so would result in an illegal argument error. Users who really do want only a single tablet will still be able to get one by explicitly configuring range partitioning with no split rows.
>
> I'd like to get community feedback on whether this seems like a good direction to take. I have put together a patch; you can check out the changes to test files to see what it looks like to add partitioning explicitly in cases where the default was being relied on.
> http://gerrit.cloudera.org:8080/#/c/3131/
>
> - Dan

--
Abhi Basu
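For a concrete picture of what explicitly specifying partitioning at creation time looks like from Impala, here is a minimal sketch assuming the CDH 5.5-era Kudu storage handler integration; the table name, columns, master address, and bucket count (roughly 3x a hypothetical three-tablet-server cluster) are illustrative assumptions, not part of the proposal or the patch:

    -- Explicit hash partitioning over the primary key columns. The 9 buckets are
    -- a hypothetical choice for a 3-tablet-server cluster (~3x the server count).
    CREATE TABLE metrics (
      host STRING,
      metric STRING,
      ts BIGINT,
      value DOUBLE
    )
    DISTRIBUTE BY HASH (host, metric, ts) INTO 9 BUCKETS
    TBLPROPERTIES(
      'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
      'kudu.table_name' = 'metrics',
      'kudu.master_addresses' = 'kudu-master.example.com:7051',
      'kudu.key_columns' = 'host, metric, ts'
    );

Under the proposal, omitting the partitioning clause (or the equivalent client-side partitioning call) would fail with an illegal argument error rather than silently creating a single-tablet table.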
Re: CDH 5.5 - Kudu error not enough space remaining in buffer for op
I have tried with batch_size=500 and still get the same error. For your reference, attached is info that may help diagnose.

Error:
Error while applying Kudu session.: Incomplete: not enough space remaining in buffer for op (required 46.7K, 7.00M already used

Config settings:
Kudu Tablet Server Block Cache Capacity: 1 GB
Kudu Tablet Server Hard Memory Limit: 16 GB

On Wed, May 18, 2016 at 8:26 AM, William Berkeley <wdberke...@cloudera.com> wrote:
> Both options are more or less the same idea: the point is you need fewer rows going in per batch so you don't go over the batch size limit. Follow what Todd said, as he explained it more clearly and suggested a better way.
>
> -Will
>
> On Wed, May 18, 2016 at 10:45 AM, Abhi Basu <9000r...@gmail.com> wrote:
>
>> Thanks for the updates. I will give both options a try and report back.
>>
>> If you are interested in testing with such datasets, I can help.
>>
>> Thanks,
>>
>> Abhi
>>
>> On Wed, May 18, 2016 at 6:25 AM, Todd Lipcon <t...@cloudera.com> wrote:
>>
>>> Hi Abhi,
>>>
>>> Will is right that the error is client-side, and probably happening because your rows are so wide. Impala typically will batch 1000 rows at a time when inserting into Kudu, so if each of your rows is 7-8 KB, that will overflow the max buffer size that Will mentioned. This seems quite probable if your data is 1000 columns of doubles or int64s (which are 8 bytes each).
>>>
>>> I don't think his suggested workaround will help, but you can try running 'set batch_size=500' before running the create table or insert query.
>>>
>>> In terms of max supported columns, most of the workloads we are focusing on are more like typical data-warehouse tables, on the order of a couple hundred columns. Crossing into the 1000+ range enters "uncharted territory" where it's much more likely you'll hit problems like this and quite possibly others as well. Will be interested to hear your experiences, though you should probably be prepared for some rough edges.
>>>
>>> -Todd
>>>
>>> On Tue, May 17, 2016 at 8:32 PM, William Berkeley <wdberke...@cloudera.com> wrote:
>>>
>>>> Hi Abhi.
>>>>
>>>> I believe that error is actually coming from the client, not the server. See e.g. https://github.com/apache/incubator-kudu/blob/master/src/kudu/client/batcher.cc#L787 (NB that link is to the master branch, not the exact release you are using).
>>>>
>>>> If you look around there, you'll see that the max is set by something called max_buffer_size_, which appears to be hardcoded to 7 * 1024 * 1024 bytes = 7 MiB (and this is consistent with 6.96 + 0.0467 > 7).
>>>>
>>>> I think the simple workaround would be to do the CTAS as a CTAS + insert as select. Pick a condition that bipartitions the table, so you don't get errors trying to double-insert rows.
>>>>
>>>> -Will
>>>>
>>>> On Tue, May 17, 2016 at 4:45 PM, Abhi Basu <9000r...@gmail.com> wrote:
>>>>
>>>>> What is the limit of columns in Kudu?
>>>>>
>>>>> I am using the 1000 Genomes dataset, specifically the chr22 table, which has 500,000 rows x 1101 columns. This table has been built in Impala/HDFS. I am trying to create a new Kudu table as select from that table. I get the following error:
>>>>>
>>>>> Error while applying Kudu session.: Incomplete: not enough space remaining in buffer for op (required 46.7K, 6.96M already used
>>>>>
>>>>> When looking at http://pcsd-cdh2.local.com:8051/mem-trackers, I see the following. What configuration needs to be tweaked?
>>>>>
>>>>> Memory usage by subsystem
>>>>>
>>>>> Id                             Parent    Limit    Current Consumption  Peak Consumption
>>>>> root                           none      50.12G   4.97M                6.08M
>>>>> block_cache-sharded_lru_cache  root      none     937.9K               937.9K
>>>>> code_cache-sharded_lru_cache   root      none     1B                   1B
>>>>> server                         root      none     2.3K                 201.4K
>>>>> tablet-0000                    server    none     530B                 200.1K
>>>>> MemRowSet-6                    tablet-   none     265B                 265B
>>>>> txn_tracker                    tablet-   64.00M   0B                   28.5K
>>>>> DeltaMemStores                 tablet-   none     265B                 87.8K
>>>>> log_block_manager              server    none     1.8K                 2.7K
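To put numbers on Todd's explanation: roughly 1101 columns of 8-byte values is on the order of 8-9 KB per row, so Impala's default 1000-row batch comes to roughly 8-9 MB, well over the client's 7 MiB op buffer. A minimal sketch of the batch_size workaround as it would be run from impala-shell, using hypothetical database and table names and assuming the destination Kudu table already exists:

    -- Halve the rows per batch so each flush should stay under the ~7 MiB client buffer.
    SET BATCH_SIZE=500;

    -- Hypothetical names; kudu_chr22 is assumed to be an existing Kudu-backed table.
    INSERT INTO kudu_chr22
    SELECT * FROM hdfs_db.chr22;

By that arithmetic, 500 rows per batch should fit; the error reported at the top of this message ("7.00M already used") suggests the rows here are wider than the naive 8-bytes-per-column estimate, or that the option did not take effect for the failing statement.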
Re: CDH 5.5 - Kudu error not enough space remaining in buffer for op
Thanks for the updates. I will give both options a try and report back.

If you are interested in testing with such datasets, I can help.

Thanks,

Abhi

On Wed, May 18, 2016 at 6:25 AM, Todd Lipcon <t...@cloudera.com> wrote:
> Hi Abhi,
>
> Will is right that the error is client-side, and probably happening because your rows are so wide. Impala typically will batch 1000 rows at a time when inserting into Kudu, so if each of your rows is 7-8 KB, that will overflow the max buffer size that Will mentioned. This seems quite probable if your data is 1000 columns of doubles or int64s (which are 8 bytes each).
>
> I don't think his suggested workaround will help, but you can try running 'set batch_size=500' before running the create table or insert query.
>
> In terms of max supported columns, most of the workloads we are focusing on are more like typical data-warehouse tables, on the order of a couple hundred columns. Crossing into the 1000+ range enters "uncharted territory" where it's much more likely you'll hit problems like this and quite possibly others as well. Will be interested to hear your experiences, though you should probably be prepared for some rough edges.
>
> -Todd
>
> On Tue, May 17, 2016 at 8:32 PM, William Berkeley <wdberke...@cloudera.com> wrote:
>
>> Hi Abhi.
>>
>> I believe that error is actually coming from the client, not the server. See e.g. https://github.com/apache/incubator-kudu/blob/master/src/kudu/client/batcher.cc#L787 (NB that link is to the master branch, not the exact release you are using).
>>
>> If you look around there, you'll see that the max is set by something called max_buffer_size_, which appears to be hardcoded to 7 * 1024 * 1024 bytes = 7 MiB (and this is consistent with 6.96 + 0.0467 > 7).
>>
>> I think the simple workaround would be to do the CTAS as a CTAS + insert as select. Pick a condition that bipartitions the table, so you don't get errors trying to double-insert rows.
>>
>> -Will
>>
>> On Tue, May 17, 2016 at 4:45 PM, Abhi Basu <9000r...@gmail.com> wrote:
>>
>>> What is the limit of columns in Kudu?
>>>
>>> I am using the 1000 Genomes dataset, specifically the chr22 table, which has 500,000 rows x 1101 columns. This table has been built in Impala/HDFS. I am trying to create a new Kudu table as select from that table. I get the following error:
>>>
>>> Error while applying Kudu session.: Incomplete: not enough space remaining in buffer for op (required 46.7K, 6.96M already used
>>>
>>> When looking at http://pcsd-cdh2.local.com:8051/mem-trackers, I see the following. What configuration needs to be tweaked?
>>>
>>> Memory usage by subsystem
>>>
>>> Id                             Parent    Limit    Current Consumption  Peak Consumption
>>> root                           none      50.12G   4.97M                6.08M
>>> block_cache-sharded_lru_cache  root      none     937.9K               937.9K
>>> code_cache-sharded_lru_cache   root      none     1B                   1B
>>> server                         root      none     2.3K                 201.4K
>>> tablet-0000                    server    none     530B                 200.1K
>>> MemRowSet-6                    tablet-   none     265B                 265B
>>> txn_tracker                    tablet-   64.00M   0B                   28.5K
>>> DeltaMemStores                 tablet-   none     265B                 87.8K
>>> log_block_manager              server    none     1.8K                 2.7K
>>>
>>> Thanks,
>>> --
>>> Abhi Basu
>>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera

--
Abhi Basu
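A minimal sketch of the CTAS + insert-as-select split Will described, again assuming the CDH 5.5-era Kudu storage handler syntax; the key column, the 'pos' column, and its cutoff value are stand-ins for whatever predicate cleanly bipartitions the source rows:

    -- Load the first half of the rows through the CTAS itself. 'variant_id',
    -- 'pos', and the cutoff are hypothetical.
    CREATE TABLE kudu_chr22
    DISTRIBUTE BY HASH (variant_id) INTO 16 BUCKETS
    TBLPROPERTIES(
      'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
      'kudu.table_name' = 'kudu_chr22',
      'kudu.master_addresses' = 'kudu-master.example.com:7051',
      'kudu.key_columns' = 'variant_id'
    )
    AS SELECT * FROM hdfs_db.chr22 WHERE pos < 25000000;

    -- Load the remaining rows with the complementary predicate so no row is
    -- inserted twice and none is skipped (assuming pos is never NULL).
    INSERT INTO kudu_chr22
    SELECT * FROM hdfs_db.chr22 WHERE pos >= 25000000;

Each statement then buffers only its own half of the rows per batch, which is the point of the split; the batch_size option shown earlier can be combined with it if a single half still overflows the client buffer.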