Re: Adding some guard rails to Kudu
+1 Very reasonable approach and like that there is a semi-hard to use safety valve.
Re: Adding some guard rails to Kudu
BTW I filed a JIRA here and started linking related issues to it:
https://issues.apache.org/jira/browse/KUDU-1775

On Wed, Nov 30, 2016 at 3:25 PM, Todd Lipcon wrote:

> Hey folks,
>
> I've started working on a few patches to add "guard rails" to various
> user-specified dimensions in Kudu. In particular, I'm planning to add
> limits to the following:
>
> - max number of columns in a table (proposal: 300)
> - max replication factor (proposal: 7)
> - max table name or column name length (proposal: 256)
> - max size of a binary/string column cell value (proposal: 64kb)
>
> The reasoning is that, even though in some cases we don't know a specific
> issue that will happen outside these limits, we've done very little testing
> (and have no automated testing) outside of these ranges. In some cases, we
> do know that there is a certain threshold that will cause a big problem (eg
> large cell sizes can cause tablet servers to crash). In other cases, it's
> just "unknown territory".
>
> In all cases, I'm planning on making the limits overridable via an
> "unsafe" configuration flag. That means that a user can run with
> "--unlock_unsafe_flags --max_identifier_length=1000" if they want to, but
> they're explicitly accepting some risk that they're entering untested
> territory.
>
> Of course, in all cases, if we hear that there are people who are bumping
> the maxes higher than the defaults and having good results, we can consider
> raising the maximum, but I think it's smarter to start conservatively low
> and raise later as we increase test coverage. Also, I'm sure down the road
> we'll add features such as BLOB support or sparse column support, and at
> that time we can remove the corresponding guard rails.
>
> I'm sending this note to both user@ and dev@ to solicit feedback. Are
> there any other dimensions people can think of where we should probably add
> guard-rails? Is anyone out there already outside of the above ranges and
> can make a case that we're being too conservative?
>
> Thanks
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera

--
Todd Lipcon
Software Engineer, Cloudera
Adding some guard rails to Kudu
Hey folks,

I've started working on a few patches to add "guard rails" to various
user-specified dimensions in Kudu. In particular, I'm planning to add
limits to the following:

- max number of columns in a table (proposal: 300)
- max replication factor (proposal: 7)
- max table name or column name length (proposal: 256)
- max size of a binary/string column cell value (proposal: 64 KB)

The reasoning is that, even though in some cases we don't know a specific
issue that will happen outside these limits, we've done very little testing
(and have no automated testing) outside of these ranges. In some cases, we
do know that there is a certain threshold that will cause a big problem (e.g.
large cell sizes can cause tablet servers to crash). In other cases, it's
just "unknown territory".

In all cases, I'm planning on making the limits overridable via an
"unsafe" configuration flag. That means that a user can run with
"--unlock_unsafe_flags --max_identifier_length=1000" if they want to, but
they're explicitly accepting some risk that they're entering untested
territory.

Of course, in all cases, if we hear that there are people who are bumping
the maxes higher than the defaults and having good results, we can consider
raising the maximum, but I think it's smarter to start conservatively low
and raise later as we increase test coverage. Also, I'm sure down the road
we'll add features such as BLOB support or sparse column support, and at
that time we can remove the corresponding guard rails.

I'm sending this note to both user@ and dev@ to solicit feedback. Are
there any other dimensions people can think of where we should probably add
guard-rails? Is anyone out there already outside of the above ranges and
can make a case that we're being too conservative?

Thanks
-Todd
--
Todd Lipcon
Software Engineer, Cloudera
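To make the proposal concrete, here is a rough Python sketch of how limits like these could be checked, with overrides refused unless an "unlock unsafe" switch is set. This is only an illustration, not Kudu's implementation: apart from max_identifier_length (which mirrors the flag named above), every limit name, function name, and error type is hypothetical, identifier length is counted in characters, and 64 KB is taken to mean 64 * 1024 bytes.

```python
# Proposed defaults from the message above. Only "max_identifier_length"
# matches a flag name mentioned in the thread; the other keys are hypothetical.
DEFAULT_LIMITS = {
    "max_num_columns": 300,
    "max_replication_factor": 7,
    "max_identifier_length": 256,
    "max_cell_size_bytes": 64 * 1024,  # assumption: "64 KB" means 65536 bytes
}


def resolve_limits(overrides, unlock_unsafe=False):
    """Return the effective limits, refusing overrides unless unlocked."""
    unknown = set(overrides) - set(DEFAULT_LIMITS)
    if unknown:
        raise KeyError("unknown limit(s): %s" % ", ".join(sorted(unknown)))
    if overrides and not unlock_unsafe:
        raise PermissionError("overriding a limit requires unlocking unsafe flags")
    return {**DEFAULT_LIMITS, **overrides}


def check_table(name, column_names, replication_factor, limits):
    """Return a list of human-readable violations (empty if the table is OK)."""
    problems = []
    if len(column_names) > limits["max_num_columns"]:
        problems.append("too many columns: %d" % len(column_names))
    if replication_factor > limits["max_replication_factor"]:
        problems.append("replication factor too high: %d" % replication_factor)
    for ident in [name, *column_names]:
        if len(ident) > limits["max_identifier_length"]:
            problems.append("identifier too long: %d characters" % len(ident))
    return problems


def check_cell(value, limits):
    """True if a binary/string cell value fits within the size limit."""
    return len(value) <= limits["max_cell_size_bytes"]
```

With the defaults, a 1000-character table name is reported as a violation; calling resolve_limits({"max_identifier_length": 1000}, unlock_unsafe=True) accepts it, which is the "explicitly accepting some risk" path described above.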
Re: Good way to find "Real" size of the tables
On Wed, Nov 30, 2016 at 6:26 AM, Weber, Richard wrote:

> Hi All,
>
> I'm trying to figure out the right/best/easiest way to find out how much
> space a given table is taking up on the various tablet servers. I'm
> looking really at finding:
> * Physical space taken on all disks
> * Logical space taken on all disks
> * Sizing of Indices/Bloom Filters, etc.
> * Sizing with and without replication.
>
> I'm trying to run an apples vs apples comparison of how big data is when
> stored in Kudu compared to storing it in its native format (Gzipped CSV)
> as well as in Parquet format on HDFS. Ultimately, I'd like to be able to
> do reporting on the different tables to say Table X is taking up Y TB,
> where Y consists of A physical size, B Index, C Bloom, etc.
>
> Looking through the Web UI I don't really see any good summary of how much
> space the entire table is taking. It seems like I'd need to walk through
> each Tablet server, connect to the metrics page and generate the summary
> information myself.

Yea, unfortunately we do not expose much of this information in a useful
way at the moment. The metrics page is the best source of info for the
various sizes, and even those are often estimates rather than exact
figures.

In terms of cross-server metrics aggregation, it's been our philosophy so
far that we should try to avoid doing a poor job of things that other
systems are likely to do better -- metrics aggregation being one such thing.
It's likely we'll add simple aggregation of table sizes, since that info is
very useful for SQL engines to do JOIN ordering, but I don't think we'd
start adding the more granular breakdowns like indexes, blooms, etc.

If your use case is a one-time experiment to understand the data volumes,
it would be pretty straightforward to write a tool to do this kind of
summary against the on-disk metadata of a tablet server. For example, you
can load the tablet metadata, group the blocks by type/column, and then
aggregate as you prefer. Unfortunately this would give you only the
physical size and not the logical, since you'd have to scan the actual data
to know its uncompressed sizes.

If you have any interest in helping to build such a tool I'd be happy to
point you in the right direction. Otherwise let's file a JIRA to add this
as a new feature in a future release.

-Todd
--
Todd Lipcon
Software Engineer, Cloudera
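The "group the blocks by type/column, then aggregate" step could look roughly like the Python sketch below. It assumes the tablet metadata has already been loaded and flattened into (column, block_type, size_bytes) records for a single replica; that record shape, the block type labels, and the function name are hypothetical and do not reflect Kudu's actual on-disk metadata format. As noted above, this only yields physical sizes.

```python
from collections import defaultdict


def summarize_blocks(blocks, replication_factor=1):
    """Aggregate physical block sizes by (column, block_type).

    blocks: iterable of (column, block_type, size_bytes) tuples describing
            one replica of the table (hypothetical record shape).
    Returns (per-group totals, total for one replica, total across replicas).
    """
    totals = defaultdict(int)
    for column, block_type, size_bytes in blocks:
        totals[(column, block_type)] += size_bytes
    single_replica = sum(totals.values())
    # Assumption: every replica occupies about the same physical space.
    return dict(totals), single_replica, single_replica * replication_factor


# Made-up records for illustration only.
example_blocks = [
    ("key", "data", 100),
    ("key", "bloom", 10),
    ("val", "data", 300),
    ("key", "data", 50),
]
by_group, one_replica, all_replicas = summarize_blocks(example_blocks, 3)
```

Here by_group[("key", "data")] is 150, one_replica is 460, and all_replicas is 1380, which covers the "with and without replication" breakdown from the original question.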
Good way to find "Real" size of the tables
Hi All,

I'm trying to figure out the right/best/easiest way to find out how much
space a given table is taking up on the various tablet servers. I'm
looking really at finding:
* Physical space taken on all disks
* Logical space taken on all disks
* Sizing of Indices/Bloom Filters, etc.
* Sizing with and without replication.

I'm trying to run an apples vs apples comparison of how big data is when
stored in Kudu compared to storing it in its native format (Gzipped CSV)
as well as in Parquet format on HDFS. Ultimately, I'd like to be able to
do reporting on the different tables to say Table X is taking up Y TB,
where Y consists of A physical size, B Index, C Bloom, etc.

Looking through the Web UI I don't really see any good summary of how much
space the entire table is taking. It seems like I'd need to walk through
each Tablet server, connect to the metrics page and generate the summary
information myself.

Am I overlooking something?

--Rick Weber
riwe...@akamai.com
Impala-KUDU debian 8 support
Hi,

I have managed to deploy Kudu to a cluster (on Debian Jessie), but was
unable to install Impala-Kudu (using CM). I have found out it is because
there is no Impala-Kudu parcel for Debian Jessie (only for Ubuntu Trusty
and RHEL 6 and 7).

Is there any workaround available, or do you have any information about
future Debian 8 support?

Thanks.