Todd,

We are not going crazy with normalization; we only normalize where necessary. 
For example, we have a table for profiles and one for behaviors, joined 
through a behavior status table. Each of these tables is de-normalized when it 
comes to basic attributes. That’s the extent of it. From the sound of it, we 
are good for now.
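
Todd's point about dictionary encoding below can be sketched with a toy example (plain Python, not Kudu's actual implementation; the sample values are made up): each distinct value in a column is stored once, and rows carry only small integer codes, so a repeated de-normalized attribute costs little more than a hand-rolled id/lookup-table split would.

```python
# Toy dictionary encoding (illustration only, not Kudu's code):
# each distinct value is stored once in a dictionary, and the
# column itself becomes a list of small integer codes.
def dict_encode(column):
    codes, mapping = [], {}
    for value in column:
        if value not in mapping:
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return codes, list(mapping)

# A repeated 'date' attribute, as it would appear de-normalized (made-up data).
dates = ["2016-10-07", "2016-10-07", "2016-10-10", "2016-10-07"]
codes, dictionary = dict_encode(dates)
print(codes)       # [0, 0, 1, 0]
print(dictionary)  # ['2016-10-07', '2016-10-10']
```

The per-row cost collapses to a small code plus one shared dictionary entry, which is roughly what a separate 'date_id'/'dates' table would buy, without the extra join at query time.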

Thanks,
Ben


> On Oct 10, 2016, at 4:15 PM, Todd Lipcon <t...@cloudera.com> wrote:
> 
> Hey Ben,
> 
> Yea, we currently don't do great with very wide tables. For example, on 
> flushes, we'll separately write and fsync each of the underlying columns, so 
> if you have hundreds, it can get very expensive. Another factor is that 
> currently every 'Write' RPC actually contains the full schema information for 
> all columns, regardless of whether you've set them for a particular row.
> 
> I'm sure we'll make improvements in these areas in the coming months/years, 
> but for now, the recommendation is to stick with a schema that looks more 
> like an RDBMS schema than an HBase one.
> 
> However, I wouldn't go _crazy_ on normalization. For example, I wouldn't 
> bother normalizing out a 'date' column into a 'date_id' and separate 'dates' 
> table, as one might have done in a fully normalized RDBMS table in days of 
> yore. Kudu's columnar layout, in conjunction with encodings like dictionary 
> encoding, makes that kind of normalization ineffective or even 
> counter-productive, since it introduces extra joins and query-time complexity.
> 
> One other item to note: more normalized schemas demand more of your query 
> engine's planning capabilities. If you aren't doing joins, a 
> very dumb query planner is fine. If you're doing complex joins across 10+ 
> tables, then the quality of plans makes an enormous difference in query 
> performance. To speak in concrete terms, I would guess that with more heavily 
> normalized schemas, Impala's query planner would do a much better job than 
> Spark's, given that we don't currently expose information on table sizes to 
> Spark and thus it's likely to do a poor job of join ordering.
> 
> Hope that helps
> 
> -Todd
> 
> 
> On Fri, Oct 7, 2016 at 7:47 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
> I would like to know if normalization techniques should or should not be 
> necessary when modeling table schemas in Kudu. I read that a table with 
> around 50 columns is ideal. This would mean that very wide tables should be 
> avoided.
> 
> Thanks,
> Ben
> 
> 
> 
> 
> -- 
> Todd Lipcon
> Software Engineer, Cloudera