I had to pull over to email; I'm driving to lunch and can't decide what to eat.
I just finished compiling around 200,000 cemetery records from many sources throughout the USA. Because I had many sources, I created a single table to house all of the data. This way I didn't have to worry about any relationships. I did, however, normalize the data formatting to make relationship building easier later: states were two characters long, phone numbers were stripped of alpha and special characters, and cities were verified for spelling. This was only to plan for future normalizing for the web.

I agree it's about your available resources and plans for usage. For gathering data, I recommend the flat-file-in-an-RDBMS approach. For warehousing, some relationships may bring benefits over none. For web, go all out. It's about your application and resources.

For your project, pulling from different sources, I'd recommend either a single table, or putting all data into the same database tables with relationship constraints while adding a key to identify the source of the data, so you don't need to put each source in its own database.

How's that?

Sent from my iPhone

On May 20, 2010, at 1:18 PM, Victor Villa <[email protected]> wrote:

> William, got to tell ya, we at #uphpu have been debating your post
> for the last few minutes.
>
> For data harvesting, and for reporting, it's best to not normalize
> your data but to store it all in a single table. After, you can
> normalize it (or for data reporting, leave it as is).
>
> Next time, a single table for harvesting, a merge of all data, then
> normalize into other tables :)
>
> After having had normalization pounded into my head the first
> several years of my programming life, i have to admit, it's the only
> way i think.
>
> I wonder if you or anybody else would like to comment on the
> discussion.
>
> [12:55] <@mindjuju> not normalize data?
> [12:55] <@mindjuju> scary
> [13:02] <@mindjuju> that's curious, i've reread will like 3x, and I still think it's a little off. I've always thought that if you're collecting data from the web that's going to be used for the web, it should be stored relationally, and that if it is going to be data warehouse directly, it should be stored suited for reporting, and if it is both, for efficiency purposes, store as relational for web and convert to non-normalized for dbwarehouse
> [13:02] <@mindjuju> though truth of it, i've never had to prep data for a specific data warehouse structure
> [13:03] <@mindjuju> i do have large sums of data i collect, but i don't think enough to constitute a warehouse
> [13:03] <+josephscott> I think in most cases the largest factor is the amount of data you are storing
> [13:04] <+josephscott> a reasonably normalized DB can do pretty much anything you need when the size of the data is relatively small
> [13:06] <@mindjuju> so things break down with larger dataset josephscott?
> [13:06] <+josephscott> and given the increase in today's computing power, large can often mean millions of rows
> [13:07] <+josephscott> there are different pain points that come into play as the amount of data gets huge
> [13:07] <@mindjuju> so you're saying in some cases this is better? |name|rank|serial|address|skill1|skill2|skill3|
> [13:07] <+josephscott> for instance, you end up with different backup/restore methods when data size is huge
> [13:08] <+josephscott> and queries too, when things don't fit into memory any more things get slow/ugly
> [13:08] <+josephscott> mindjuju: I'm saying in some cases it could be
> [13:08] <@mindjuju> curious
> [13:08] <+josephscott> and hopefully you've got a smart person to figure out if your particular situation is one of those
> [13:09] <+josephscott> tech isn't about finding one neat technique and applying it to everything, it's about figuring out what your needs/issues/pain points are and designing to deal with them
> [13:10] <+josephscott> and the beauty of all this is that figuring those things out is in a constant state of flux as well
> [13:10] <@mindjuju> i've got to admit, i'm addicted to uniformity
> [13:10] <@mindjuju> i like all my tables and databases lined up all neat and organized
> [13:11] <+josephscott> given the same problem today and 3 years from now, you may solve it 3 years from now in a different way
> [13:12] <xtrementl> yeah given new tech to deal with existing issues and new issues emerging
> [13:14] <+josephscott> and increase in computing power/capability and reduction in cost for existing hardware
> [13:14] <+josephscott> and changes in experience, hopefully everyone continues to learn over the span of something like 3 years

_______________________________________________
UPHPU mailing list
[email protected]
http://uphpu.org/mailman/listinfo/uphpu
IRC: #uphpu on irc.freenode.net
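
For anyone who wants something concrete, here is a rough sketch of the single-table harvesting approach described at the top of this message: one flat table, a key identifying the source, and formatting cleaned up on the way in (two-letter states, digits-only phone numbers). It uses Python's built-in sqlite3 module only because it is easy to run; the table name `cemetery_raw`, the column names, and the helper functions are all made up for illustration and are not the schema I actually used.

```python
import re
import sqlite3


def normalize_state(value):
    """Trim and upper-case a state code; return None unless it is two letters."""
    code = value.strip().upper()
    return code if re.fullmatch(r"[A-Z]{2}", code) else None


def normalize_phone(value):
    """Strip everything except digits from a phone number."""
    return re.sub(r"\D", "", value)


def create_flat_table(conn):
    """Create one flat harvesting table; no foreign keys or related tables."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS cemetery_raw (
            id     INTEGER PRIMARY KEY,
            source TEXT NOT NULL,   -- which feed or site the row came from
            name   TEXT NOT NULL,
            city   TEXT,
            state  TEXT,            -- two-letter code, or NULL if unrecognized
            phone  TEXT             -- digits only
        )
        """
    )


def load_records(conn, source, records):
    """Insert harvested dicts from one source, cleaning formatting on the way in."""
    rows = [
        (
            source,
            r["name"].strip(),
            r["city"].strip(),
            normalize_state(r["state"]),
            normalize_phone(r["phone"]),
        )
        for r in records
    ]
    conn.executemany(
        "INSERT INTO cemetery_raw (source, name, city, state, phone) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    # Invented sample row, only to show the call pattern.
    conn = sqlite3.connect(":memory:")
    create_flat_table(conn)
    sample = [{"name": "Example Cemetery", "city": "Provo",
               "state": " ut ", "phone": "(801) 555-0100"}]
    load_records(conn, "site_a", sample)
    print(conn.execute("SELECT * FROM cemetery_raw").fetchall())
```

Because every row carries its `source`, the data from all the feeds can sit in the same table, and splitting it into normalized tables later is a matter of selecting distinct values out of the cleaned columns.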
