Thanks for your replies, Sean and Martin. I agree that the ETL tools are complex in themselves, and I may as well spend that learning curve on a lower-level tool-set that has the added value of greater flexibility.
Can you suggest a good book or tutorial to help me build a data warehouse in python? Bill Inmon's "Building the Data Warehouse" is 17 years old, and I've been cautioned against Kimball. Thanks. On Sep 22, 3:40 pm, "Martin P. Hellwig" <martin.hell...@dcuktec.org> wrote: > snfctech wrote: > > Does anyone have experience building a data warehouse in python? Any > > thoughts on custom vs using an out-of-the-box product like Talend or > > Informatica? > > > I have an integrated system Dashboard project that I was going to > > build using cross-vendor joins on existing DBs, but I keep hearing > > that a data warehouse is the way to go. e.g. I want to create orders > > and order_items with relations to members (MS Access DB), products > > (flat file) and employees (MySQL). > > > Thanks in advance for any tips. > > My experience is that if you enjoy hacking around databases and are > proficient in Python, than for small scale solutions it is preferable to > do it yourself. If you need a large scale solutions with advanced > requirements, building it yourself is mostly the only way. > > I have build a rather complex datawarehouse system in the past (well > actually more like a centralised synchronisation hub, having input and > reporting databases as satellite clients), shoving data from around 500 > databases (typically 5 Gb in size each) spread over the world. > > The only commercial solutions I reviewed was Business Objects Data > Integrator and Oracle Warehouse Builder. > > These tools where quite flexible and if you already have a license deal > which includes these tools I would definitely recommend to have more > than just a look at it. > > If not and you are comfortably with using python to shovel data from A > to B and transform it at the same time (moving relational data > automatically into a EAV model and back again, for example) than > building your own solution will probably save you money and time (as > opposed to learn how to use that ETL tool). > > This will require you to have at least interest in the following subjects: > - Authorization, may everybody use all data or should it be limited to a > subset on the data depending on the data? > (My solution was one centralised hub which contains all data but is only > accessible to special 'client' servers strictly maintained by me which > only sync the data relevant to them). > > - Authenticity, if you have different values for the same thing, which > one should be considered authoritative and if yes may it be pushed back > to the un-authoritative? > > -Synchronisation, you really don't want to push/pull all of the database > content over every x times, so how can you delta it and is there a way > to do this only when the data changes (push vs pull)? > > -ATOMIC, how long may the data be out of date and is it allowed to > partially update > > -Using and maintaining multiple databases, hopefully spread over > multiple systems. I had a server for each production DB, a server that > mirrored that production DB with some added columns per table for > external synchronization purposes and a master synchronisation server > (so in essence all data was copied three times, not very efficient but > good if you like to play it on the safe side). > > -- > MPHhttp://blog.dcuktec.com > 'If consumed, best digested with added seasoning to own preference.' -- http://mail.python.org/mailman/listinfo/python-list