Miles,

Thanks for the reply. I was already thinking about ways of caching the
generated IDs and/or inserting to staging tables and then using sprocs
(or a second stage ETL process) to push to the live tables, although I
really like your idea of using identity insert with IDs calculated on
the client.
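
To make sure I'm picturing the staging approach correctly: roughly the following, sketched in Python with sqlite3 standing in for the real database (all table and column names are invented for illustration; the real thing would be a sproc or second-stage ETL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Staging tables keep the *source* ids alongside the cleaned data.
cur.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT);
    CREATE TABLE person_phone (person_id INTEGER, phone TEXT);
    CREATE TABLE stg_person (src_id INTEGER, name TEXT);
    CREATE TABLE stg_person_phone (src_person_id INTEGER, phone TEXT);
""")

# Stage 1: bulk-load cleaned + transformed rows into staging.
cur.executemany("INSERT INTO stg_person VALUES (?, ?)",
                [(101, "Alice"), (102, "Bob")])
cur.executemany("INSERT INTO stg_person_phone VALUES (?, ?)",
                [(101, "555-0001"), (102, "555-0002"), (102, "555-0003")])

# Stage 2: push parents first, then resolve child FKs in SQL by
# matching the staging source id back to the newly generated keys
# (here via a natural key, "name", since the ids were regenerated).
cur.execute("INSERT INTO person (name) SELECT name FROM stg_person")
cur.execute("""
    INSERT INTO person_phone (person_id, phone)
    SELECT p.id, sp.phone
    FROM stg_person_phone sp
    JOIN stg_person s ON s.src_id = sp.src_person_id
    JOIN person p ON p.name = s.name
""")
```

Matching on a natural key like this only works if it's unique, which is part of why your client-assigned-keys idea appeals to me.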

My only concern would be that someone might come along and use the
application to insert a person while my ETL process is running,
causing one of my inserts to fail. I guess I could trap that exception
and update my internal ID counter or something...
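
Concretely, the trap-and-retry I'm imagining looks something like this (Python/sqlite3 sketch; the starting_point offset follows your suggestion, and all names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")

STARTING_POINT = 100000   # chosen to sit above existing production ids
next_id = STARTING_POINT

# Simulate the application grabbing an id while the ETL is running.
cur.execute("INSERT INTO person (id, name) VALUES (?, ?)",
            (100001, "app user"))

for name in ["Alice", "Bob"]:
    while True:
        try:
            # Explicit-PK insert (identity_insert ON in the real DB).
            cur.execute("INSERT INTO person (id, name) VALUES (?, ?)",
                        (next_id, name))
            next_id += 1
            break
        except sqlite3.IntegrityError:
            # Collision: someone used the app mid-ETL. Skip past the
            # taken id and retry with the next candidate.
            next_id += 1
```

The downside is that a retried row ends up with a different id than the one already handed to the "person_phone" pipeline, so I'd have to feed the corrected id back somehow.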

My last question is regarding your statement "and then use the output
of the 'person' pipeline as input to a join in the 'person_phone'
pipeline".  I thought joins were for taking two rows with different
columns and joining them into a merged row with all of the columns. Is
there an example anywhere of using joins to represent parent/child,
one-to-many relationships?
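
For what it's worth, here is my current guess at the data flow you mean, in plain Python rather than the actual Rhino ETL JoinOperation API (all row and field names are invented; one parent row fans out into several merged child rows):

```python
# Output of the "person" pipeline: new ids already assigned client-side.
persons = [
    {"src_id": 101, "id": 1, "name": "Alice"},
    {"src_id": 102, "id": 2, "name": "Bob"},
]

# Raw "person_phone" rows, still carrying the *source* person id.
phones = [
    {"src_person_id": 101, "phone": "555-0001"},
    {"src_person_id": 102, "phone": "555-0002"},
    {"src_person_id": 102, "phone": "555-0003"},
]

# The join matches each child row to its parent; since a parent can
# match many children, the "merged row" shape still expresses a
# one-to-many relationship: one output row per child.
by_src = {p["src_id"]: p for p in persons}
joined = [
    {"person_id": by_src[c["src_person_id"]]["id"], "phone": c["phone"]}
    for c in phones
]
```

If that's the right picture, then my confusion was just that the join runs once per child row, not once per parent.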

Thanks again!

Seth

On Oct 14, 5:14 am, Miles Waller <[email protected]> wrote:
> Hi,
>
> I'm pretty sure there isn't built-in support as the keys are determined on
> the server, not in rhino.etl, and if it's a bulk upload then there isn't an
> easy way to get them short of re-querying.
>
> The easiest way of doing this that I've found, if you know the keys are
> unique in the input data and you can set identity_insert on in the
> database, is to just use (starting_point - original_keyvalue) as the key
> instead, choosing starting_point yourself to avoid overlaps with any
> other data you've imported.
>
> If you keep track of the keys yourself, you end up with lots of "lookups",
> which make the pipelines look more convoluted than they need to be.  In this
> case, where I have lots of lookups but not much else, I tend to create
> staging tables into which I load the cleaned + transformed data, then use
> SQL to match the new primary keys and move the data into the production
> tables.
>
> Otherwise, you're keeping track of them yourself.  In my opinion, it's
> easier if you can assign the keys yourself (and then use the output of
> the "person" pipeline as input to a join in the "person_phone" pipeline)
> rather than having the database do them, as this reduces the number of
> steps.  If your input files are sorted in the same order, then less data
> will be held in memory during the pipeline.
>
> Hope that helps,
>
> Miles
>
> On Wed, Oct 13, 2010 at 10:09 PM, Seth Petry-Johnson
> <[email protected]> wrote:
> > I'm evaluating Rhino ETL and I understand the basics, but I'm confused
> > as to how to handle foreign key relationships when loading data.
>
> > For example, my source DB has "person" and "person_phone" tables,
> > where each person can have 0 or more phone number records. In my
> > destination DB, PERSON has an autoincrementing PK. How do I structure
> > my Process and Operations such that I can link child Phone rows to the
> > parent Person?
>
> > Is this built into the ETL library, or do I need to manually keep
> > track of these IDs?
>
> > Thanks!
>
> > Seth
>

-- 
You received this message because you are subscribed to the Google Groups 
"Rhino Tools Dev" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/rhino-tools-dev?hl=en.