I need to create a flat or denormalized data structure in Phoenix.
If the sources are 3 Phoenix tables:
A (idA, colA1, colA2, colA3)
B (idB, colB1, colB2, colB3, idA, idC)
C (idC, colC1, colC2, colC3)
and the user wants to see & write queries on
table ABC (idA|idB|idC, colA1, colA2, colA3, colB1, colB2, colB3, colC1,
colC2, colC3)
where "idA|idB|idC" is a compound key of the 3 identifiers and B is the
table that has keys to join A & C to it, then it seems to me I could
approach this two ways. ( In my case number of rows are rowcountA <
rowcountB < rowcountC ).
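
For concreteness, here is a rough sketch of the schemas in Phoenix DDL before
I describe the two approaches (the column types are just assumptions for
illustration; I've used BIGINT keys and VARCHAR payload columns):

CREATE TABLE A (
    idA   BIGINT NOT NULL PRIMARY KEY,
    colA1 VARCHAR, colA2 VARCHAR, colA3 VARCHAR);

CREATE TABLE B (
    idB   BIGINT NOT NULL PRIMARY KEY,
    colB1 VARCHAR, colB2 VARCHAR, colB3 VARCHAR,
    idA   BIGINT, idC BIGINT);

CREATE TABLE C (
    idC   BIGINT NOT NULL PRIMARY KEY,
    colC1 VARCHAR, colC2 VARCHAR, colC3 VARCHAR);

-- the flattened target, keyed on the compound idA|idB|idC
CREATE TABLE ABC (
    idA   BIGINT NOT NULL, idB BIGINT NOT NULL, idC BIGINT NOT NULL,
    colA1 VARCHAR, colA2 VARCHAR, colA3 VARCHAR,
    colB1 VARCHAR, colB2 VARCHAR, colB3 VARCHAR,
    colC1 VARCHAR, colC2 VARCHAR, colC3 VARCHAR
    CONSTRAINT pk PRIMARY KEY (idA, idB, idC));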
1) Create a program / MapReduce job that works through B, looks up the
appropriate A and C rows, and writes out a new table "ABC" containing the
flattened data. (I could use M/R to do the joins efficiently; one possible
way to populate ABC is sketched after this list.)
2) Join the tables at runtime. But what would performance be like if I did
the join at query time?
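
If the Phoenix version in use has join support, the population step of
option (1) might also be expressible as a single UPSERT ... SELECT instead of
a hand-written MapReduce job; this is only a sketch under that assumption:

-- Option (1): materialize ABC once; later changes to A, B or C are not
-- reflected until the statement is re-run.
UPSERT INTO ABC (idA, idB, idC,
                 colA1, colA2, colA3,
                 colB1, colB2, colB3,
                 colC1, colC2, colC3)
SELECT b.idA, b.idB, b.idC,
       a.colA1, a.colA2, a.colA3,
       b.colB1, b.colB2, b.colB3,
       c.colC1, c.colC2, c.colC3
FROM B b
JOIN A a ON a.idA = b.idA
JOIN C c ON c.idC = b.idC;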
For certain types of join, it would seem that a call to table B for
columns idB, colB1, colB2, colB3, idA, idC could cause a coprocessor to
execute inside a region server and pull in data dynamically from A & C.
Assuming that idA != idB != idC, there is no reason to suppose the
associated rows from A & C would be local to that region server, so there
would be lots of network traffic to achieve this naive join,
particularly compared to some other more efficient method.
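
For reference, the kind of query I have in mind for option (2) would look
something like the following (again assuming a Phoenix version that can
execute joins; otherwise a coprocessor would effectively have to do the same
lookups by hand, and the filter value here is just a hypothetical example):

-- Option (2): no ABC table; every query pays the cost of looking up the
-- matching A and C rows at query time.
SELECT b.idB, a.colA1, c.colC1
FROM B b
JOIN A a ON a.idA = b.idA
JOIN C c ON c.idC = b.idC
WHERE b.colB1 = 'some value';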
Is my thinking about option (2) correct, i.e. that assuming the A, B, C
data do not fit into memory, (2) would perform poorly compared to the
classic denormalized or flattened table? It just seems so wasteful to store
colA1, colA2, colA3 again and again and again.
Thanks,
Andrew.