I need to create a flat or denormalized data structure in Phoenix.
If the sources are 3 Phoenix tables:
A (idA, colA1, colA2, colA3)
B (idB, colB1, colB2, colB3, idA, idC)
C (idC, colC1, colC2, colC3)
and the user wants to see & write queries on
table ABC (idA|idB|idC, colA1, colA2, colA3, colB1, colB2, colB3, colC1,
colC2, colC3)
where "idA|idB|idC" is a compound key of the 3 identifiers and B is the
table that has keys to join A & C to it, then it seems to me I could
approach this two ways. ( In my case number of rows are rowcountA <
rowcountB < rowcountC ).
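
For concreteness, here is a rough sketch of the schemas in Phoenix DDL before
I describe the two approaches (the column types are just assumptions for
illustration; I've used BIGINT keys and VARCHAR payload columns):

CREATE TABLE A (
    idA   BIGINT NOT NULL PRIMARY KEY,
    colA1 VARCHAR, colA2 VARCHAR, colA3 VARCHAR);

CREATE TABLE B (
    idB   BIGINT NOT NULL PRIMARY KEY,
    colB1 VARCHAR, colB2 VARCHAR, colB3 VARCHAR,
    idA   BIGINT, idC BIGINT);

CREATE TABLE C (
    idC   BIGINT NOT NULL PRIMARY KEY,
    colC1 VARCHAR, colC2 VARCHAR, colC3 VARCHAR);

-- the flattened target, keyed on the compound idA|idB|idC
CREATE TABLE ABC (
    idA   BIGINT NOT NULL, idB BIGINT NOT NULL, idC BIGINT NOT NULL,
    colA1 VARCHAR, colA2 VARCHAR, colA3 VARCHAR,
    colB1 VARCHAR, colB2 VARCHAR, colB3 VARCHAR,
    colC1 VARCHAR, colC2 VARCHAR, colC3 VARCHAR
    CONSTRAINT pk PRIMARY KEY (idA, idB, idC));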
1) Create a program / MapReduce job that works through B, looks up the
appropriate A and C rows, and writes out a new table "ABC" containing the
flattened data. (I could use M/R to do the joins efficiently; one possible
way to populate ABC is sketched after this list.)
2) Join the tables at runtime. But what would performance be like if I did
the join at query time?
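
If the Phoenix version in use has join support, the population step of
option (1) might also be expressible as a single UPSERT ... SELECT instead of
a hand-written MapReduce job; this is only a sketch under that assumption:

-- Option (1): materialize ABC once; later changes to A, B or C are not
-- reflected until the statement is re-run.
UPSERT INTO ABC (idA, idB, idC,
                 colA1, colA2, colA3,
                 colB1, colB2, colB3,
                 colC1, colC2, colC3)
SELECT b.idA, b.idB, b.idC,
       a.colA1, a.colA2, a.colA3,
       b.colB1, b.colB2, b.colB3,
       c.colC1, c.colC2, c.colC3
FROM B b
JOIN A a ON a.idA = b.idA
JOIN C c ON c.idC = b.idC;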
For certain types of join, it would seem that a call to table B for
columns idB, colB1, colB2, colB3, idA, idC could cause a coprocessor to
execute inside a region server and pull in data dynamically from A & C.
Assuming that idA != idB != idC, there is no reason to suppose the
associated rows from A & C would be local to that region server, so there
would be lots of network traffic to achieve this naive join,
particularly compared to some other more efficient method.
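
For reference, the kind of query I have in mind for option (2) would look
something like the following (again assuming a Phoenix version that can
execute joins; otherwise a coprocessor would effectively have to do the same
lookups by hand, and the filter value here is just a hypothetical example):

-- Option (2): no ABC table; every query pays the cost of looking up the
-- matching A and C rows at query time.
SELECT b.idB, a.colA1, c.colC1
FROM B b
JOIN A a ON a.idA = b.idA
JOIN C c ON c.idC = b.idC
WHERE b.colB1 = 'some value';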
Is my thinking about option (2) correct, i.e. that assuming the A, B, C
data do not fit into memory, (2) would perform poorly compared to the
classic denormalized or flattened table? It just seems so wasteful to store
colA1, colA2, colA3 again and again and again.
Thanks,
Andrew.