[Cassandra Wiki] Trivial Update of "ThomasBoose/EERD mode l components to Cassandra Column family's" by ThomasBoose

Apache Wiki Sat, 11 Dec 2010 10:42:23 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.


The "ThomasBoose/EERD model components to Cassandra Column family's" page has 
been changed by ThomasBoose.
http://wiki.apache.org/cassandra/ThomasBoose/EERD%20model%20components%20to%20Cassandra%20Column%20family%27s?action=diff&rev1=10&rev2=11

--------------------------------------------------

  
  If this is not yet making sence, read on.
  
+ 
+ 
- == Indexing ==
+ === Indexing ===
  In order to add an index to a column, other then the ColumnFamily key, we 
should to create a second ColumnFamily. Every insert, which can be either an 
insert or update in Cassandra, on the original ColumnFamily we will update the 
corresponding index.
  
  Think of a ColumnFamily cf_Person (examples in Python using pycassa)
@@ -25, +27 @@

  cfi_Person_City.insert ('Haarlem', {'234':''})
  }}}
  This way a hash will be created containing columns for every person's key 
that lives in a specific City. The ColumnFamily architecture of Cassandra can 
store a unlimited number of columns for each key. This meens that when deleting 
a person it's reference in the cfi_Person_City index should be removed first. 
When updating a person, maybe moving to anothor City, we have to remove the 
element from the cfi_Person_City first and then store it with the corresponding 
new City.'' ''
+ 
+ === Deleting values ===
+ Because of the way Cassandra clusters operate it is nearly impossible to 
delete values and know for sure the values are deleted on everynode. If values 
would simply be deleted and afterwards a node turns up which still holds the 
given value it would replicate ths value back to the existing nodes. Read more 
about deleting values at: DistributedDeletes
+ 
+ Cassandra on the other hand is isanely fast at inserting and updating values. 
This is why I would advise any programmer trying to build DBMS logic to 
introduce a value that "meens " deleted and have you DBMS tier respond "Does 
not exist" in both cases, when values actualy do not exist or contain the 
"deleted" value. Cassandra makes no distinction between updates and inserts so 
updating from "deleted" and inserting can be achieved the same way.
  
  == Relations ==
  === 1 on 1 ===
@@ -45, +52 @@

  ||John ||0555-123456 ||10.000 ||
  ||<style="text-align: center;" |2>321-21-4321 ||name ||phone ||salary ||
  ||Jane ||0555-654321 ||12.000 ||
- 
  
  
  ||||||<tablewidth="400px" tablestyle="text-align: left;"style="text-align: 
center;">CF_Phone ''' ''' ||
@@ -92, +98 @@

  ||Jane ||SE ||Amsterdam ||
  
  
- 
  ||||||<tablewidth="400px" tablestyle="text-align: left;"style="text-align: 
center;">CF_School_Unit ''' ''' ||
  ||<style="text-align: center;" |2>SE ||name ||loc ||
  ||software engineering ||hsl ||
  
  
- 
  ||||||<tablewidth="400px" tablestyle="text-align: left;"style="text-align: 
center;">CFK_School_Unit_Student ''' ''' ||
  ||<style="text-align: center;" |2>SE ||123-12-1234 ||321-21-4321 ||
- ||             ||             ||
+ || || ||
  
  
  
@@ -118, +122 @@

  As it is perfectly valid in a relational database to have a key value 
composed of several columns, in cassandra there is only one key per 
ColumnFamily. I did already discuss the need  to create seperate  
ColumnFamily's for one to many relationships given the fact that you can never 
tell for sure whether or not maybe in the future a new relation will popup to 
another entity sharing the same keyvalues. This means that we will need 5 
ColumnFamily's to implement the model above:
  
  CF_Order is a straight forward  ColumnFamily, Modeled after the design
- 
  ||||||||<style="text-align: center;">CF_Order ''' ''' ||
  ||<style="text-align: center;" |2>1234 ||order_date ||order_discount 
||customer_id ||
  ||20100808 ||0.4 ||1234 ||
- ||<style="text-align: center;" |2>4321 ||customer_id ||order_date ||          
   ||
+ ||<style="text-align: center;" |2>4321 ||customer_id ||order_date || ||
- ||3451 ||20100802 ||             ||
+ ||3451 ||20100802 || ||
- ||<style="text-align: center;" |2>1354 ||order_discount ||order_date ||       
      customer_id||
+ ||<style="text-align: center;" |2>1354 ||order_discount ||order_date 
||customer_id ||
- ||3 ||20100802 ||             3451||
+ ||3 ||20100802 ||3451 ||
  
  
  
  
  The same for CF_Product
- 
  ||||||||<style="text-align: center;">CF_Product ''' ''' ||
  ||<style="text-align: center;" |2>DSK_SGT_5GB_7200 ||description ||image 
||list_price ||
- ||5 GB Seagate harddisk ||             ||130 ||
+ ||5 GB Seagate harddisk || ||130 ||
  ||<style="text-align: center;" |2>KBD_LGT_WRL_350 ||description ||image 
||list_price ||
- ||Wireless keyboard ||             ||75 ||
+ ||Wireless keyboard || ||75 ||
  
  
  
  
  As we cannot itterate throu a ColumnFamily based on any attribute except its 
key and we wan't to use a solution to relations that is generic we create a 
seperate ColumnFamily for each one to many relationship. As we know that a many 
to many relationship is typicly solved by introducing 2 one to many 
relationships to a newly created ColumnFamily. So there will be one 
ColumnFamily to store which orders a product can be found on.
- 
  ||||||||<style="text-align: center;">CF_Product_Order ''' ''' ||
- ||<style="text-align: center;" |2>DSK_SGT_5GB_7200 ||1234 ||4321 ||           
  1354||
+ ||<style="text-align: center;" |2>DSK_SGT_5GB_7200 ||1234 ||4321 ||1354 ||
- ||             ||             ||             ||
+ || || || ||
- ||<style="text-align: center;" |2>KBD_LGT_WRL_350 ||4321 ||             ||    
         ||
+ ||<style="text-align: center;" |2>KBD_LGT_WRL_350 ||4321 || || ||
- ||             ||             ||             ||
+ || || || ||
  
  
  
  
  And we'll create one ColumnFamily that stores which products can be found on 
each order. We can see that this solution tends to invite anomalies. product 
that have 3 order columns but no reference to the product in the 
CF_Order_Product ColumnFamily. It up to the programmers of the DBMS tier to 
make sure genic code is available to keep the ColumnFamily's consistent al of 
the time.
- 
- ||||||||<style="text-align: center;">CF_Order_Product||
+ ||||||||<style="text-align: center;">CF_Order_Product ||
  ||<style="text-align: center;" |2>1234 ||DSK_SGT_5GB_7200 || || ||
  || || || ||
  ||<style="text-align: center;" |2>4321 ||KBD_LGT_WRL_350 ||DSK_SGT_5GB_7200 
|| ||
@@ -163, +163 @@

  || || || ||
  
  
+ 
+ 
  ''' '''Last but not least we'll introduce a ColumnFamily to store the 
attributes connected to the relationship between Product and Order. As 
mentioned it is not possible to create a key containing 2 seperate values. So 
we'll have to introduce a new value that conatenates both values using a strikt 
format. These formats should be part of a design if implementing a (E)ERD in 
Cassandra.
- 
  ||||||||<style="text-align: center;">CF_order_line''' ''' ||
- ||<style="text-align: center;" |2>1234_DSK_SGT_5GB_7200 ||number 
||delivery_date ||             ||
+ ||<style="text-align: center;" |2>1234_DSK_SGT_5GB_7200 ||number 
||delivery_date || ||
- ||5 ||             20101215||             ||
+ ||5 ||20101215 || ||
  ||<style="text-align: center;" |2>4321_KBD_LGT_WRL_350 ||product_discount 
||number ||delivery_date ||
- ||0.3 ||             1||20100901 ||
+ ||0.3 ||1 ||20100901 ||
  ||<style="text-align: center;" |2>4321_DSK_SGT_5GB_7200 ||product_discount 
||number ||delivery_date ||
- ||5 ||2||20100901 ||
+ ||5 ||2 ||20100901 ||
  ||<style="text-align: center;" |2>1354_DSK_SGT_5GB_7200 ||product_discount 
||number ||delivery_date ||
- ||5 ||8||20100901 ||
+ ||5 ||8 ||20100901 ||

[Cassandra Wiki] Trivial Update of "ThomasBoose/EERD mode l components to Cassandra Column family's" by ThomasBoose

Reply via email to