Thanks a lot Aaron, I will try your idea tomorrow.
For CF PropertyValues, instead of <property_value:customer_id>, should I use <customer_id:property_value> so that the columns come back in the same customer order for every property? (There will be a custom null value.)

Why is using only column names faster? It seems that it's not possible to retrieve column names without column values in Hector, for example, so even after reading your article (great, by the way), I don't get it.

On 20 Feb 2012, at 20:41, aaron morton wrote:

> If you want to read all possible values for a field, where the field has 1
> million possible values it's going to take time. No matter what data model
> you use.
>
> That said, the first model I would use is:
>
> CF: Customer
> Use this as a canonical record of the properties a customer has.
> row_key : <customer_id>
> cols: <property_name> = <property_value>
>
> CF: PropertyValues
> Use this to build the reverse index. Column names are a composite
> value of property value and customer ID.
> row_key: <property_name>
> cols: <property_value:customer_id> = EMPTY
>
> * To Insert: It is good if you can work out the delta. Just update what you
> need to in the customer, delete the old values from the PropertyValues CF and
> insert the new ones. Note: I would insert when you get the new data,
>
> * To Read:
>> - I need to retrieve all values of a field (all firstNames, all lastNames,
> Get all the values from the appropriate row.
>> - The faster the better (1 to 3 seconds)
> Things take time http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/
>> - It must preserve order: if I retrieve all countries and then all
>> lastNames, the nth country and the nth lastName should correspond to the same
>> customer.
> Can only be guaranteed if every customer has a value for every field. Or if
> you use a custom null value.
>> - Sometimes I will have to retrieve all values of multiple fields (<
>> 10)
> There is no provision for server side joins.
> If you have a query you use often it is best to materialise the result.
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 20/02/2012, at 11:49 PM, acoudeyras wrote:
>
>> Hi,
>>
>> I'm new to Cassandra and I'm looking for the best way to handle my use case.
>>
>> My entities look like:
>>
>> customers : [{
>>   id: 3F2504E0-4F89-11D3-9A0C-0305E82C3301,
>>   firstName: "Carl",
>>   lastName: "Smith",
>>   country: "FR"
>> },{
>>   id: 21EC2020-3AEA-1069-A2DD-08002B30309D,
>>   firstName: "John",
>>   lastName: "Doe",
>>   country: "EN"
>> }]
>>
>> I will use the term "field" to describe a property of a customer (lastName for
>> example).
>>
>> I will have 1 million customers and more than 300 fields (firstName,
>> lastName, ...) for each customer.
>>
>> I have two requirements:
>>
>> - I need to retrieve all values of a field (all firstNames, all lastNames,
>> ...).
>>   - The faster the better (1 to 3 seconds)
>>   - It must preserve order: if I retrieve all countries and then all
>>     lastNames, the nth country and the nth lastName should correspond to the
>>     same customer.
>>   - Sometimes I will have to retrieve all values of multiple fields (<
>>     10)
>>
>> - Data will be updated (insert, delete, update) every 10 or 20 minutes in
>> bulk; just a small number of entities will change each time. When an update
>> occurs, in input I have the whole entity (a full customer with all his
>> fields). Performance is important, but less than in the previous case (10
>> seconds for updating is ok).
>>
>> - Retrieving a customer by id or retrieving a list of customers with some
>> specific criteria is *not* a requirement.
>>
>> ---
>> Solution 1:
>>
>> Column Family : customers
>> One row for each customer : 1 million rows
>> One column for each field : 300 fields by row.
>>
>> Benefits : easy to update
>> Problem : As far as I understand, it doesn't seem to fit the Cassandra
>> model; getting all values will be slow.
>>
>> ---
>> Solution 2:
>>
>> Wide Row for the whole entity
>>
>> Column Family : datas
>> One row : customers
>> Composite Columns : (fieldName, ID) = fieldValue
>>
>> Customers : [{
>>   ("country", "3F2504E0-4F89-11D3-9A0C-0305E82C3301") = "FR",
>>   ("country", "21EC2020-3AEA-1069-A2DD-08002B30309D") = "EN",
>>   ("firstName", "3F2504E0-4F89-11D3-9A0C-0305E82C3301") = "Carl",
>>   ("firstName", "21EC2020-3AEA-1069-A2DD-08002B30309D") = "John",
>>   ("lastName", "3F2504E0-4F89-11D3-9A0C-0305E82C3301") = "Smith",
>>   ("lastName", "21EC2020-3AEA-1069-A2DD-08002B30309D") = "Doe",
>>   ...
>> }]
>>
>>
>> As far as I understand, it seems to be the fastest way to retrieve all values
>> of a field in the same order.
>> To update, I don't need to read before writing.
>>
>> Problem : the row will be very large: 300 000 000 columns. I can split
>> it into different rows based on the value of a specific field, for example
>> country.
>>
>> ---
>> Solution 3:
>>
>> Wide Row by field
>>
>> Column Family : customers
>> One row by field : so 300 rows
>> Columns : ID = FieldValue
>>
>> Benefits :
>> The row will be smaller, 1 000 000 columns.
>>
>> Problem :
>> Update seems more expensive: for every customer to update, I need to update
>> 300 rows.
>>
>> ---
>>
>> Which solution seems to be the right one? Is Cassandra really a good
>> fit for this use case?
>>
>> Thanks
>>
>> Alexis Coudeyras
>>
>> --
>> View this message in context:
>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Data-Modeling-tp7300846p7300846.html
>> Sent from the mailing list archive at Nabble.com.
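To make my ordering question concrete, here is a small in-memory sketch of how I understand the two composite layouts. It assumes column names are sorted component by component and compared as plain strings, which may not match the comparator actually configured on the column family. It uses no Cassandra or Hector calls; `NULL`, `build_row`, `read_values` and the third customer id are made up for illustration.

```python
# In-memory sketch only: no Cassandra or Hector calls.
# A "row" is modelled as a sorted list of composite column names.

NULL = ""  # hypothetical custom null value for a missing property

# Two customers from the thread plus one made-up customer without a country.
CUSTOMERS = {
    "3F2504E0-4F89-11D3-9A0C-0305E82C3301": {"lastName": "Smith", "country": "FR"},
    "21EC2020-3AEA-1069-A2DD-08002B30309D": {"lastName": "Doe", "country": "EN"},
    "AAAAAAAA-0000-0000-0000-000000000000": {"lastName": "Zed"},
}


def build_row(field, id_first):
    """Composite column names for one property row, in sorted order."""
    cols = []
    for customer_id, props in CUSTOMERS.items():
        value = props.get(field, NULL)
        cols.append((customer_id, value) if id_first else (value, customer_id))
    # sorted() stands in for the column comparator (assumed: plain string order)
    return sorted(cols)


def read_values(field, id_first):
    """Read every value of a field in column order."""
    position = 1 if id_first else 0
    return [col[position] for col in build_row(field, id_first)]


# <customer_id:property_value>: both rows come back in customer-id order,
# so the nth country and the nth lastName belong to the same customer.
print(read_values("country", id_first=True))
print(read_values("lastName", id_first=True))

# <property_value:customer_id>: each row is sorted by its own values,
# so positions no longer line up across rows.
print(read_values("country", id_first=False))
print(read_values("lastName", id_first=False))
```

In this sketch the id-first layout only stays aligned because the null placeholder gives every customer a column in every row. The value-first layout loses that alignment but keeps the columns sorted by value, which is what a reverse index needs.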
