> For CF PropertyValues, instead of <property_value:customer_id> should I do
> <customer_id:property_value> to preserve the same order for each
> property_value ? (there will be custom null value).

Whatever works best for you.
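To make the ordering question concrete, here is a small plain-Python sketch (not Cassandra or Hector code). It assumes composite column names are compared component by component, which Python tuples imitate. The customer IDs, the sample values, and the `NULL_VALUE` marker are made up for illustration.

```python
# Plain-Python stand-in for two wide rows of a PropertyValues-style CF.
# Tuples sort component by component, like composite column names.
NULL_VALUE = "\x00null"  # hypothetical custom null marker

customers = {
    "c1": {"country": "FR", "lastName": "Smith"},
    "c2": {"country": "EN"},  # no lastName stored
    "c3": {"country": "DE", "lastName": "Doe"},
}

def row(prop, id_first):
    """Return the sorted composite column names for one property row."""
    cols = []
    for cid, props in customers.items():
        value = props.get(prop, NULL_VALUE)
        cols.append((cid, value) if id_first else (value, cid))
    return sorted(cols)

# <customer_id:property_value>: every row is ordered by customer ID,
# so the nth column of each row belongs to the same customer.
countries = row("country", id_first=True)
last_names = row("lastName", id_first=True)
aligned = [c[0] for c in countries] == [c[0] for c in last_names]

# <property_value:customer_id>: each row is ordered by its own values,
# so positions no longer line up across rows.
by_value = row("country", id_first=False)
```

In this sketch the ID-first layout only lines up because the missing `lastName` is filled with the null marker. The value-first layout sorts each row by value, which suits a reverse index but not positional alignment.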
> Why is using only column names faster ? It seems that it's not possible to
> retrieve column names without column values in Hector for example, so even
> after reading your article (great by the way), i don't get it.

Not sure what you mean.

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 21/02/2012, at 10:17 AM, alexis coudeyras wrote:

> Thanks a lot Aaron,
>
> I will try your idea tomorrow.
>
> For CF PropertyValues, instead of <property_value:customer_id> should I do
> <customer_id:property_value> to preserve the same order for each
> property_value ? (there will be custom null value).
>
> Why is using only column names faster ? It seems that it's not possible to
> retrieve column names without column values in Hector for example, so even
> after reading your article (great by the way), i don't get it.
>
>
> On 20 Feb 2012, at 20:41, aaron morton wrote:
>
>> If you want to read all possible values for a field, where the field has 1
>> million possible values it's going to take time. No matter what data model
>> you use.
>>
>> That said, the first model I would use is:
>>
>> CF: Customer
>> Use this as a canonical record of the properties a customer has.
>> row_key : <customer_id>
>> cols: <property_name> = <property_value>
>>
>> CF: PropertyValues
>> Use this to build the reverse index. Column names are a composite
>> value of property value and customer ID.
>> row_key: <property_name>
>> cols: <property_value:customer_id> = EMPTY
>>
>> * To Insert: It is good if you can work out the delta. Just update what you
>> need to in the customer, delete the old values from the PropertyValues CF
>> and insert the new ones. Note: I would insert when you get the new data,
>>
>> * To Read:
>>> - I need to retrieve all values of a field (all firstNames, all lastNames,
>> Get all the values from the appropriate row.
>>> - The faster the better (1 to 3 seconds)
>> Things take time http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/
>>> - It must preserve order : if i retrieve all countries and then all
>>> lastName, the nth country and the nth lastName should correspond to the same
>>> customer.
>> Can only be guaranteed if every customer has a value for every field. Or if
>> you use a custom null value.
>>> - Sometimes I will have to retrieve all values of multiple fields (<
>>> 10)
>> There is no provision for server side joins. If you have a query you use
>> often it is best to materialise the result.
>>
>> Hope that helps.
>>
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 20/02/2012, at 11:49 PM, acoudeyras wrote:
>>
>>> Hi,
>>>
>>> I'm new to Cassandra and i'm looking for the best way to handle my use case.
>>>
>>> My entities look like :
>>>
>>> customers : [{
>>>   id: 3F2504E0-4F89-11D3-9A0C-0305E82C3301,
>>>   firstName: "Carl",
>>>   lastName: "Smith",
>>>   country:"FR"
>>> },{
>>>   id:21EC2020-3AEA-1069-A2DD-08002B30309D,
>>>   firstName: "John",
>>>   lastName: "Doe",
>>>   country:"EN"
>>> }]
>>>
>>> I will use the term "field" to describe a property of customer (lastName for
>>> example).
>>>
>>> I will have 1 million customers and more than 300 fields (firstName,
>>> lastName, ...) for each customer.
>>>
>>> I have two requirements :
>>>
>>> - I need to retrieve all values of a field (all firstNames, all lastNames,
>>> ...).
>>> - The faster the better (1 to 3 seconds)
>>> - It must preserve order : if i retrieve all countries and then all
>>> lastName, the nth country and the nth lastName should correspond to the same
>>> customer.
>>> - Sometimes I will have to retrieve all values of multiple fields (<
>>> 10)
>>>
>>> - Data will be updated (insert, delete, update), every 10 or 20 minutes in
>>> bulk, just a small number of entities will change each time.
>>> When an update occurs, in input I have the whole entity (a full customer
>>> with all his fields). Performance is important, but less than in the
>>> previous case (10 seconds for updating is ok).
>>>
>>> - Retrieving a customer by id or retrieving a list of customers with some
>>> specific criteria is *not* a requirement.
>>>
>>> ---
>>> Solution 1:
>>>
>>> Column Family : customers
>>> One row for each customer : 1 million rows
>>> One column for each field : 300 fields by row.
>>>
>>> Benefits : easy to update
>>> Problem : As far as i understand, it doesn't seem to fit with cassandra
>>> model, getting all values will be slow.
>>>
>>> ---
>>> Solution 2:
>>>
>>> Wide Row for the whole entity
>>>
>>> Column Family : datas
>>> One row : customers
>>> Composite Columns : (fieldName, ID) = fieldValue
>>>
>>> Customers : [{
>>>   ("country", "3F2504E0-4F89-11D3-9A0C-0305E82C3301") = "FR",
>>>   ("country", "21EC2020-3AEA-1069-A2DD-08002B30309D") = "EN",
>>>   ("firstName", "3F2504E0-4F89-11D3-9A0C-0305E82C3301") = "Carl",
>>>   ("firstName", "21EC2020-3AEA-1069-A2DD-08002B30309D") = "John",
>>>   ("lastName", "3F2504E0-4F89-11D3-9A0C-0305E82C3301") = "Smith",
>>>   ("lastName", "21EC2020-3AEA-1069-A2DD-08002B30309D") = "Doe",
>>>   ...
>>> }]
>>>
>>>
>>> As far as i understand it seems to be the fastest way to retrieve all values
>>> of a field in the same order.
>>> To update, i don't need to read before writing.
>>>
>>> Problem : the row will be very large : 300 000 000 columns. I can split
>>> it in different rows based on the value of the specific field, for example
>>> country.
>>>
>>> ---
>>> Solution 3:
>>>
>>> Wide Row by field
>>>
>>> Column Family : customers
>>> One row by field : so 300 rows
>>> Columns : ID = FieldValue
>>>
>>> Benefits :
>>> The row will be smaller, 1 000 000 columns.
>>>
>>> Problem :
>>> Update seems more expensive, for every customer to update, i need to update
>>> 300 rows.
>>>
>>> ---
>>>
>>> Which solution seems to be the right one ?
>>> Is Cassandra really a good fit for this use case ?
>>>
>>> Thanks
>>>
>>> Alexis Coudeyras
>>>
>>> --
>>> View this message in context:
>>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Data-Modeling-tp7300846p7300846.html
>>> Sent from the [email protected] mailing list archive at
>>> Nabble.com.
>>
>
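
The two column families and the delta update suggested in the thread can be sketched with plain Python dictionaries. This is only an in-memory illustration under assumptions, not Cassandra or Hector client code: `customer_cf`, `property_values_cf`, `upsert_customer`, and `all_values` are hypothetical names, and the composite column name follows the `<property_value:customer_id>` layout.

```python
from collections import defaultdict

# CF Customer: row key = customer_id, columns = property_name -> value
customer_cf = defaultdict(dict)
# CF PropertyValues: row key = property_name,
# columns = (property_value, customer_id) -> EMPTY
property_values_cf = defaultdict(dict)
EMPTY = b""

def upsert_customer(customer_id, new_props):
    """Apply only the delta between the stored and the incoming entity."""
    old_props = customer_cf[customer_id]
    for name in set(old_props) | set(new_props):
        old, new = old_props.get(name), new_props.get(name)
        if old == new:
            continue  # unchanged property, nothing to write
        if old is not None:
            # remove the stale index column
            del property_values_cf[name][(old, customer_id)]
        if new is not None:
            property_values_cf[name][(new, customer_id)] = EMPTY
    customer_cf[customer_id] = dict(new_props)

def all_values(name):
    """Read one index row: all values of a field, in column-name order."""
    return [value for value, _cid in sorted(property_values_cf[name])]

upsert_customer("id-1", {"firstName": "Carl", "country": "FR"})
upsert_customer("id-2", {"firstName": "John", "country": "EN"})
upsert_customer("id-1", {"firstName": "Carl", "country": "DE"})  # only country changes
```

In this sketch, working out the delta means reading the stored customer before writing, and `all_values` returns the values sorted by value, not by customer.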
