Re: Data Modeling

alexis coudeyras Mon, 20 Feb 2012 13:18:18 -0800

Thanks a lot Aaron,

I will try your idea tomorrow.


For the PropertyValues CF, instead of <property_value:customer_id>, should I use 
<customer_id:property_value> so that every property row keeps the same customer 
order? (There will be a custom null value.)
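
To make the ordering question concrete, here is a minimal Python sketch that models a wide row as a sorted list of composite column names. It is a toy model, not Cassandra code: it assumes plain string comparison for both components, the helper names are hypothetical, and the third customer (with no country) is made up to exercise the custom null value.

```python
# Toy model of a wide row: composite column names kept in sorted order.
# Assumption: both components compare as plain strings.
NULL = ""  # hypothetical custom null value

customers = {
    "3F2504E0-4F89-11D3-9A0C-0305E82C3301": {"firstName": "Carl", "country": "FR"},
    "21EC2020-3AEA-1069-A2DD-08002B30309D": {"firstName": "John", "country": "EN"},
    # Made-up customer with a missing field, stored as NULL.
    "AAAAAAAA-0000-0000-0000-000000000000": {"firstName": "Zoe"},
}


def row(prop, id_first):
    """Return the sorted composite column names for one property row."""
    cols = []
    for cid, fields in customers.items():
        value = fields.get(prop, NULL)
        cols.append((cid, value) if id_first else (value, cid))
    return sorted(cols)


def customer_order(prop, id_first):
    """Return the customer ids in the order the row would be read."""
    idx = 0 if id_first else 1
    return [col[idx] for col in row(prop, id_first)]


# <customer_id:property_value>: every row is sorted by customer id.
same = customer_order("country", True) == customer_order("firstName", True)

# <property_value:customer_id>: every row is sorted by its own values.
differs = customer_order("country", False) != customer_order("firstName", False)
```

In this toy data the id-first layout gives the same customer order for every property, while the value-first layout orders each row by its own values. The value-first layout is what keeps the row sorted by value, so the choice depends on which ordering the reads need.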

Why is using only column names faster? It seems that it's not possible to 
retrieve column names without column values, in Hector for example, so even 
after reading your article (great, by the way), I don't get it.


On 20 Feb 2012, at 20:41, aaron morton wrote:

> If you want to read all possible values for a field, where the field has 1 
> million possible values it's going to take time. No matter what data model 
> you use. 
> 
> That said, the first model I would use is:
> 
> CF: Customer
> Use this as a canonical record of the properties a customer has. 
> row_key : <customer_id>
> cols: <property_name> = <property_value>
> 
> CF: PropertyValues
> Use this to build the reverse index. Column names are a composite 
> value of property value and customer ID.
> row_key: <property_name>
> cols: <property_value:customer_id> = EMPTY
> 
> * To Insert: It is good if you can work out the delta. Just update what you 
> need to in the customer, delete the old values from the PropertyValues CF and 
> insert the new ones. Note: I would insert when you get the new data, 
> 
> * To Read:
>>   - I need to retrieve all values of a field (all firstNames, all lastNames,
> Get all the values from the appropriate row. 
>>      - The faster the better (1 to 3 seconds)
> Things take time http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/
>>      - It must preserve order: if I retrieve all countries and then all
>> lastNames, the nth country and the nth lastName should correspond to the same
>> customer.
> Can only be guaranteed if every customer has a value for every field. Or if 
> you use a custom null value. 
>>      - Sometimes I will have to retrieve all values of multiple fields (< 
>> 10)
> There is no provision for server-side joins. If you have a query you use 
> often, it is best to materialise the result.
> 
> Hope that helps. 
> 
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 20/02/2012, at 11:49 PM, acoudeyras wrote:
> 
>> Hi,
>> 
>> I'm new to Cassandra and I'm looking for the best way to handle my use case.
>> 
>> My entities look like:
>> 
>> customers: [{
>>      id: "3F2504E0-4F89-11D3-9A0C-0305E82C3301",
>>      firstName: "Carl",
>>      lastName: "Smith",
>>      country: "FR"
>> },{
>>      id: "21EC2020-3AEA-1069-A2DD-08002B30309D",
>>      firstName: "John",
>>      lastName: "Doe",
>>      country: "EN"
>> }]
>> 
>> I will use the term "field" to describe a property of customer (lastName for
>> example).
>> 
>> I will have 1 million customers and more than 300 fields (firstName,
>> lastName, ...) for each customer.
>> 
>> I have two requirements:
>> 
>> - I need to retrieve all values of a field (all firstNames, all lastNames,
>> ...).
>>      - The faster the better (1 to 3 seconds)
>>      - It must preserve order: if I retrieve all countries and then all
>> lastNames, the nth country and the nth lastName should correspond to the same
>> customer.
>>      - Sometimes I will have to retrieve all values of multiple fields (< 
>> 10)
>> 
>> - Data will be updated (insert, delete, update) every 10 or 20 minutes in
>> bulk; only a small number of entities will change each time. When an update
>> occurs, the input is the whole entity (a full customer with all its
>> fields). Performance is important, but less so than in the previous case (10
>> seconds for an update is OK).
>> 
>> - Retrieving a customer by id, or retrieving a list of customers matching
>> specific criteria, is *not* a requirement.
>> 
>> ---
>> Solution 1:
>> 
>> Column Family : customers
>> One row for each customer : 1 million rows
>> One column for each field : 300 fields by row.
>> 
>> Benefits : easy to update
>> Problem : As far as I understand, it doesn't seem to fit the Cassandra
>> model; getting all values of a field will be slow.
>> 
>> ---
>> Solution 2:
>> 
>> Wide Row for the whole entity
>> 
>> Column Family : datas
>> One row : customers
>> Composite Columns : (fieldName, ID) = fieldValue
>> 
>> Customers : [{
>>      ("country", "3F2504E0-4F89-11D3-9A0C-0305E82C3301") = "FR",
>>      ("country", "21EC2020-3AEA-1069-A2DD-08002B30309D") = "EN",
>>      ("firstName", "3F2504E0-4F89-11D3-9A0C-0305E82C3301") = "Carl",
>>      ("firstName", "21EC2020-3AEA-1069-A2DD-08002B30309D") = "John",
>>      ("lastName", "3F2504E0-4F89-11D3-9A0C-0305E82C3301") = "Smith",
>>      ("lastName", "21EC2020-3AEA-1069-A2DD-08002B30309D") = "Doe",
>> ...
>> }]
>> 
>> 
>> As far as I understand, this seems to be the fastest way to retrieve all values
>> of a field in the same order.
>> To update, I don't need to read before writing.
>> 
>> Problem : the row will be very large: 300,000,000 columns. I can split
>> it into different rows based on the value of a specific field, for example
>> country.
>> 
>> ---
>> Solution 3:
>> 
>> Wide Row by field 
>> 
>> Column Family : customers
>> One row by field : so 300 rows
>> Columns : ID = FieldValue
>> 
>> Benefits :
>> The rows will be smaller: 1,000,000 columns each.
>> 
>> Problem :
>> Updates seem more expensive: for every customer to update, I need to update
>> 300 rows.
>> 
>> ---
>> 
>> Which solution seems to be the right one? Is Cassandra really a good
>> fit for this use case?
>> 
>> Thanks
>> 
>> Alexis Coudeyras
>> 
>> --
>> View this message in context: 
>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Data-Modeling-tp7300846p7300846.html
>> Sent from the [email protected] mailing list archive at 
>> Nabble.com.
>
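
The two-column-family model and the delta update described in the quoted reply can be sketched in plain Python. This is an in-memory illustration only, not a Cassandra client: the dictionaries stand in for the Customer and PropertyValues CFs, the function names are hypothetical, and it assumes the old record is read from the Customer CF to work out the delta.

```python
# In-memory stand-ins for the two column families (hypothetical names).
# Customer:       customer_id   -> {property_name: property_value}
# PropertyValues: property_name -> set of (property_value, customer_id) column names
customer_cf = {}
property_values_cf = {}


def upsert_customer(customer_id, new_fields):
    """Store the full entity and update only the index entries that changed."""
    old_fields = customer_cf.get(customer_id, {})
    for name in set(old_fields) | set(new_fields):
        old = old_fields.get(name)
        new = new_fields.get(name)
        if old == new:
            continue  # unchanged property: no index write needed
        index_row = property_values_cf.setdefault(name, set())
        if old is not None:
            index_row.discard((old, customer_id))  # delete the old value
        if new is not None:
            index_row.add((new, customer_id))      # insert the new value
    customer_cf[customer_id] = dict(new_fields)


def read_values(name):
    """Return all values of one property, in composite column-name order."""
    return [value for value, _ in sorted(property_values_cf.get(name, set()))]


upsert_customer("c1", {"firstName": "Carl", "country": "FR"})
upsert_customer("c2", {"firstName": "John", "country": "EN"})
upsert_customer("c1", {"firstName": "Carl", "country": "DE"})  # only country changes
```

After the third call only the country row is touched: the ("FR", "c1") entry is removed and ("DE", "c1") is added, while the firstName row is left alone.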
