Re: Data Modeling

aaron morton Tue, 21 Feb 2012 00:08:42 -0800

> For CF PropertyValues, instead of <property_value:customer_id> should I do
> <customer_id:property_value> to preserve the same order for each
> property_value? (There will be a custom null value.)
Whatever works best for you.
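
To make the trade-off concrete, here is a minimal sketch of how the two composite orderings behave. It is an illustration only: it uses plain Python tuples and sorted() as a stand-in for a composite comparator, and the shortened customer IDs, the NULL sentinel and the helper names are all made up for the example.

```python
# Hypothetical sentinel standing in for the "custom null value".
NULL = "\x00null"

# Made-up sample data; IDs are shortened for readability.
customers = {
    "21EC2020": {"firstName": "John", "lastName": "Doe", "country": "EN"},
    "3F2504E0": {"firstName": "Carl", "lastName": "Smith", "country": "FR"},
    "9A0C0305": {"firstName": "Anna", "lastName": "Zed"},  # no country
}
properties = ["firstName", "lastName", "country"]


def build_rows(customer_first):
    """Return {property_name: sorted list of composite column names}."""
    rows = {}
    for prop in properties:
        cols = []
        for cid, record in customers.items():
            value = record.get(prop, NULL)  # fill gaps with the sentinel
            cols.append((cid, value) if customer_first else (value, cid))
        # Tuples sort component by component, which is used here as a
        # stand-in for how composite column names are ordered in a row.
        rows[prop] = sorted(cols)
    return rows


def customer_order(cols, customer_first):
    """Extract the customer IDs in the order the columns would be read."""
    return [c[0] if customer_first else c[1] for c in cols]


value_first = build_rows(customer_first=False)
id_first = build_rows(customer_first=True)

orders_value_first = {p: customer_order(c, False) for p, c in value_first.items()}
orders_id_first = {p: customer_order(c, True) for p, c in id_first.items()}

for prop in properties:
    print(prop, "value first:", orders_value_first[prop])
    print(prop, "id first:   ", orders_id_first[prop])
```

With <customer_id:property_value> every row comes back in the same customer order, provided each customer has a column (real or null sentinel) in every row. The cost is that the columns are no longer sorted by value, so the row stops working as a value-ordered reverse index.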


> Why is using only column names faster? It seems that it's not possible to
> retrieve column names without column values in Hector, for example, so even
> after reading your article (great, by the way), I don't get it.

Not sure what you mean. 

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 21/02/2012, at 10:17 AM, alexis coudeyras wrote:

> Thanks a lot Aaron,
> 
> I will try your idea tomorrow.
> 
> For CF PropertyValues, instead of <property_value:customer_id> should I do
> <customer_id:property_value> to preserve the same order for each
> property_value? (There will be a custom null value.)
> 
> Why is using only column names faster? It seems that it's not possible to
> retrieve column names without column values in Hector, for example, so even
> after reading your article (great, by the way), I don't get it.
> 
> 
> On 20 Feb 2012, at 20:41, aaron morton wrote:
> 
>> If you want to read all possible values for a field, where the field has 1 
>> million possible values it's going to take time. No matter what data model 
>> you use. 
>> 
>> That said, the first model I would use is:
>> 
>> CF: Customer
>> Use this as a canonical record of the properties a customer has. 
>> row_key : <customer_id>
>> cols: <property_name> = <property_value>
>> 
>> CF: PropertyValues
>> Use this to build the reverse index. Column names are a composite
>> value of property value and customer ID.
>> row_key: <property_name>
>> cols: <property_value:customer_id> = EMPTY
>> 
>> * To Insert: It is good if you can work out the delta. Just update what you 
>> need to in the customer, delete the old values from the PropertyValues CF 
>> and insert the new ones. Note: I would insert when you get the new data, 
>> 
>> * To Read:
>>>   - I need to retrieve all values of a field (all firstNames, all lastNames,
>> Get all the values from the appropriate row. 
>>>     - The faster the better (1 to 3 seconds)
>> Things take time http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/
>>>     - It must preserve order: if I retrieve all countries and then all
>>> lastNames, the nth country and the nth lastName should correspond to the same
>>> customer.
>> Can only be guaranteed if every customer has a value for every field. Or if 
>> you use a custom null value. 
>>>     - Sometimes I will have to retrieve all values of multiple fields (<
>>> 10)
>> There is no provision for server-side joins. If you have a query you use
>> often, it is best to materialise the result.
>> 
>> Hope that helps. 
>> 
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 20/02/2012, at 11:49 PM, acoudeyras wrote:
>> 
>>> Hi,
>>> 
>>> I'm new to Cassandra and I'm looking for the best way to handle my use case.
>>> 
>>> My entities look like:
>>> 
>>> customers : [{
>>>     id: 3F2504E0-4F89-11D3-9A0C-0305E82C3301,
>>>     firstName: "Carl",
>>>     lastName: "Smith",
>>>     country:"FR"
>>> },{
>>>     id:21EC2020-3AEA-1069-A2DD-08002B30309D,
>>>     firstName: "John",
>>>     lastName: "Doe",
>>>     country:"EN"
>>> }]
>>> 
>>> I will use the term "field" to describe a property of a customer (lastName,
>>> for example).
>>> 
>>> I will have 1 million customers and more than 300 fields (firstName,
>>> lastName, ...) for each customer.
>>> 
>>> I have two requirements :
>>> 
>>> - I need to retrieve all values of a field (all firstNames, all lastNames,
>>> ...).
>>>     - The faster the better (1 to 3 seconds)
>>>     - It must preserve order: if I retrieve all countries and then all
>>> lastNames, the nth country and the nth lastName should correspond to the same
>>> customer.
>>>     - Sometimes I will have to retrieve all values of multiple fields (<
>>> 10)
>>> 
>>> - Data will be updated (insert, delete, update) every 10 or 20 minutes in
>>> bulk; only a small number of entities will change each time. When an update
>>> occurs, the input is the whole entity (a full customer with all its
>>> fields). Performance is important, but less so than in the previous case (10
>>> seconds for an update is OK).
>>> 
>>> - Retrieving a customer by id or retrieving a list of customers matching
>>> specific criteria is *not* a requirement.
>>> 
>>> ---
>>> Solution 1:
>>> 
>>> Column Family : customers
>>> One row for each customer : 1 million rows
>>> One column for each field : 300 fields by row.
>>> 
>>> Benefits : easy to update
>>> Problem : As far as I understand, it doesn't seem to fit the Cassandra
>>> model; getting all values will be slow.
>>> 
>>> ---
>>> Solution 2:
>>> 
>>> Wide Row for the whole entity
>>> 
>>> Column Family : datas
>>> One row : customers
>>> Composite Columns : (fieldName, ID) = fieldValue
>>> 
>>> Customers : [{
>>>     ("country", "3F2504E0-4F89-11D3-9A0C-0305E82C3301") = "FR",
>>>     ("country", "21EC2020-3AEA-1069-A2DD-08002B30309D") = "EN",
>>>     ("firstName", "3F2504E0-4F89-11D3-9A0C-0305E82C3301") = "Carl",
>>>     ("firstName", "21EC2020-3AEA-1069-A2DD-08002B30309D") = "John",
>>>     ("lastName", "3F2504E0-4F89-11D3-9A0C-0305E82C3301") = "Smith",
>>>     ("lastName", "21EC2020-3AEA-1069-A2DD-08002B30309D") = "Doe",
>>> ...
>>> }]
>>> 
>>> 
>>> As far as I understand, this seems to be the fastest way to retrieve all
>>> values of a field in the same order.
>>> To update, I don't need to read before writing.
>>> 
>>> Problem : the row will be very large: 300 000 000 columns. I can split
>>> it into different rows based on the value of a specific field, for example
>>> country.
>>> 
>>> ---
>>> Solution 3:
>>> 
>>> Wide Row by field 
>>> 
>>> Column Family : customers
>>> One row by field : so 300 rows
>>> Columns : ID = FieldValue
>>> 
>>> Benefits :
>>> The row will be smaller: 1 000 000 columns.
>>> 
>>> Problem :
>>> Updates seem more expensive: for every customer to update, I need to update
>>> 300 rows.
>>> 
>>> ---
>>> 
>>> Which solution seems to be the right one? Is Cassandra really a good
>>> fit for this use case?
>>> 
>>> Thanks
>>> 
>>> Alexis Coudeyras
>>> 
>>> --
>>> View this message in context: 
>>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Data-Modeling-tp7300846p7300846.html
>>> Sent from the [email protected] mailing list archive at 
>>> Nabble.com.
>> 
>
