Hello,
Please advise on encoding data for the following clustering problem.
I have a dataset with car usage info. The dataset has the following fields:
1. Car model (Toyota Celica, BMW, Nissan X-Trail, Mazda Cosmo, etc.)
2. Year built
3. Country where the car runs
4. Distance run by car before major repairs
Important: the dataset is sparse.
For a given car, "Distance" is usually known for only some of the countries.
Problem:
For a given car, predict the "Distance" it will run before major repairs in a
country for which "Distance" is unknown.
My approach:
I want to represent each record in the dataset as a sparse vector with the
following components:
1. Binary (1/0) car model components. The number of these components equals
the number of distinct models in the dataset.
2. Binary (1/0) country components. The number of these components equals
the number of distinct countries in the dataset.
3. Distance. A single integer component equal to the distance run by the car.
Next, I want to cluster these vectors with k-means and analyze the resulting groups.
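To make the encoding concrete, here is a minimal sketch in plain Python of the vector layout I have in mind. The sample records, the distance values, and the helper names (build_vocab, encode) are invented for illustration only; they are not from my real dataset. "Year built" is left out because it is not part of the vector described above.

```python
# Hypothetical sample records: (model, year, country, distance).
# All values are made up for illustration.
records = [
    ("Toyota Celica", 1999, "Japan", 180000),
    ("BMW", 2005, "Germany", 220000),
    ("Nissan X-Trail", 2008, "Japan", 150000),
]


def build_vocab(values):
    """Map each distinct value to a column index, in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


models = build_vocab(r[0] for r in records)
countries = build_vocab(r[2] for r in records)


def encode(record):
    """Return [model flags | country flags | distance] as a flat list."""
    model, _year, country, distance = record
    vec = [0] * (len(models) + len(countries) + 1)
    vec[models[model]] = 1                     # one flag per model
    vec[len(models) + countries[country]] = 1  # one flag per country
    vec[-1] = distance                         # raw, unscaled distance
    return vec


vectors = [encode(r) for r in records]
for v in vectors:
    print(v)
```

Each vector has exactly two binary components set to 1 (one model, one country), followed by the raw distance as the last component.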
Questions:
1) My vectors mix components of different kinds: binary (model, country)
and continuous (distance). How should I calculate the distance between such
vectors? Would cosine similarity be appropriate?
2) Are there other ways to encode components with a finite set of values
(model, country) so that they work well with continuous components (such as
distance)?
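To illustrate the concern behind question 1, here is a small sketch with invented numbers. It compares cosine similarity on two raw vectors with the same vectors after dividing the distance component by the largest distance. The rescaling is only an assumption made for this comparison, not something I have settled on, and the function name cosine_similarity is my own.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Layout: [3 model flags | 2 country flags | distance]; values are invented.
a = [0, 0, 1, 0, 1, 180000]
b = [1, 0, 0, 1, 0, 220000]

# Raw vectors: the distance component is so large that the binary
# components have almost no effect on the result.
raw = cosine_similarity(a, b)

# Same vectors with distance divided by the largest distance, so that all
# components lie in [0, 1] (an assumption made only for this comparison).
max_distance = max(a[-1], b[-1])
a_scaled = a[:-1] + [a[-1] / max_distance]
b_scaled = b[:-1] + [b[-1] / max_distance]
scaled = cosine_similarity(a_scaled, b_scaled)

print("raw:", raw)
print("scaled:", scaled)
```

On the raw vectors the two cars look nearly identical even though they share neither model nor country, which is why I am unsure how to combine the components.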
Thanks!
Anton