Hi All,

We are using Apache Pig to build our data pipeline. Our data is laid out as
follows:
userid, age, items {code 1, code 2, ….}, few other features...
Each item has a unique alphanumeric code. I would like to use Mahout to
cluster this data. Based on my reading so far, I see the following options:
1. Map each alphanumeric item code to a numeric code (AAAAA1 -> 0, AAAAA2 -> 1,
AAAAA3 -> 2, etc.), run the clustering algorithm on the reformatted data, and
then map the results back onto the real item codes.
2. Represent the item codes as a 1 x M matrix in which each column represents
an item (1 if the user has viewed that item, 0 otherwise). This matrix will
have millions of columns, so each user will have an id, an age, and this
matrix. I am not sure whether this will work.
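
To make the two options concrete, here is a minimal sketch in plain Python (not
Pig or Mahout code). The sample records and the helper names build_item_index
and to_sparse_row are hypothetical, made up for illustration. It assigns each
item code a column index as in option 1, then stores each user's 1 x M row
sparsely by keeping only the indices of the columns that are 1, so the
millions of zero columns in option 2 are never materialized.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical sample records: (userid, age, item codes viewed).
Record = Tuple[str, int, Sequence[str]]

RECORDS: List[Record] = [
    ("u1", 34, ["AAAAA1", "AAAAA3"]),
    ("u2", 27, ["AAAAA2"]),
    ("u3", 45, ["AAAAA3", "AAAAA1", "AAAAA2"]),
]


def build_item_index(records: Sequence[Record]) -> Dict[str, int]:
    """Option 1: assign every distinct item code a numeric column index."""
    codes = sorted({code for _, _, items in records for code in items})
    return {code: idx for idx, code in enumerate(codes)}


def to_sparse_row(items: Sequence[str], index: Dict[str, int]) -> List[int]:
    """Option 2, stored sparsely: sorted indices of the columns equal to 1."""
    return sorted({index[code] for code in items})


item_index = build_item_index(RECORDS)

# Reverse lookup, used to map results back onto the real item codes.
reverse_index = {idx: code for code, idx in item_index.items()}

sparse_rows = {
    userid: to_sparse_row(items, item_index) for userid, _, items in RECORDS
}

for userid, row in sparse_rows.items():
    print(userid, row, [reverse_index[i] for i in row])
```

With this layout the numeric codes act only as column positions, not as
ordered values, so a distance measure would compare which columns two users
share rather than how close their code numbers are.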
We also want to do frequent pattern mining etc. on the same data. Any thoughts
on data representation and clustering would be great.

                                          
