I would probably write a script to parse that out and stream to it from Pig.
http://pig.apache.org/docs/r0.11.0/basic.html#stream

On Mon, Dec 2, 2013 at 4:30 PM, Sameer Tilak <ssti...@live.com> wrote:

> I am looking for some input on how to vectorize my data.
>
> > From: ssti...@live.com
> > To: user@mahout.apache.org
> > Subject: Mahout for clustering
> > Date: Mon, 2 Dec 2013 16:22:03 -0800
> >
> > Hi All,
> >
> > We are using Apache Pig for building our data pipeline. We have
> > data in the following fashion:
> >
> > userid, age, items {code 1, code 2, ….}, few other features...
> >
> > Each item has a unique alphanumeric code. I would like to use Mahout
> > for clustering it. Based on my current reading I see the following
> > options:
> >
> > 1. Map each alphanumeric item code to a numeric code -- AAAAA1 -> 0,
> >    AAAAA2 -> 1, AAAAA3 -> 2 etc. Then run the clustering algorithm on
> >    the reformatted data and then map the results back onto the real
> >    item codes.
> > 2. Represent info on item codes as a 1 x M matrix where a column
> >    represents an item (1 if a given user has viewed a particular item,
> >    0 otherwise); it will have millions of columns. So each user will
> >    have id, age, and this matrix. Not sure if this will work…..
> >
> > We also want to do frequency pattern mining etc. on the same data. Any
> > thoughts on data representation and clustering will be great.
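
For what it's worth, here is a rough sketch of the kind of script I mean at the top of this reply. It combines the two options from the original question: each item code gets an integer column index (option 1), and each user is emitted as a sparse list of "column:1" pairs rather than a dense row with millions of columns (option 2). Everything about the layout is an assumption on my part: tab-separated input with userid, age, and a comma-separated item list in the first three fields, and an output format I made up for illustration. The names vectorize, parse_record, to_sparse and main are hypothetical too.

```python
import sys


def parse_record(line):
    """Split one tab-separated record into (userid, age, item_codes).

    Assumed layout: userid <TAB> age <TAB> items, where items is a
    comma-separated list that may be wrapped in braces/parentheses.
    """
    userid, age, items = line.rstrip("\n").split("\t")[:3]
    codes = [c.strip(" (){}") for c in items.split(",")]
    return userid, int(age), [c for c in codes if c]


def to_sparse(codes, index):
    """Return sorted column indices for codes, growing index for new codes."""
    return sorted({index.setdefault(code, len(index)) for code in codes})


def vectorize(lines, index=None):
    """Yield 'userid <TAB> age <TAB> col:1 col:1 ...' for each input line."""
    index = {} if index is None else index
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        userid, age, codes = parse_record(line)
        cols = to_sparse(codes, index)
        yield "\t".join([userid, str(age), " ".join(f"{c}:1" for c in cols)])


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Filter entry point: read records on stdin, write vectors on stdout."""
    for out in vectorize(stdin):
        stdout.write(out + "\n")
```

Calling main() at the bottom of the file turns this into a stdin/stdout filter that can be hooked up through Pig's STREAM ... THROUGH operator. One caveat: the dictionary here is built inside the script, so it is local to whichever task runs it. On a real cluster the code-to-index mapping would need to be built once up front and passed in (the index argument is there for that), otherwise different tasks will assign different columns to the same item code.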