I would probably write a script to parse that out and stream to it from Pig.

http://pig.apache.org/docs/r0.11.0/basic.html#stream
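
To make that concrete, here is a rough sketch of what such a script could look like. It is only an illustration: the function names and the sample record are hypothetical, and it assumes Pig hands each tuple to the script as one tab-delimited line with the items bag serialized as text such as {(AAAAA1),(AAAAA2)}. Check what your Pig version actually writes before relying on that.

```python
import re

# Assumption: item codes are purely alphanumeric, as described in the thread.
ITEM_CODE = re.compile(r"[A-Za-z0-9]+")


def parse_line(line):
    """Split one tab-delimited record into (userid, age, [item codes])."""
    userid, age, items = line.rstrip("\n").split("\t")[:3]
    # Pull every alphanumeric token out of the serialized bag text.
    return userid, int(age), ITEM_CODE.findall(items)


def transform(lines):
    """Yield one flat output row per input record: userid, age, codes."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        userid, age, codes = parse_line(line)
        yield "%s\t%d\t%s\n" % (userid, age, " ".join(codes))


# In the real streaming script the last line would be something like:
#     sys.stdout.writelines(transform(sys.stdin))
# Hypothetical sample record, used here only to show the shape of the output.
sample = ["u1\t34\t{(AAAAA1),(AAAAA2)}\n"]
output = list(transform(sample))
```

On the Pig side, the script would be registered with DEFINE (using SHIP to send the file to the cluster) and applied with STREAM ... THROUGH, as described at the link above.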


On Mon, Dec 2, 2013 at 4:30 PM, Sameer Tilak <ssti...@live.com> wrote:

> I am looking for some input on how to vectorize my data.
>
> > From: ssti...@live.com
> > To: user@mahout.apache.org
> > Subject: Mahout for clustering
> > Date: Mon, 2 Dec 2013 16:22:03 -0800
> >
> >
> >
> >
> > Hi All,
> >
> > We are using Apache Pig to build our data pipeline. Our data has the
> > following form:
> >
> > userid, age, items {code 1, code 2, ….}, few other features...
> >
> > Each item has a unique alphanumeric code. I would like to use Mahout to
> > cluster it. Based on my reading so far, I see the following options:
> >
> > 1. Map each alphanumeric item code to a numeric code (AAAAA1 -> 0,
> >    AAAAA2 -> 1, AAAAA3 -> 2, etc.), run the clustering algorithm on the
> >    reformatted data, and then map the results back onto the real item
> >    codes.
> > 2. Represent the item codes as a 1 x M matrix in which each column
> >    represents an item (1 if a given user has viewed that item, 0
> >    otherwise). It will have millions of columns, so each user will have
> >    an id, an age, and this matrix. Not sure if this will work.
> >
> > We also want to do frequent pattern mining etc. on the same data. Any
> > thoughts on data representation and clustering would be great.
> >
> >
>
>
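
For what it is worth, the two options quoted above can be combined: number the item codes once (option 1), then store each user's 1 x M row sparsely by keeping only the column numbers that are 1 (option 2), so millions of columns need not be materialized. The sketch below is a minimal illustration of that idea in plain Python, not Mahout code; the function names and the two sample users are made up.

```python
def build_index(users):
    """Assign each distinct item code a column number in first-seen order."""
    index = {}
    for _userid, _age, codes in users:
        for code in codes:
            index.setdefault(code, len(index))
    return index


def to_sparse(codes, index):
    """Return the sorted column numbers that are 1 for this user."""
    return sorted({index[code] for code in codes})


# Hypothetical sample data in the (userid, age, item codes) shape.
users = [
    ("u1", 34, ["AAAAA1", "AAAAA2"]),
    ("u2", 27, ["AAAAA2", "AAAAA3"]),
]

index = build_index(users)
rows = {userid: to_sparse(codes, index) for userid, _age, codes in users}

# Keep the reverse mapping so cluster output can be translated back
# into the real item codes afterwards.
reverse = {col: code for code, col in index.items()}
```

The same dictionary can be reused for the pattern mining step, since both only need a stable code-to-number mapping.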
