Well, the similarity between data points should be computed from the
following variables: meteo, manifestation, day of the week, month of the
year, and vacation.
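
To make that a bit more concrete, here is a minimal sketch in plain Python
(not Mahout code) of one way these five variables could be turned into a
numeric vector and compared. It assumes that meteo, manifestation and
vacation are unordered category codes starting at 1, and that day of the
week and month are cyclic. The category counts (N_METEO, N_MANIFESTATION,
N_VACATION) and all function names are made up for the illustration and
would have to be adapted to the real data.

```python
import math

# Assumed sizes of the category code ranges (hypothetical values).
N_METEO = 4
N_MANIFESTATION = 6
N_VACATION = 3


def one_hot(code, size):
    """Encode a 1-based category code as a 0/1 list of length `size`."""
    if not 1 <= code <= size:
        raise ValueError(f"code {code} outside 1..{size}")
    return [1.0 if i == code - 1 else 0.0 for i in range(size)]


def cyclic(value, period):
    """Place a 1-based periodic value on the unit circle.

    This makes the last and the first value of the cycle neighbours.
    """
    angle = 2.0 * math.pi * (value - 1) / period
    return [math.cos(angle), math.sin(angle)]


def to_vector(meteo, manifestation, week_day, month, vacation):
    """Build one feature vector from the five similarity variables."""
    return (one_hot(meteo, N_METEO)
            + one_hot(manifestation, N_MANIFESTATION)
            + cyclic(week_day, 7)
            + cyclic(month, 12)
            + one_hot(vacation, N_VACATION))


def distance(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

With this encoding, December and January end up as close to each other as
January and February, which a plain numeric month column would not give.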


2013/12/3 Ted Dunning <ted.dunn...@gmail.com>

> The key first question is how you plan to compute similarity between data
> points.  It isn't clear how you should do this with your data.
>
>
>
>
> On Mon, Dec 2, 2013 at 1:31 AM, Angelo Immediata <angelo...@gmail.com> wrote:
>
> > Hi
> >
> > I'm pretty new to machine learning and above all to Apache Mahout, so
> > pardon my low-level questions.
> >
> > I need to do some cluster analysis on some data. At the beginning this
> > data will not be very large, but over time it can become really huge (I
> > did some calculations, and after 1 year it could be around 37 billion
> > records). Since I have this huge amount of data, I decided to do the
> > cluster analysis with Mahout on top of Apache Hadoop and its HDFS. To
> > store this large amount of data, I decided to use Apache HBase, also on
> > top of Apache Hadoop HDFS.
> >
> > Now I need to do this cluster analysis by considering some environment
> > variables. These variables may be the following:
> >
> >    - *recordId* = id of the record
> >    - *arcId* = id of the arc between 2 points of my "street graph"
> >    - *mediumVelocity* = medium velocity of the considered arc in the
> >    specified
> >    - *vehiclesNumber* = number of the monitored vehicles in order to get
> >    that velocity
> >    - *meteo* = weather condition (a numeric representing if there is sun,
> >    rain, etc.)
> >    - *manifestation* = a numeric representing if there is any kind of
> >    manifestation (sport manifestation or other)
> >    - *day of the week*
> >    - *month of the year*
> >    - *hour of the day*
> >    - *vacation* = a numeric representing if it's a vacation day or a
> >    working day
> >
> > So my data is formatted as follows (raw representation):
> >
> > recordId  arcId  mediumVelocity  vehiclesNumber  meteo  manifestation  weekDay  yearMonth  dayHour  vacation
> > 1         1      34.5            20              1      3              4        2011       10       3
> > 2         156    66.5            3               2      5              1        2008       6        2
> >
> > As far as I know, in order to do the cluster analysis in Mahout I need to
> > convert my data to the Mahout format (that is, a SequenceFile). The
> > question is: how can I convert my data, represented as in the table
> > above, into a SequenceFile? I tried to find something, but I was not able
> > to find any good sample. Any suggestion would be really appreciated.
> >
> > Thank you,
> > Angelo
> >
>
