Well, the similarity between data points should be computed from the following variables: meteo, manifestation, day of the week, month of the year, and vacation.
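To make that a bit more concrete, here is a rough sketch of what I have in mind. It is plain Python used only to pin down the idea, not Mahout code. I am assuming that meteo, manifestation and vacation are nominal codes (so two values either match or they don't), that day of the week runs 1..7 and month 1..12, and that the calendar fields should wrap around (Sunday is close to Monday, December is close to January). The names `Record`, `nominal`, `cyclic` and `dissimilarity` are made up for this example, and averaging the five parts with equal weight (in the spirit of Gower's coefficient for mixed data) is just one possible choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    meteo: int          # nominal weather code (assumed to have no ordering)
    manifestation: int  # nominal event code (assumed to have no ordering)
    week_day: int       # 1..7 (assumed range)
    month: int          # 1..12 (assumed range)
    vacation: int       # nominal code: vacation day or working day


def nominal(a: int, b: int) -> float:
    """0.0 when the two codes match, 1.0 otherwise."""
    return 0.0 if a == b else 1.0


def cyclic(a: int, b: int, period: int) -> float:
    """Distance around a circle of `period` steps, scaled to [0, 1]."""
    d = abs(a - b) % period
    # The farthest two points can be apart on the circle is period // 2 steps.
    return min(d, period - d) / (period // 2)


def dissimilarity(x: Record, y: Record) -> float:
    """Equal-weight average of the five per-variable dissimilarities."""
    parts = [
        nominal(x.meteo, y.meteo),
        nominal(x.manifestation, y.manifestation),
        cyclic(x.week_day, y.week_day, 7),
        cyclic(x.month, y.month, 12),
        nominal(x.vacation, y.vacation),
    ]
    return sum(parts) / len(parts)


if __name__ == "__main__":
    # Hypothetical values, not taken from real data.
    a = Record(meteo=1, manifestation=3, week_day=4, month=10, vacation=3)
    b = Record(meteo=2, manifestation=5, week_day=1, month=6, vacation=2)
    print(dissimilarity(a, b))
```

The result is 0 for identical records and 1 for records that differ as much as possible in every variable. If some variables should count more than others, the plain average could be replaced by a weighted one.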

2013/12/3 Ted Dunning <ted.dunn...@gmail.com>

> The key first question is how you plan to compute similarity between data
> points. It isn't clear how you should do this with your data.
>
> On Mon, Dec 2, 2013 at 1:31 AM, Angelo Immediata <angelo...@gmail.com> wrote:
>
> > Hi
> >
> > I'm pretty new to machine learning and above all to Apache Mahout, so
> > pardon my low-level questions.
> >
> > I need to do some cluster analysis on some data. At the beginning this
> > data will not be very large, but over time it can become really huge
> > (by my calculation, after 1 year it could be around 37 billion records).
> > Since the data is this large, I decided to do the cluster analysis with
> > Mahout on top of Apache Hadoop and its HDFS. To store the data I decided
> > to use Apache HBase, also on top of HDFS.
> >
> > Now I need to do this cluster analysis taking some environment variables
> > into account. These variables may be the following:
> >
> > - *recordId* = id of the record
> > - *arcId* = id of the arc between 2 points of my "street graph"
> > - *mediumVelocity* = medium velocity of the considered arc in the
> >   specified
> > - *vehiclesNumber* = number of vehicles monitored in order to get that
> >   velocity
> > - *meteo* = weather condition (a number representing whether there is
> >   sun, rain, etc.)
> > - *manifestation* = a number representing whether there is any kind of
> >   manifestation (a sports event or other)
> > - *day of the week*
> > - *month of the year*
> > - *hour of the day*
> > - *vacation* = a number representing whether it's a vacation day or a
> >   working day
> >
> > So my data is formatted like this (raw representation):
> >
> > recordId  arcId  mediumVelocity  vehiclesNumber  meteo  manifestation  weekDay  yearMonth  dayHour  vacation
> > 1         1      34.5            20              1      3              4        2011       10       3
> > 2         156    66.5            3               2      5              1        2008       6        2
> >
> > As far as I know, in order to do the cluster analysis in Mahout I need
> > to put my data in Mahout's format (that is, in a SequenceFile). The
> > question is: how can I convert the data in the table above into a
> > SequenceFile? I tried to find something but I was not able to find any
> > good sample. Any suggestion would be really appreciated.
> >
> > Thank you
> > Angelo