Re: Write SequenceFile from custom data

2013-12-04 Thread Angelo Immediata
I was thinking of using org.apache.hadoop.mapred.join.TupleWritable to
implement my clustering. In your opinion, is this the right choice?
If not, how could I implement my scenario?

Thank you
Angelo


2013/12/3 Angelo Immediata angelo...@gmail.com

 Well, the similarity between data points should be calculated from the
 following variables: meteo, manifestation, day of the week, month of the
 year and vacation.


 2013/12/3 Ted Dunning ted.dunn...@gmail.com

 The key first question is how you plan to compute similarity between data
 points.  It isn't clear how you should do this with your data.




 On Mon, Dec 2, 2013 at 1:31 AM, Angelo Immediata angelo...@gmail.com
 wrote:

  Hi
 
  I'm fairly new to machine learning and, above all, to Apache Mahout, so
  please pardon my basic questions.
 
  I need to run a cluster analysis on some data. At the beginning the data
  set will not be very large, but over time it can become really huge (by
  my calculation, after 1 year it can reach around 37 billion records).
  Because of this volume, I decided to run the cluster analysis with Mahout
  on top of Apache Hadoop and its HDFS. To store this large amount of data
  I decided to use Apache HBase, also on top of Apache Hadoop HDFS.
 
  Now I need to run this cluster analysis taking some environment
  variables into account. These variables may be the following:
 
 - *recordId* = id of the record
 - *arcId* = id of the arc between 2 points of my street graph
 - *mediumVelocity* = average velocity of the considered arc in the
 specified
 - *vehiclesNumber* = number of vehicles monitored to obtain that velocity
 - *meteo* = weather condition (a numeric code indicating sun, rain, etc.)
 - *manifestation* = a numeric code indicating whether there is an event
 of any kind (a sporting event or other)
 - *day of the week*
 - *month of the year*
 - *hour of the day*
 - *vacation* = a numeric code indicating whether it is a vacation day or
 a working day
 
  So my data are formatted as follows (raw representation):
 
  *recordId arcId mediumVelocity vehiclesNumber meteo manifestation weekDay yearMonth dayHour vacation*
  1 1  34.5201  34 2011   10  3
  2 15666.53 2  51 20086  2
 
  As far as I know, to run the cluster analysis in Mahout I need to
  convert my data to the format Mahout expects (that is, a SequenceFile).
  The question is: how can I write the data in the table above to a
  SequenceFile? I tried to find an example but could not find a good one.
  Any suggestion would be really appreciated.
 
  Thank you,
  Angelo
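
One possible first step toward the SequenceFile, sketched under stated assumptions: turn each whitespace-separated record into a key (recordId) and a numeric feature array. The class name TrafficRecordParser, the choice to leave recordId and arcId out of the feature array, and the sample line in the usage note are all hypothetical and not taken from the thread. Mahout's clustering jobs read SequenceFiles whose values are VectorWritable, so each pair would then typically be wrapped (for example a Text key and a VectorWritable around a DenseVector) and appended with Hadoop's SequenceFile.Writer; that step needs the Hadoop and Mahout jars and is not shown here.

```java
/** Hypothetical helper: parses one raw record line into a key and a feature array. */
final class TrafficRecordParser {
    /** recordId arcId mediumVelocity vehiclesNumber meteo manifestation weekDay yearMonth dayHour vacation */
    static final int FIELD_COUNT = 10;

    private TrafficRecordParser() {}

    /** Splits a line on runs of whitespace and checks that all ten fields are present. */
    private static String[] tokens(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != FIELD_COUNT) {
            throw new IllegalArgumentException(
                "expected " + FIELD_COUNT + " fields but found " + parts.length);
        }
        return parts;
    }

    /** Returns the first field (recordId), intended as the SequenceFile key. */
    static long recordId(String line) {
        return Long.parseLong(tokens(line)[0]);
    }

    /**
     * Returns the eight fields after recordId and arcId as doubles.
     * Assumption: the two identifiers are not meaningful as numeric features.
     */
    static double[] features(String line) {
        String[] parts = tokens(line);
        double[] out = new double[FIELD_COUNT - 2];
        for (int i = 2; i < FIELD_COUNT; i++) {
            out[i - 2] = Double.parseDouble(parts[i]);
        }
        return out;
    }
}
```

For a made-up line such as "7 12 40.0 3 1 0 2 11 8 0", recordId returns 7 and features returns an array of eight values starting with 40.0.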
 





Re: Write SequenceFile from custom data

2013-12-02 Thread Angelo Immediata
Well, the similarity between data points should be calculated from the
following variables: meteo, manifestation, day of the week, month of the
year and vacation.
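
To make the similarity question concrete, here is a minimal sketch of one possible measure over those five variables: simple matching, that is, the fraction of fields on which two records carry the same code. This is an illustration only, not something proposed in the thread; the class name CategoricalSimilarity and the field order are assumptions. It treats every variable as an unordered category, which may not suit day of the week or month if neighbouring values should count as close.

```java
/** Hypothetical helper: simple-matching similarity over categorical codes. */
final class CategoricalSimilarity {
    private CategoricalSimilarity() {}

    /**
     * Returns the fraction of positions at which both records hold the same code,
     * a value between 0 and 1. Both arrays must list the fields in the same order,
     * for example {meteo, manifestation, weekDay, yearMonth, vacation}.
     */
    static double similarity(int[] a, int[] b) {
        if (a.length == 0 || a.length != b.length) {
            throw new IllegalArgumentException(
                "records must be non-empty and of equal length");
        }
        int matches = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) {
                matches++;
            }
        }
        return (double) matches / a.length;
    }

    /** Distance form of the same measure: 0 for identical records, 1 for no matches. */
    static double distance(int[] a, int[] b) {
        return 1.0 - similarity(a, b);
    }
}
```

Two records that agree on meteo, manifestation and month but differ on day of the week and vacation would get a similarity of 3/5.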

