Hi Hadoop users,

I am a beginner trying to write an M/R job to compute the similarity between
a set of vectors. To start with, I have a set of vectors stored in .txt
files on HDFS.

   1. We generate Cartesian product pairs of vectors with a Python script and
   give them as input to the MR job.
   2. Each mapper task then gets a (Path1, Path2) pair and computes the
   similarity. *[1]*
      1. To do this, the mapper reads the file at Path1 and the file at
      Path2 using the HDFS API. So each file is read many times because of
      the pairwise calculation.
      2. As per our analysis, most of the mapper's time is spent on IO (2.5
      secs for 4 file reads).
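
To illustrate steps 1 and 2.1, here is a minimal Python sketch. It is not our
actual script: the function names and the example paths are made up, and I am
assuming ordered pairs that include self-pairs (a plain Cartesian product). It
shows that with n files, each file ends up being opened 2n times across all
pairs.

```python
from collections import Counter
from itertools import product


def cartesian_pairs(paths):
    """Return an iterator over every ordered (path1, path2) pair, self-pairs included."""
    return product(paths, repeat=2)


def reads_per_file(paths):
    """Count how often each file is opened if every pair reads both of its files."""
    counts = Counter()
    for path1, path2 in cartesian_pairs(paths):
        counts[path1] += 1
        counts[path2] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical HDFS paths, for illustration only.
    paths = ["/vectors/a.txt", "/vectors/b.txt", "/vectors/c.txt"]

    # One tab-separated pair per line, i.e. one input record per mapper call.
    for path1, path2 in cartesian_pairs(paths):
        print(f"{path1}\t{path2}")

    # Each file is opened 2 * len(paths) times in total.
    print(reads_per_file(paths))
```

So the number of file reads grows as 2n^2 even though there are only n
distinct files, which is where the IO time goes.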

So as of now we are not able to process a large number of vectors because of
the IO bottleneck.

From what I have read online *[2]*, it seems we should create a
sequence file and use it in the MR job instead of passing paths. This seems
like a good solution, but I am looking for a quick fix for now.

Can someone suggest what we can do to reduce IO time in the mapper tasks?


*[1]*
https://github.com/smadha/pooled_time_series/blob/master/src/main/java/gov/nasa/jpl/memex/pooledtimeseries/MeanChiSquareDistanceCalculation.java#L69
*[2]* http://blog.cloudera.com/blog/2009/02/the-small-files-problem/

--
Madhav Sharan
