I don't see anything obviously canopy-related in the logs. Canopy serializes the vectors but the storage representation should not be too inefficient.

If T1 and T2 are too small relative to the distances you actually observe between documents, you will get a LOT of canopies, potentially one per document. How many did you get in your run? Even so, for 1000 vectors of 100 terms each, something unusual does seem to be going on. I've run canopy (on a 12-node cluster) with millions of 30-element DenseVector input points and not seen these sorts of numbers. It is possible you are thrashing your RAM. Have you thought about getting an EC2 instance or two? I think we are currently OK with Elastic MapReduce too, but I have not tried that yet.
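
To make the T1/T2 point concrete, here is a minimal single-machine sketch of canopy assignment in Python. It is an illustration only, not Mahout's code: `count_canopies` and the sample points are hypothetical, and plain Euclidean distance stands in for whatever measure is configured. A point founds a new canopy unless it lies within T2 of an existing center, so when T2 is smaller than the typical distance between documents, every point becomes its own canopy and every later point is compared against all of them.

```python
import math

def count_canopies(points, t1, t2):
    """One pass over the points.

    Returns (number of canopies, number of distance computations).
    """
    centers = []   # the point that founded each canopy
    members = []   # members[i] holds the points within t1 of centers[i]
    comparisons = 0
    for p in points:
        strongly_bound = False
        for i, c in enumerate(centers):
            d = math.dist(p, c)   # Euclidean distance (Python 3.8+)
            comparisons += 1
            if d < t1:
                members[i].append(p)
            if d < t2:
                strongly_bound = True
        if not strongly_bound:
            # Not within t2 of any existing center: start a new canopy.
            centers.append(p)
            members.append([p])
    return len(centers), comparisons

# Hypothetical data: 1000 points spaced 100 units apart on a line.
points = [(100.0 * i, 0.0) for i in range(1000)]
few = count_canopies(points, t1=1e6, t2=5e5)   # thresholds wider than the data
many = count_canopies(points, t1=80, t2=55)    # thresholds narrower than the spacing
```

With these made-up points, thresholds of 80 and 55 give 1000 canopies and 499,500 distance computations, while thresholds wider than the whole data set give a single canopy and 999 computations. Whether 80 and 55 are "too small" for the real document vectors depends on the distances actually observed between them.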

I would not expect the reducer to start until all the mappers are done.

I'm back stateside Wednesday from Oz and will be able to take a look later in the week. I also notice canopy still has the combiner problem we fixed in kMeans and won't work if the combiner does not run. It's darned unfortunate there isn't an option to require the combiner. More to think about...

Jeff


Shashikant Kore wrote:
On Wed, May 6, 2009 at 6:45 AM, Grant Ingersoll wrote:
2. To create canopies for 1000 documents it took almost 75 minutes.
Though the total number of unique terms in the index is 50,000 each
vector has less than 100 unique terms. (ie each document vector is a
sparse vector of cardinality 50,000 and 100 elements.) The hardware is
admittedly "low-end" with 1G RAM and 1.6GHz dual-core processor.
Hadoop has one node.  Values of T1 and T2 were 80 and 55 respectively,
as given in the sample program.
Have you profiled it?  Would be good to see where the issue is coming from.


Apologies for the late reply.

I ran clustering on 100 documents with the profile flag in Hadoop set to
true. The canopy mapper took an hour and the reducer took 32 minutes to
generate these results. The Canopy Clustering job has yet to finish. Here
are the relevant outputs.

Source: logs/userlogs/attempt_200905111521_0002_m_000000_0/profile.out (Mapper)
rank   self  accum     bytes   objs      bytes      objs  trace name
   1 84.51% 84.51%  99614736      1   99614736         1 304249 byte[]
   2  5.53% 90.05%   6522848 407678 3336600480 208537530 304697 java.lang.Integer
   3  3.34% 93.38%   3932176      1    3932176         1 304252 int[]
   4  3.03% 96.41%   3567216 222951  690373248  43148328 305480 java.lang.Integer
   5  1.11% 97.52%   1310736      1    1310736         1 304250 int[]

Source: logs/userlogs/attempt_200905111521_0002_m_000001_0/profile.out (Mapper)
rank   self  accum     bytes   objs      bytes      objs  trace name
   1 77.67% 77.67%  99614736      1   99614736         1 304245 byte[]
   2 10.66% 88.33%  13676528 854783 2037966768 127372923 304840 java.lang.Integer
   3  5.58% 93.91%   7158048 447378  359948080  22496755 305451 java.lang.Integer
   4  3.07% 96.98%   3932176      1    3932176         1 304274 int[]
   5  1.02% 98.00%   1310736      1    1310736         1 304272 int[]


Source: logs/userlogs/attempt_200905111521_0002_m_000002_0/profile.out (Mapper)
rank   self  accum     bytes objs     bytes  objs trace name
    1 10.16% 10.16%    253112 1594   1140784  6850 300008 char[]
    2  9.07% 19.23%    225936   64    946288   266 300184 byte[]
    3  9.06% 28.29%    225816   64    895128   232 300781 byte[]
    4  2.63% 30.92%     65552    1     65552     1 302380 byte[]
    5  1.97% 32.89%     49048  130    252256   700 300056 byte[]
    6  1.51% 34.39%     37528  260    186896  1229 300086 char[]


Source: logs/userlogs/attempt_200905111521_0002_r_000000_0/profile.out (Reducer)
rank   self  accum     bytes   objs      bytes      objs  trace name
   1 12.29% 12.29%    677088  42318 1811526016 113220376 306902 java.lang.Integer
   2 12.25% 24.53%    674816  42176  108428384   6776774 307108 java.lang.Integer
   3 11.52% 36.05%    634696    102    3574600     10233 300008 char[]
   4 10.64% 46.69%    586128  24422    1804296     75179 306879 java.util.HashMap$Entry
   5  7.09% 53.78%    390752  24422    4535616    283476 306878 java.lang.Double
   6  7.06% 60.84%    389248  24328    4519120    282445 306880 java.lang.Integer
   7  3.96% 64.80%    218224     74     359448      2939 303276 byte[]



Source: logs/userlogs/attempt_200905111521_0002_m_000001_0/profile.out  (Mapper)
rank   self  accum   count trace method
   1 96.85% 96.85%  347772 304838 java.lang.Object.<init>
   2  0.34% 97.18%    1203 305459 java.lang.Integer.hashCode
   3  0.33% 97.51%    1168 304841 java.lang.Integer.hashCode

Source: logs/userlogs/attempt_200905111521_0002_m_000002_0/profile.out (Mapper)
rank   self  accum   count trace method
   1  5.59%  5.59%      32 300866 java.lang.ClassLoader.findBootstrapClass
   2  4.20%  9.79%      24 300859 java.util.zip.ZipFile.read
   3  3.67% 13.46%      21 301341 java.util.TimeZone.getSystemTimeZoneID
   4  2.45% 15.91%      14 300119 java.util.zip.ZipFile.open
   5  2.45% 18.36%      14 301365 java.io.UnixFileSystem.getLength
   6  2.27% 20.63%      13 300857 java.lang.ClassLoader.defineClass1


Source: logs/userlogs/attempt_200905111521_0002_r_000000_0/profile.out (Reducer)
rank   self  accum   count trace method
   1 93.77% 93.77%  236947 304890 java.lang.Object.<init>
   2  1.46% 95.23%    3693 311379 sun.nio.ch.EPollArrayWrapper.epollWait


I also took a heap dump while the mapper was running. 98% of the memory
was used by the byte arrays allocated/referenced in
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.
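
For what it's worth, the sizes of the three single-object allocations in the mapper profiles line up with Hadoop's map-side sort buffer rather than with anything canopy-specific. The sketch below redoes that arithmetic in Python. It assumes the stock settings of Hadoop releases from that period (io.sort.mb = 100, io.sort.record.percent = 0.05), four ints of accounting per record, and a 16-byte per-array overhead in the profiler's numbers. All of these are assumptions, not values read from this job's configuration, and `sort_buffer_arrays` is a hypothetical helper name.

```python
# Assumed defaults; none of these were read from the actual job configuration.
IO_SORT_MB = 100        # io.sort.mb
RECORD_PERCENT = 5      # io.sort.record.percent = 0.05, kept as an integer percentage
ACCT_INTS = 3           # ints of per-record bookkeeping in the index array
REC_BYTES = (ACCT_INTS + 1) * 4   # 16 bytes of accounting per record
ARRAY_OVERHEAD = 16     # assumed per-array overhead as reported by the profiler

def sort_buffer_arrays():
    """Return the expected sizes in bytes of the three map-output buffer arrays."""
    max_mem = IO_SORT_MB << 20                     # 100 MB in bytes
    record_bytes = max_mem * RECORD_PERCENT // 100
    record_bytes -= record_bytes % REC_BYTES       # round down to whole records
    records = record_bytes // REC_BYTES
    return {
        "byte[] serialized output": max_mem - record_bytes + ARRAY_OVERHEAD,
        "int[] record index": records * ACCT_INTS * 4 + ARRAY_OVERHEAD,
        "int[] record offsets": records * 4 + ARRAY_OVERHEAD,
    }

sizes = sort_buffer_arrays()
for name, size in sizes.items():
    print(f"{name}: {size}")
```

Under those assumptions the results are 99614736, 3932176 and 1310736 bytes, the same figures as the byte[] and the two int[] entries in the mapper profiles above. If that reading is right, the large byte array is a fixed-size buffer allocated once per map task, and the numbers worth chasing are the hundreds of millions of java.lang.Integer allocations.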

The document vectors for the input set (of 100 docs) are available here:
http://docs.google.com/Doc?id=dc5kkrf9_110fqtc63c3

I create canopies with the following command.

$ bin/hadoop jar ../mahout-examples-0.1.job \
    org.apache.mahout.clustering.canopy.CanopyClusteringJob test100 \
    output/ org.apache.mahout.utils.EuclideanDistanceMeasure 80 55

The t1 and t2 values are the ones given for the synthetic data
example. Should the values of t1 and t2 affect the runtime
dramatically?

Thanks,

--shashi


