On Wed, May 6, 2009 at 6:45 AM, Grant Ingersoll <[email protected]> wrote:
>
>>
>> 2. To create canopies for 1000 documents it took almost 75 minutes.
>> Though the total number of unique terms in the index is 50,000 each
>> vector has less than 100 unique terms. (ie each document vector is a
>> sparse vector of cardinality 50,000 and 100 elements.) The hardware is
>> admittedly "low-end" with 1G RAM and 1.6GHz dual-core processor.
>> Hadoop has one node. Values of T1 and T2 were 80 and 55 respectively,
>> as given in the sample program.
>
> Have you profiled it? Would be good to see where the issue is coming from.
>
Apologies for the late reply.
I ran the clustering on 100 documents with Hadoop's profile flag set to
true. The canopy mapper took an hour and the reducer took 32 minutes to
generate these results. The Canopy Clustering job has yet to finish. Here
are the relevant outputs.
Source: logs/userlogs/attempt_200905111521_0002_m_000000_0/profile.out (Mapper)
rank   self  accum     bytes   objs      bytes      objs  trace name
   1 84.51% 84.51%  99614736      1   99614736         1 304249 byte[]
   2  5.53% 90.05%   6522848 407678 3336600480 208537530 304697 java.lang.Integer
   3  3.34% 93.38%   3932176      1    3932176         1 304252 int[]
   4  3.03% 96.41%   3567216 222951  690373248  43148328 305480 java.lang.Integer
   5  1.11% 97.52%   1310736      1    1310736         1 304250 int[]
Source: logs/userlogs/attempt_200905111521_0002_m_000001_0/profile.out (Mapper)
rank   self  accum     bytes   objs      bytes      objs  trace name
   1 77.67% 77.67%  99614736      1   99614736         1 304245 byte[]
   2 10.66% 88.33%  13676528 854783 2037966768 127372923 304840 java.lang.Integer
   3  5.58% 93.91%   7158048 447378  359948080  22496755 305451 java.lang.Integer
   4  3.07% 96.98%   3932176      1    3932176         1 304274 int[]
   5  1.02% 98.00%   1310736      1    1310736         1 304272 int[]
Source: logs/userlogs/attempt_200905111521_0002_m_000002_0/profile.out (Mapper)
rank   self  accum     bytes   objs      bytes      objs  trace name
   1 10.16% 10.16%    253112   1594    1140784      6850 300008 char[]
   2  9.07% 19.23%    225936     64     946288       266 300184 byte[]
   3  9.06% 28.29%    225816     64     895128       232 300781 byte[]
   4  2.63% 30.92%     65552      1      65552         1 302380 byte[]
   5  1.97% 32.89%     49048    130     252256       700 300056 byte[]
   6  1.51% 34.39%     37528    260     186896      1229 300086 char[]
Source: logs/userlogs/attempt_200905111521_0002_r_000000_0/profile.out (Reducer)
rank   self  accum     bytes   objs      bytes      objs  trace name
   1 12.29% 12.29%    677088  42318 1811526016 113220376 306902 java.lang.Integer
   2 12.25% 24.53%    674816  42176  108428384   6776774 307108 java.lang.Integer
   3 11.52% 36.05%    634696    102    3574600     10233 300008 char[]
   4 10.64% 46.69%    586128  24422    1804296     75179 306879 java.util.HashMap$Entry
   5  7.09% 53.78%    390752  24422    4535616    283476 306878 java.lang.Double
   6  7.06% 60.84%    389248  24328    4519120    282445 306880 java.lang.Integer
   7  3.96% 64.80%    218224     74     359448      2939 303276 byte[]
Source: logs/userlogs/attempt_200905111521_0002_m_000001_0/profile.out (Mapper)
rank   self  accum   count  trace method
   1 96.85% 96.85%  347772 304838 java.lang.Object.<init>
   2  0.34% 97.18%    1203 305459 java.lang.Integer.hashCode
   3  0.33% 97.51%    1168 304841 java.lang.Integer.hashCode
Source: logs/userlogs/attempt_200905111521_0002_m_000002_0/profile.out (Mapper)
rank   self  accum   count  trace method
   1  5.59%  5.59%      32 300866 java.lang.ClassLoader.findBootstrapClass
   2  4.20%  9.79%      24 300859 java.util.zip.ZipFile.read
   3  3.67% 13.46%      21 301341 java.util.TimeZone.getSystemTimeZoneID
   4  2.45% 15.91%      14 300119 java.util.zip.ZipFile.open
   5  2.45% 18.36%      14 301365 java.io.UnixFileSystem.getLength
   6  2.27% 20.63%      13 300857 java.lang.ClassLoader.defineClass1
Source: logs/userlogs/attempt_200905111521_0002_r_000000_0/profile.out (Reducer)
rank   self  accum   count  trace method
   1 93.77% 93.77%  236947 304890 java.lang.Object.<init>
   2  1.46% 95.23%    3693 311379 sun.nio.ch.EPollArrayWrapper.epollWait
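One pattern that would be consistent with the numbers above (hundreds of
millions of short-lived java.lang.Integer allocations, with almost all CPU
samples in java.lang.Object.<init>) is a distance loop that visits every one
of the 50,000 indices and boxes each index for a map lookup. Whether the
distance measure used here actually does this is an assumption, not something
the profile proves, and the class and method names below are hypothetical.
A minimal sketch comparing such a dense walk with a walk over only the
non-zero entries:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not Mahout code: sparse vectors are modelled as
// Map<Integer, Double> from term index to weight.
class SparseDistanceSketch {

    // Dense walk: one boxed Integer per index per lookup, so about
    // 2 * cardinality lookups no matter how few entries are non-zero.
    static double denseWalk(Map<Integer, Double> a, Map<Integer, Double> b, int cardinality) {
        double sum = 0.0;
        for (int i = 0; i < cardinality; i++) {
            Double x = a.get(i); // i is autoboxed to an Integer here
            Double y = b.get(i);
            double d = (x == null ? 0.0 : x) - (y == null ? 0.0 : y);
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Sparse walk: touches only indices that are non-zero in a or b.
    static double sparseWalk(Map<Integer, Double> a, Map<Integer, Double> b) {
        Set<Integer> keys = new HashSet<Integer>(a.keySet());
        keys.addAll(b.keySet());
        double sum = 0.0;
        for (Integer k : keys) {
            Double x = a.get(k);
            Double y = b.get(k);
            double d = (x == null ? 0.0 : x) - (y == null ? 0.0 : y);
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```

Both methods compute the same Euclidean distance. For two vectors with fewer
than 100 non-zero terms each, the sparse walk visits at most 200 keys, while
the dense walk performs 100,000 lookups per distance computation.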
I also took a heap dump while the mapper was running. 98% of the memory was
used by the byte arrays allocated/referenced in
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.
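The sizes of the three large arrays in the mapper profiles look like the
map-side sort buffer rather than anything allocated by the clustering code.
The sketch below assumes the Hadoop defaults io.sort.mb=100 and
io.sort.record.percent=0.05, 16 bytes of bookkeeping per record, and about 16
bytes of JVM array overhead. None of these is confirmed for this setup, and
the class and constant names are made up for illustration.

```java
// Hypothetical sketch of how the map output buffer could be sized.
// Assumed values: io.sort.mb = 100, io.sort.record.percent = 0.05.
class SortBufferSizes {
    static final int ACCT_INTS = 3;                    // assumed ints of accounting data per record
    static final int REC_BYTES = (ACCT_INTS + 1) * 4;  // 16 bytes of bookkeeping per record

    // Returns {data buffer length in bytes, offsets array length, indices array length}.
    static int[] sizes(int sortMb, float recordPercent) {
        int maxMem = sortMb << 20;                          // MB to bytes
        int recordBytes = (int) (maxMem * recordPercent);   // share reserved for bookkeeping
        recordBytes -= recordBytes % REC_BYTES;             // round down to whole records
        int records = recordBytes / REC_BYTES;
        return new int[] { maxMem - recordBytes, records, records * ACCT_INTS };
    }
}
```

Under those assumptions, sizes(100, 0.05f) gives a byte[] of 99614720 bytes
and int[] arrays of 327680 and 983040 elements. Adding 16 bytes of array
overhead to each yields 99614736, 1310736 and 3932176 bytes, which matches
the byte[] and int[] entries in the first two mapper profiles. If that
reading is right, the large byte[] is a fixed-size buffer, and the
java.lang.Integer entries are the more interesting lines.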
The document vectors for the input set (of 100 docs) are available here:
http://docs.google.com/Doc?id=dc5kkrf9_110fqtc63c3
I create canopies with the following command:
$ bin/hadoop jar ../mahout-examples-0.1.job \
    org.apache.mahout.clustering.canopy.CanopyClusteringJob test100 \
    output/ org.apache.mahout.utils.EuclideanDistanceMeasure 80 55
The T1 and T2 values are the ones given for the synthetic data
example. Should the values of T1 and T2 affect the runtime
dramatically?
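To make the question concrete, here is a minimal sketch of the textbook
canopy pass. It is not Mahout's implementation; the class name and the
returned counters are hypothetical. It shows where the thresholds enter:
T2 decides how many points get removed per pass, and therefore how many
canopies are created and how many distances are computed.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the textbook canopy pass (assumes t1 > t2).
class CanopySketch {
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns {canopies created, distance computations, canopy memberships}.
    static long[] run(List<double[]> points, double t1, double t2) {
        List<double[]> remaining = new ArrayList<double[]>(points);
        long canopies = 0, distanceCalls = 0, memberships = 0;
        while (!remaining.isEmpty()) {
            double[] center = remaining.remove(0); // next unclaimed point seeds a canopy
            canopies++;
            Iterator<double[]> it = remaining.iterator();
            while (it.hasNext()) {
                double d = euclidean(center, it.next());
                distanceCalls++;
                if (d < t1) {
                    memberships++;   // loosely bound: point joins this canopy
                }
                if (d < t2) {
                    it.remove();     // tightly bound: point can no longer seed a canopy
                }
            }
        }
        return new long[] { canopies, distanceCalls, memberships };
    }
}
```

With 10 one-dimensional points spaced 1 apart, thresholds of 100 and 50 give
1 canopy after 9 distance computations, while thresholds of 0.5 and 0.25 give
10 canopies after 45. In this sketch, thresholds that are small relative to
the typical distance between documents push the cost from linear toward
quadratic in the number of points.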
Thanks,
--shashi