[ https://issues.apache.org/jira/browse/MAHOUT-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631235#comment-13631235 ]
Robin Anil commented on MAHOUT-1190:
------------------------------------

A few more good updates:

1) I moved the iterator logic of RASV into OpenIntDoubleHashmap. This removes the extra copy that was needed and iterates directly over the hashmap arrays. It gives an extra bump of 5-10% for RASV on the benchmarks. Overall, the benchmarks have improved 30-55% for RASV.

2) The dot product for SASV was inefficiently implemented as well. I also noticed that the dot product has some magic constants. We need to remove those, as tests on one machine are not an indicator of overall performance. Rewriting the dot product for SASV pulled it ahead of RASV by 40% (even with all the optimizations listed above).

3) The benchmark code was running at high density. After increasing the cardinality to 100K and keeping the doc length at 1000 (like a text corpus), the benchmarks look more realistic.

Benchmark running at 100K cardinality and 1K doc length.

{noformat}
BenchMarks  DenseVector             RandSparseVector        SeqSparseVector         Dense.fn(Rand)          Dense.fn(Seq)           Rand.fn(Dense)          Rand.fn(Seq)            Seq.fn(Dense)           Seq.fn(Rand)

Create (copy)
            nCalls = 20000;         nCalls = 20000;         nCalls = 20000;
            sum = 2.898515s;        sum = 1.598739s;        sum = 0.744008s;
            min = 0.083ms;          min = 0.059ms;          min = 0.027ms;
            max = 43.645ms;         max = 14.333ms;         max = 21.283ms;
            mean = 0.144925ms;      mean = 0.079936ms;      mean = 0.0372ms;
            stdDev = 0.95928ms;     stdDev = 0.173512ms;    stdDev = 0.150829ms;
            Speed = 6900.085 /sec   Speed = 12509.859 /sec  Speed = 26881.432 /sec
            Rate = 82.801025 MB/s   Rate = 150.11832 MB/s   Rate = 322.57718 MB/s

DotProduct
            nCalls = 20000;         nCalls = 20000;         nCalls = 20000;         nCalls = 20000;         nCalls = 20000;         nCalls = 20000;         nCalls = 20000;         nCalls = 20000;         nCalls = 20000;
            sum = 2.077209s;        sum = 1.117693s;        sum = 0.712275s;        sum = 1.932336s;        sum = 3.017146s;        sum = 1.910794s;        sum = 1.990205s;        sum = 3.039884s;        sum = 0.769796s;
            min = 0.084ms;          min = 0.038ms;          min = 0.016ms;          min = 0.072ms;          min = 0.127ms;          min = 0.078ms;          min = 0.086ms;          min = 0.131ms;          min = 0.022ms;
            max = 5.707ms;          max = 5.051ms;          max = 0.984ms;          max = 0.419ms;          max = 6.065ms;          max = 0.221ms;          max = 0.614ms;          max = 0.913ms;          max = 0.31ms;
            mean = 0.10386ms;       mean = 0.055884ms;      mean = 0.035613ms;      mean = 0.096616ms;      mean = 0.150857ms;      mean = 0.095539ms;      mean = 0.09951ms;       mean = 0.151994ms;      mean = 0.038489ms;
            stdDev = 0.040833ms;    stdDev = 0.055292ms;    stdDev = 0.034072ms;    stdDev = 0.020768ms;    stdDev = 0.061155ms;    stdDev = 0.005271ms;    stdDev = 0.0227ms;      stdDev = 0.037256ms;    stdDev = 0.024026ms;
            Speed = 9628.304 /sec   Speed = 17894.0 /sec    Speed = 28079.041 /sec  Speed = 10350.167 /sec  Speed = 6628.781 /sec   Speed = 10466.853 /sec  Speed = 10049.216 /sec  Speed = 6579.198 /sec   Speed = 25980.91 /sec
            Rate = 115.53966 MB/s   Rate = 214.72801 MB/s   Rate = 336.94852 MB/s   Rate = 124.202 MB/s     Rate = 79.54537 MB/s    Rate = 125.60224 MB/s   Rate = 120.59059 MB/s   Rate = 78.950386 MB/s   Rate = 311.77094 MB/s

org.apache.mahout.common.distance.CosineDistanceMeasure
            nCalls = 20000;         nCalls = 20000;         nCalls = 20000;         nCalls = 20000;         nCalls = 20000;         nCalls = 20000;         nCalls = 20000;         nCalls = 20000;         nCalls = 20000;
            sum = 20.332047s;       sum = 10.604119s;       sum = 9.226591s;        sum = 15.916052s;       sum = 27.26601s;        sum = 15.092329s;       sum = 7.213573s;        sum = 29.085011s;       sum = 19.878921s;
            min = 0.927ms;          min = 0.47ms;           min = 0.395ms;          min = 0.725ms;          min = 1.261ms;          min = 0.584ms;          min = 0.327ms;          min = 1.341ms;          min = 0.919ms;
            max = 4.847ms;          max = 3.641ms;          max = 5.204ms;          max = 1.338ms;          max = 6.759ms;          max = 1.486ms;          max = 0.677ms;          max = 2.713ms;          max = 53.207ms;
            mean = 1.016602ms;      mean = 0.530205ms;      mean = 0.461329ms;      mean = 0.795802ms;      mean = 1.3633ms;        mean = 0.754616ms;      mean = 0.360678ms;      mean = 1.45425ms;       mean = 0.993946ms;
            stdDev = 0.0485ms;      stdDev = 0.061561ms;    stdDev = 0.120737ms;    stdDev = 0.032558ms;    stdDev = 0.085586ms;    stdDev = 0.08194ms;     stdDev = 0.039636ms;    stdDev = 0.081356ms;    stdDev = 0.546942ms;
            Speed = 983.6688 /sec   Speed = 1886.0596 /sec  Speed = 2167.6477 /sec  Speed = 1256.593 /sec   Speed = 733.51404 /sec  Speed = 1325.1765 /sec  Speed = 2772.551 /sec   Speed = 687.6394 /sec   Speed = 1006.0908 /sec
            Rate = 11.804027 MB/s   Rate = 22.632715 MB/s   Rate = 26.011774 MB/s   Rate = 15.079117 MB/s   Rate = 8.802169 MB/s    Rate = 15.902119 MB/s   Rate = 33.270615 MB/s   Rate = 8.251673 MB/s    Rate = 12.073091 MB/s

org.apache.mahout.common.distance.EuclideanDistanceMeasure
            nCalls = 20000;         nCalls = 20000;         nCalls = 20000;         nCalls = 20000;         nCalls = 20000;         nCalls = 20000;         nCalls = 20000;         nCalls = 20000;         nCalls = 20000;
            sum = 20.754439s;       sum = 11.289392s;       sum = 9.224848s;        sum = 16.26417s;        sum = 27.594554s;       sum = 15.53272s;        sum = 7.403939s;        sum = 29.404334s;       sum = 19.927117s;
            min = 0.932ms;          min = 0.51ms;           min = 0.419ms;          min = 0.663ms;          min = 1.266ms;          min = 0.571ms;          min = 0.326ms;          min = 1.338ms;          min = 0.919ms;
            max = 2.279ms;          max = 4.234ms;          max = 14.486ms;         max = 1.672ms;          max = 2.89ms;           max = 1.745ms;          max = 0.973ms;          max = 3.133ms;          max = 1.858ms;
            mean = 1.037721ms;      mean = 0.564469ms;      mean = 0.461242ms;      mean = 0.813208ms;      mean = 1.379727ms;      mean = 0.776636ms;      mean = 0.370196ms;      mean = 1.470216ms;      mean = 0.996355ms;
            stdDev = 0.07615ms;     stdDev = 0.063422ms;    stdDev = 0.104271ms;    stdDev = 0.062487ms;    stdDev = 0.092051ms;    stdDev = 0.106946ms;    stdDev = 0.056665ms;    stdDev = 0.093252ms;    stdDev = 0.064435ms;
            Speed = 963.6493 /sec   Speed = 1771.5746 /sec  Speed = 2168.0574 /sec  Speed = 1229.6969 /sec  Speed = 724.7807 /sec   Speed = 1287.6045 /sec  Speed = 2701.265 /sec   Speed = 680.1718 /sec   Speed = 1003.6574 /sec
            Rate = 11.563792 MB/s   Rate = 21.258896 MB/s   Rate = 26.01669 MB/s    Rate = 14.756363 MB/s   Rate = 8.697369 MB/s    Rate = 15.451254 MB/s   Rate = 32.41518 MB/s    Rate = 8.162063 MB/s    Rate = 12.04389 MB/s
{noformat}

> SequentialAccessSparseVector function assignment is very slow
> -------------------------------------------------------------
>
>                 Key: MAHOUT-1190
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1190
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: Dan Filimon
>         Attachments: MAHOUT-1190-1.patch, MAHOUT-1190.patch
>
>
> Currently, when calling .assign() on a SASV with another vector and a custom
> function, it will iterate through it and assign every single entry while also
> referring to it by index.
> This makes the process *hugely* expensive. (On a run of BallKMeans on the 20
> newsgroups data set, profiling reveals that 92% of the runtime was spent
> assigning the vectors.)
> Here's a prototype patch:
> https://github.com/dfilimon/mahout/commit/63998d82bb750150a6ae09052dadf6c326c62d3d
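
For readers following point 2 of the comment, here is a minimal sketch of how a dot product over two sequentially ordered sparse vectors can avoid per-index lookups. It is not the code from the attached patches or from Mahout itself. The class name SparseDotSketch, the method dot, and the parallel-array layout (sorted indices plus matching values) are all assumptions made for illustration. The idea is a single merge-style pass over both index arrays, so the cost is proportional to the number of non-zero entries rather than to the cardinality.

```java
/**
 * Hypothetical illustration only: a merge-style dot product for two sparse
 * vectors whose non-zero entries are stored as parallel arrays with strictly
 * increasing indices. This is NOT Mahout's SequentialAccessSparseVector code.
 */
public final class SparseDotSketch {
    private SparseDotSketch() {}

    /**
     * Walks both index arrays once, multiplying values only where the
     * indices match. Runs in O(nnz(a) + nnz(b)) with no lookups by index.
     */
    public static double dot(int[] idxA, double[] valA, int[] idxB, double[] valB) {
        double sum = 0.0;
        int i = 0;
        int j = 0;
        while (i < idxA.length && j < idxB.length) {
            if (idxA[i] == idxB[j]) {
                // Both vectors have a non-zero at this index.
                sum += valA[i] * valB[j];
                i++;
                j++;
            } else if (idxA[i] < idxB[j]) {
                // Only a has an entry here; it contributes nothing.
                i++;
            } else {
                // Only b has an entry here; it contributes nothing.
                j++;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] idxA = {1, 4, 7};
        double[] valA = {2.0, 3.0, 5.0};
        int[] idxB = {4, 7, 9};
        double[] valB = {10.0, 0.5, 1.0};
        // Indices 4 and 7 overlap: 3.0 * 10.0 + 5.0 * 0.5
        System.out.println(dot(idxA, valA, idxB, valB));
    }
}
```

The same single-pass walk over non-zero entries is one way to think about the .assign() problem in the issue description: visiting only stored entries in order, instead of touching every index, keeps the work tied to document length (1K here) rather than cardinality (100K).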