[ https://issues.apache.org/jira/browse/MAHOUT-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robin Anil updated MAHOUT-1190:
-------------------------------
    Attachment: MAHOUT-1190.patch

As it turns out, the non-default iterator on RandomAccessSparseVector was looking up the hashmap for values. By iterating and copying a parallel array, I was able to speed up all operations of RandomAccessSparseVector by 25-45% on dot products and distance measures. It's almost twice as fast as SASV on dot product, even after applying Dan's patch.

> SequentialAccessSparseVector function assignment is very slow
> -------------------------------------------------------------
>
>                 Key: MAHOUT-1190
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1190
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: Dan Filimon
>         Attachments: MAHOUT-1190.patch
>
>
> Currently when calling .assign() on a SASV with another vector and a custom
> function, it will iterate through it and assign every single entry while also
> referring to it by index.
> This makes the process *hugely* expensive. (On a run of BallKMeans on the 20
> newsgroups data set, profiling reveals that 92% of the runtime was spent
> assigning the vectors.)
> Here's a prototype patch:
> https://github.com/dfilimon/mahout/commit/63998d82bb750150a6ae09052dadf6c326c62d3d
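
To illustrate the difference described in the comment above, here is a minimal, self-contained sketch. It is not the attached patch and does not use Mahout's classes: ToySparseVector, dotByLookup and dotByArrayScan are hypothetical names, and the toy open-addressing table only assumes that the vector keeps its indices and values in parallel arrays. One dot product walks the stored indices and does a hash lookup for every value; the other reads the value straight out of the parallel array at the same slot.

```java
import java.util.Arrays;

// Hypothetical illustration only; none of these names come from Mahout.
class SparseDotSketch {

    /** Toy sparse vector: an open-addressing hash table stored as parallel arrays. */
    static final class ToySparseVector {
        private static final int FREE = -1;   // marks an empty slot; indices must be >= 0
        private final int[] keys;             // stored indices
        private final double[] values;        // values[i] belongs to keys[i]
        private int size;

        ToySparseVector(int capacity) {
            keys = new int[capacity];
            values = new double[capacity];
            Arrays.fill(keys, FREE);
        }

        /** Linear probing: returns the slot holding index, or the free slot where it would go. */
        private int slot(int index) {
            int i = Math.floorMod(index * 0x9E3779B9, keys.length);
            while (keys[i] != FREE && keys[i] != index) {
                i = (i + 1) % keys.length;
            }
            return i;
        }

        void set(int index, double value) {
            int i = slot(index);
            if (keys[i] == FREE) {
                // No resizing in this sketch: always keep one slot free so probing terminates.
                if (size + 1 >= keys.length) {
                    throw new IllegalStateException("toy table is full");
                }
                keys[i] = index;
                size++;
            }
            values[i] = value;
        }

        double get(int index) {
            int i = slot(index);
            return keys[i] == index ? values[i] : 0.0;
        }

        /** Slow pattern: visit each stored index, then hash again to fetch its value. */
        double dotByLookup(double[] dense) {
            double sum = 0.0;
            for (int key : keys) {
                if (key != FREE) {
                    sum += get(key) * dense[key];
                }
            }
            return sum;
        }

        /** Faster pattern: read the value from the parallel array at the same slot. */
        double dotByArrayScan(double[] dense) {
            double sum = 0.0;
            for (int i = 0; i < keys.length; i++) {
                if (keys[i] != FREE) {
                    sum += values[i] * dense[keys[i]];
                }
            }
            return sum;
        }
    }

    public static void main(String[] args) {
        ToySparseVector v = new ToySparseVector(8);
        v.set(1, 5.0);
        v.set(3, 2.0);
        double[] dense = {1.0, 2.0, 3.0, 4.0};
        System.out.println(v.dotByLookup(dense));
        System.out.println(v.dotByArrayScan(dense));
    }
}
```

Both methods visit the slots in the same order and so return the same sum; the second simply avoids one probe sequence per stored element, which is the kind of saving the comment attributes to the patch.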