[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028773#comment-14028773 ]
Ted Dunning commented on MAHOUT-1464:
-------------------------------------

Should there be a dedicated colCounts function, or a more general accumulator?

Basically, a row-by-row or column-by-column map-reduce aggregator is a common thing to need. This is different from the aggregateColumns we now have, since what we have now doesn't require access to the entire row.

What I would be more interested in would be something like

{code}
Vector r = v.aggregateByRows(DoubleDoubleFunction combine, DoubleFunction map)
{code}

The virtue here is that iteration by rows is an efficient way to handle row-major arrangements, but iteration by column works as well:

{code}
for (MatrixSlice row : m) {
  for (int i = 0; i < columns; i++) {
    r.setQuick(i, combine.apply(r.getQuick(i), map.apply(row.getQuick(i))));
  }
}
{code}

or

{code}
for (MatrixSlice col : m.columnIterator()) {
  r.setQuick(col.index(), col.aggregate(combine, map));
}
{code}

These are approximate and we don't really have a columnIterator, but you can imagine how some kinds of matrix would have such a thing internally.

You can also see how trivial these would be to parallelize.

Arrangements that have row-wise patches of column-major data would also be easy to handle by combining these patterns.

> Cooccurrence Analysis on Spark
> ------------------------------
>
>                 Key: MAHOUT-1464
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>         Environment: hadoop, spark
>            Reporter: Pat Ferrel
>            Assignee: Pat Ferrel
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM
> can be used as input.
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has
> several applications including cross-action recommendations.

--
This message was sent by Atlassian JIRA (v6.2#6252)