[ https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067985#comment-14067985 ]
ASF GitHub Bot commented on MAHOUT-1541:
----------------------------------------

Github user pferrel commented on a diff in the pull request:

    https://github.com/apache/mahout/pull/31#discussion_r15150741

    --- Diff: spark/src/main/scala/org/apache/mahout/drivers/IndexedDataset.scala ---
    @@ -39,14 +40,29 @@ import org.apache.mahout.math.drm.CheckpointedDrm
      * to be not created when not needed.
      */
    -case class IndexedDataset(matrix: CheckpointedDrm[Int], rowIDs: BiMap[String,Int], columnIDs: BiMap[String,Int]) {
    +case class IndexedDataset(var matrix: CheckpointedDrm[Int], var rowIDs: BiMap[String,Int], var columnIDs: BiMap[String,Int]) {
    +
    +  // we must allow the row dimension to be adjusted in the case where the data read in is incomplete and we
    +  // learn this afterwards
    +
    +  /**
    +   * Adds the equivalent of blank rows to the sparse CheckpointedDrm, which only changes the row cardinality value.
    +   * No physical changes are made to the underlying drm.
    +   * @param n number to increase row carnindality by
    +   * @note should be done before any BLAS optimizer actions are performed on the matrix or you'll get unpredictable
    +   * results.
    +   */
    +  def addToRowCardinality(n: Int): Unit = {
    +    assert(n > -1)
    +    matrix.asInstanceOf[CheckpointedDrmSpark[Int]].addToRowCardinality(n)
    +  }
     }
    --- End diff --

    This supports an immutable CheckpointedDrm. It will create a new
    CheckpointedDrm whose row cardinality leaves some rows with no
    representation in the underlying rdd. The tests that use this seem to
    work, but that may be an accident of the test data, and this may not be
    a good implementation. The question is whether a CheckpointedDrm needs
    to be backed by an rdd with n => {} for all zero-valued rows, or whether
    the absence of anything in the rdd (in other words, a gap in the row key
    sequence) is equivalent to inserting an n => {}.
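To make the question concrete, here is a minimal self-contained sketch. It is not Mahout code: `SparseRows`, `withExplicitEmptyRows`, `colSums`, `storedRowCount` and `GapDemo` are hypothetical names invented for illustration, with a plain Scala `Map` standing in for the backing rdd and `nrow` standing in for the declared row cardinality. Under those assumptions it shows one operation that cannot tell a gap from an explicit empty row, and one that can.

```scala
// Hypothetical stand-in for a row-keyed sparse matrix; NOT Mahout's CheckpointedDrm.
// `rows` plays the role of the backing rdd, `nrow` the declared row cardinality.
final case class SparseRows(rows: Map[Int, Map[Int, Double]], nrow: Int, ncol: Int) {
  require(rows.keys.forall(k => k >= 0 && k < nrow), "row key outside declared cardinality")

  // Returns a new value with a larger declared cardinality; the stored rows are untouched.
  def addToRowCardinality(n: Int): SparseRows = {
    require(n >= 0, "n must be non-negative")
    copy(nrow = nrow + n)
  }

  // Same matrix, but with an explicit empty row stored for every missing key.
  def withExplicitEmptyRows: SparseRows =
    copy(rows = (0 until nrow).map(k => k -> rows.getOrElse(k, Map.empty[Int, Double])).toMap)

  // Column sums only visit stored non-zeros, so gaps cannot change the result.
  def colSums: Vector[Double] = {
    val sums = Array.fill(ncol)(0.0)
    for ((_, row) <- rows; (c, v) <- row) sums(c) += v
    sums.toVector
  }

  // A row count derived from stored records is where a gap becomes visible.
  def storedRowCount: Int = rows.size
}

object GapDemo {
  val base   = SparseRows(Map(0 -> Map(1 -> 2.0), 1 -> Map(0 -> 1.0, 1 -> 3.0)), nrow = 2, ncol = 2)
  val gapped = base.addToRowCardinality(2)   // rows 2 and 3 have no backing record
  val padded = gapped.withExplicitEmptyRows  // rows 2 and 3 stored as empty maps

  def main(args: Array[String]): Unit = {
    println(gapped.colSums == padded.colSums) // true
    println(gapped.storedRowCount)            // 2
    println(padded.storedRowCount)            // 4
  }
}
```

In this toy model the two representations agree wherever an operation consults only the stored non-zeros and the declared `nrow`, and they diverge wherever a size is inferred from the stored records. Whether the real optimizer and operators ever do the latter is the open question raised above; the sketch does not answer it.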

> Create CLI Driver for Spark Cooccurrence Analysis
> -------------------------------------------------
>
>                 Key: MAHOUT-1541
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1541
>             Project: Mahout
>          Issue Type: New Feature
>          Components: CLI
>            Reporter: Pat Ferrel
>            Assignee: Pat Ferrel
>
> Create a CLI driver to import data in a flexible manner, create an
> IndexedDataset with BiMap ID translation dictionaries, call the Spark
> CooccurrenceAnalysis with the appropriate params, then write output with
> external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy mr does but will
> support reading externally defined IDs and flexible formats. Output will be
> of the legacy format or text files of the user's specification with
> reattached Item IDs.
> Support for legacy formats is a question, users can always use the legacy
> code if they want this. Internal to the IndexedDataset is a Spark DRM so
> pipelining can be accomplished without any writing to an actual file so the
> legacy sequence file output may not be needed.
> Opinions?

--
This message was sent by Atlassian JIRA
(v6.2#6252)