[ https://issues.apache.org/jira/browse/MAHOUT-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071011#comment-14071011 ]
ASF GitHub Bot commented on MAHOUT-1597:
----------------------------------------

Github user dlyubimov commented on the pull request:

    https://github.com/apache/mahout/pull/33#issuecomment-49808110

    @avati no, I don't think so. This is similar to the quick summaries of nrow and ncol, and these need to be known before the RDD chain is constructed. It may be viewed as an architectural problem, as we don't very explicitly define the separation between physical and logical operators (or, rather, every logical operator is also physical, although the inverse is false). So DAG plans should have a private[mahout] collection of properties that help logical rewrites.

    Missing rows should be pertinent to other engines as well, since we ask them to support DRM over HDFS (the drmLoadFromHDFS method), and in persistent form a DRM may have missing implied rows regardless of the engine. The engine, subsequently, may choose to fix it eagerly or lazily -- but it doesn't change the fact that DRMs in Mahout historically may have missing implied rows, as they come out of vectorizers; there's no agreement to the contrary AFAICT.

> A + 1.0 (element-wise scala operation) gives wrong result if rdd is missing rows, Spark side
> --------------------------------------------------------------------------------------------
>
> Key: MAHOUT-1597
> URL: https://issues.apache.org/jira/browse/MAHOUT-1597
> Project: Mahout
> Issue Type: Bug
> Affects Versions: 0.9
> Reporter: Dmitriy Lyubimov
> Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> {code}
> // Concoct an rdd with missing rows
> val aRdd: DrmRdd[Int] = sc.parallelize(
>   0 -> dvec(1, 2, 3) ::
>   3 -> dvec(3, 4, 5) :: Nil
> ).map { case (key, vec) => key -> (vec: Vector)}
> val drmA = drmWrap(rdd = aRdd)
> val controlB = inCoreA + 1.0
> val drmB = drmA + 1.0
> (drmB -: controlB).norm should be < 1e-10
> {code}
> should not fail.
> It was failing because the elementwise scalar operator only evaluates rows actually present in the dataset.
> In the case of Int-keyed row matrices, there are implied rows that may nevertheless be absent from the RDD.
> Our goal is to detect this condition and evaluate the missing rows before any physical operator that doesn't work with missing implied rows.
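
The failure mode described in the issue can be reproduced without Spark or Mahout. The following is a minimal sketch, assuming a plain Scala `Map` stands in for the Int-keyed RDD and that absent keys denote implied all-zero rows. `SparseRows`, `plusScalarNaive`, `fillMissingRows` and `plusScalarFixed` are hypothetical names made up for this illustration; they are not Mahout APIs and this is not necessarily how the actual fix is implemented.

```scala
// Sketch only: plain Scala collections stand in for a distributed row matrix.
// All names here are hypothetical and exist only for illustration.
object MissingRowsSketch {

  // Int-keyed rows; keys absent from the map are implied all-zero rows.
  type SparseRows = Map[Int, Vector[Double]]

  // Naive element-wise scalar add: touches only the rows actually stored,
  // so implied rows never receive the scalar.
  def plusScalarNaive(a: SparseRows, x: Double): SparseRows =
    a.map { case (key, row) => key -> row.map(_ + x) }

  // Materialize every implied row in 0 until nrow as an explicit zero row.
  def fillMissingRows(a: SparseRows, nrow: Int, ncol: Int): SparseRows =
    (0 until nrow).map { key =>
      key -> a.getOrElse(key, Vector.fill(ncol)(0.0))
    }.toMap

  // Fix missing rows first, then apply the scalar operator.
  def plusScalarFixed(a: SparseRows, nrow: Int, ncol: Int, x: Double): SparseRows =
    plusScalarNaive(fillMissingRows(a, nrow, ncol), x)

  def main(args: Array[String]): Unit = {
    // Same shape as the issue's example: rows 0 and 3 stored, rows 1 and 2 implied.
    val a: SparseRows = Map(0 -> Vector(1.0, 2.0, 3.0), 3 -> Vector(3.0, 4.0, 5.0))

    val naive = plusScalarNaive(a, 1.0)
    val fixed = plusScalarFixed(a, nrow = 4, ncol = 3, x = 1.0)

    println(s"naive rows: ${naive.keys.toList.sorted}")
    println(s"fixed rows: ${fixed.keys.toList.sorted}")
    println(s"fixed row 1: ${fixed(1)}")
  }
}
```

In this sketch the naive result still holds only rows 0 and 3, whereas the fixed result holds all four rows, with rows 1 and 2 equal to the scalar in every column. Note that filling requires `nrow` and `ncol` up front, which mirrors the point in the comment above that such summaries have to be known before the operator chain is built.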