[ 
https://issues.apache.org/jira/browse/MAHOUT-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071011#comment-14071011
 ] 

ASF GitHub Bot commented on MAHOUT-1597:
----------------------------------------

Github user dlyubimov commented on the pull request:

    https://github.com/apache/mahout/pull/33#issuecomment-49808110
  
    @avati No, I don't think so. This is similar to the quick summaries of 
nrow and ncol, and these need to be known before the RDD chain is constructed. 
    
    It may be viewed as an architectural problem: we don't explicitly define a 
separation between physical and logical operators (or rather, every logical 
operator is also physical, although the inverse is false). So DAG plans should 
carry a private[mahout] collection of properties that help logical 
rewrites. 
    
    Missing rows should be pertinent to other engines as well, since we ask 
them to support DRM over HDFS (the drmLoadFromHDFS method), and in persistent 
form a DRM may have missing implied rows regardless of the engine. The engine 
may then choose to fix this eagerly or lazily -- but that doesn't change the 
fact that DRMs in Mahout, as they come out of the vectorizers, historically 
may have missing implied rows; there's no agreement to the contrary AFAICT.
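
To make the eager-fix idea concrete, here is a minimal sketch using plain 
Scala collections in place of a Spark DrmRdd (the names `hasMissingRows` and 
`fixMissingRows` are illustrative helpers, not Mahout API): detect whether an 
Int-keyed row set has fewer physical rows than its declared nrow, and if so 
materialize the implied rows as zero vectors.

```scala
// Illustrative sketch only: a Map[Int, Vector[Double]] (Scala's immutable
// Vector, not Mahout's math Vector) stands in for a DrmRdd[Int].
object EagerFixSketch {
  type Row = Vector[Double]

  // Detect the condition: fewer physical rows than the declared row count.
  def hasMissingRows(a: Map[Int, Row], nrow: Int): Boolean = a.size < nrow

  // Eager fix: materialize every implied row as a zero vector, leaving
  // physically present rows untouched.
  def fixMissingRows(a: Map[Int, Row], nrow: Int, ncol: Int): Map[Int, Row] =
    (0 until nrow).map(i => i -> a.getOrElse(i, Vector.fill(ncol)(0.0))).toMap
}
```

An engine could run this check once, before constructing physical operators 
that assume dense row keys, and skip the fix entirely when no rows are missing.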


> A + 1.0 (element-wise scalar operation) gives wrong result if RDD is missing 
> rows, Spark side
> --------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1597
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1597
>             Project: Mahout
>          Issue Type: Bug
>    Affects Versions: 0.9
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 1.0
>
>
> {code}
>     // Concoct an rdd with missing rows
>     val aRdd: DrmRdd[Int] = sc.parallelize(
>       0 -> dvec(1, 2, 3) ::
>           3 -> dvec(3, 4, 5) :: Nil
>     ).map { case (key, vec) => key -> (vec: Vector)}
>     val drmA = drmWrap(rdd = aRdd)
>     val controlB = inCoreA + 1.0
>     val drmB = drmA + 1.0
>     (drmB -: controlB).norm should be < 1e-10
> {code}
> This should not fail.
> It was failing because the elementwise scalar operator only evaluates rows 
> actually present in the dataset. 
> In the case of Int-keyed row matrices, there may be implied rows that are 
> not physically present in the RDD. 
> Our goal is to detect this condition and evaluate the missing rows before 
> running physical operators that don't work with missing implied rows.
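
To illustrate the failure mode outside of Spark, here is a minimal sketch 
with plain Scala collections (a Map standing in for the DrmRdd; none of these 
names are the actual Mahout operators): applying + 1.0 only to physically 
present rows leaves the implied zero rows missing, whereas the correct result 
has 1.0 in every cell of those rows.

```scala
// Illustrative only: Map[Int, Vector[Double]] stands in for a DrmRdd[Int].
val nrow = 4
val ncol = 3
val a: Map[Int, Vector[Double]] = Map(
  0 -> Vector(1.0, 2.0, 3.0), // rows 1 and 2 are implied (all-zero), not stored
  3 -> Vector(3.0, 4.0, 5.0)
)

// Buggy A + 1.0: maps only over rows actually present in the dataset,
// so the implied rows never receive the scalar.
val buggy: Map[Int, Vector[Double]] =
  a.map { case (k, v) => k -> v.map(_ + 1.0) }

// Correct A + 1.0: implied rows participate too, so row 1 becomes all ones.
val correct: Map[Int, Vector[Double]] =
  (0 until nrow).map { i =>
    i -> a.getOrElse(i, Vector.fill(ncol)(0.0)).map(_ + 1.0)
  }.toMap
```

The buggy result simply has no entry for rows 1 and 2, which is exactly the 
discrepancy the `(drmB -: controlB).norm` check in the test above catches.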



--
This message was sent by Atlassian JIRA
(v6.2#6252)
