[ https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004037#comment-14004037 ]
Dmitriy Lyubimov commented on MAHOUT-1490:
------------------------------------------

If an inflate/deflate cycle is needed to update an element, I take it that the update changes the entire backing chunk representation, doesn't it? That is what I mean by immutable; Scala's immutable APIs are "mutable" in the same sense (only by applying a functor). This is important because blockwise transformations of a data frame may update its content (vector chunks) at random coordinates. Obviously, an inflate-deflate cycle for _each_ update makes that incredibly inefficient.

You seem to imply a cycle of inflate -> do all local task updates -> deflate again. This is far from the general algorithm pattern of random elementwise gets and sets (in-core operations with getQuick() and setQuick() in Mahout's sense). It also has further profound implications for the distributed plan (e.g. determining the boundaries of a single fused map() operation in order to avoid an inflate-deflate cycle between fused functors, monads, etc.).

So the bottom line I am driving at is this: if we consider a generic algorithm that does a random sequence of element reads and writes, it can't trivially capitalize on the reading speed, because it would essentially have to start working on the inflated representation at, potentially, the first random write. The only thing that more or less works is read-only access and, for the most part, sequential read-only access, which shines at compiling condensed summaries, but that's about it. In that respect it seems awfully similar to SequentialAccessSparseVector (as opposed to RandomAccessSparseVector) in Mahout.

Sequential access usually implies a functor or monad, i.e. immutability of the source and sequential construction of the result. This is the happiest case for this approach, and it is also incredibly common. Nonsequential access may imply both in-place random writes (i.e. writes to the source) and non-in-place random writes (i.e. writes to a separate output). This happens in some cases as well.
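
To make the cost argument concrete, here is a minimal Python sketch, not Mahout code and not the proposed data frame implementation. It assumes a hypothetical chunk that stores float64 values as a zlib-compressed blob; the class name CompressedChunk and its methods are invented for illustration. It counts how many full inflate/deflate cycles a few random writes trigger when applied one at a time versus in a single batch.

```python
import struct
import zlib


class CompressedChunk:
    """Hypothetical vector chunk kept as a zlib-compressed float64 blob."""

    def __init__(self, values):
        self.length = len(values)
        self._blob = self._pack(values)
        self.inflate_count = 0  # full decompressions after construction
        self.deflate_count = 0  # full recompressions after construction

    @staticmethod
    def _pack(values):
        # Serialize as little-endian doubles, then compress the whole chunk.
        return zlib.compress(struct.pack(f"<{len(values)}d", *values))

    def _inflate(self):
        self.inflate_count += 1
        raw = zlib.decompress(self._blob)
        return list(struct.unpack(f"<{self.length}d", raw))

    def _deflate(self, values):
        self.deflate_count += 1
        self._blob = self._pack(values)

    def get(self, i):
        # Even a single random read has to inflate the whole chunk.
        return self._inflate()[i]

    def set_each(self, i, value):
        # One full inflate/deflate cycle per element update.
        values = self._inflate()
        values[i] = value
        self._deflate(values)

    def update_batch(self, updates):
        # Inflate once, apply all local updates, deflate once.
        values = self._inflate()
        for i, value in updates:
            values[i] = value
        self._deflate(values)

    def to_list(self):
        return self._inflate()


updates = [(7, 1.0), (42, 2.0), (999, 3.0)]

per_element = CompressedChunk([0.0] * 1000)
for i, v in updates:
    per_element.set_each(i, v)

batched = CompressedChunk([0.0] * 1000)
batched.update_batch(updates)

print(per_element.inflate_count, per_element.deflate_count)  # 3 3
print(batched.inflate_count, batched.deflate_count)          # 1 1
```

Both chunks end up with the same content, but the per-element path pays one cycle per write. The batched path only helps when the caller knows all of its writes up front, which is exactly what a generic random read/write algorithm does not know.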

The challenge here is to find a balance between, or somehow expose to the client algorithm, the costs of sequential access vs. random access vs. writes for a particular backing implementation.

> Data frame R-like bindings
> --------------------------
>
>                 Key: MAHOUT-1490
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1490
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Saikat Kanjilal
>            Assignee: Dmitriy Lyubimov
>             Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark

--
This message was sent by Atlassian JIRA
(v6.2#6252)