[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004037#comment-14004037
 ] 

Dmitriy Lyubimov commented on MAHOUT-1490:
------------------------------------------

If an inflate/deflate cycle is needed to update an element, I take it that it 
changes the entire backing chunk representation, doesn't it? That's what I mean 
by immutable; Scala's immutable APIs are then mutable in the same sense (only 
by applying a functor).

This is important because blockwise transformations of a data frame may update 
its content (vector chunks) at random coordinates. Obviously, an 
inflate-deflate cycle for _each_ update makes that incredibly inefficient. You 
seem to imply a cycle of inflate -> do all local task updates -> deflate again. 
This is far from the general algorithm pattern of random elementwise gets and 
sets (in-core operations with getQuick() and setQuick() in Mahout's sense). It 
also has further profound implications for the distributed plan (determining 
the boundaries of a single fused map() operation in order to avoid an 
inflate-deflate cycle between fused functors and monads, etc.).
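
To make the cost difference concrete, here is a minimal sketch in plain Java. 
It is not the implementation under discussion: CompressedChunk, set(), update() 
and cycles() are hypothetical names, and java.util.zip only stands in for 
whatever encoding the backing chunk really uses. It counts how many 
inflate/deflate cycles per-element writes pay compared with one batched pass.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.function.Consumer;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical compressed chunk of doubles; not a Mahout or Spark class.
final class CompressedChunk {
    private final int length;  // number of doubles in the chunk
    private byte[] packed;     // deflated bytes backing the chunk
    private int cycles = 0;    // inflate/deflate round trips paid for writes

    CompressedChunk(double[] values) {
        this.length = values.length;
        this.packed = deflate(values);
    }

    // Naive element write: one full inflate/deflate cycle per update.
    void set(int index, double value) {
        update(raw -> raw[index] = value);
    }

    // Batched write: inflate once, apply all updates, deflate once.
    void update(Consumer<double[]> updates) {
        double[] raw = inflate();
        updates.accept(raw);
        packed = deflate(raw);
        cycles++;
    }

    // Decompresses the whole chunk into a fresh array.
    double[] inflate() {
        byte[] bytes = new byte[length * Double.BYTES];
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(packed);
            int offset = 0;
            while (offset < bytes.length && !inflater.finished()) {
                offset += inflater.inflate(bytes, offset, bytes.length - offset);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt chunk", e);
        } finally {
            inflater.end();
        }
        double[] values = new double[length];
        ByteBuffer.wrap(bytes).asDoubleBuffer().get(values);
        return values;
    }

    int cycles() {
        return cycles;
    }

    // Compresses the whole array; there is no way to patch a single element.
    private static byte[] deflate(double[] values) {
        ByteBuffer raw = ByteBuffer.allocate(values.length * Double.BYTES);
        for (double v : values) {
            raw.putDouble(v);
        }
        Deflater deflater = new Deflater();
        deflater.setInput(raw.array());
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }
}
```

In this sketch, two calls to set() pay two cycles, while one update() call that 
touches two elements pays one. The batched form only helps when the algorithm 
can hand over all of its writes as a single function.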

So the bottom line I am driving at is this: if we consider a generic algorithm 
that does a random sequence of element reads and writes, it can't really 
capitalize on the reading speed, because it would essentially have to start 
working on the inflated representation at, potentially, the first random write. 
The only thing that more or less works is read-only access and, for the most 
part, sequential read-only access, which shines at compiling condensed 
summaries, but that's about it. In that, it seems awfully similar to 
SequentialAccessSparseVector (as opposed to RandomAccessSparseVector) in Mahout.
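
A rough sketch of that "inflate at the first random write" behaviour, in plain 
Java. InflateOnWriteVector is a hypothetical class, not Mahout API; its 
getQuick()/setQuick() only mirror Mahout's method names. A pair of sorted 
arrays is assumed as the compact read-only form, and a dense array as the 
inflated one.

```java
import java.util.Arrays;

// Hypothetical sketch; the class and its fields are illustrative only.
final class InflateOnWriteVector {
    private final int size;
    private int[] indices;    // sorted non-zero positions (compact form)
    private double[] values;  // values aligned with indices (compact form)
    private double[] dense;   // inflated form, null until the first write

    InflateOnWriteVector(int size, int[] sortedIndices, double[] values) {
        this.size = size;
        this.indices = sortedIndices.clone();
        this.values = values.clone();
    }

    // Reads work on either form; the compact form uses a binary search.
    double getQuick(int index) {
        if (dense != null) {
            return dense[index];
        }
        int pos = Arrays.binarySearch(indices, index);
        return pos >= 0 ? values[pos] : 0.0;
    }

    // The first write forces inflation; the vector stays inflated afterwards.
    void setQuick(int index, double value) {
        if (dense == null) {
            inflate();
        }
        dense[index] = value;
    }

    boolean isInflated() {
        return dense != null;
    }

    private void inflate() {
        dense = new double[size];
        for (int k = 0; k < indices.length; k++) {
            dense[indices[k]] = values[k];
        }
        indices = null;
        values = null;
    }
}
```

Any read-heavy benefit of the compact form is lost from the first setQuick() 
onward, which is the point made above.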

Sequential access usually implies a functor or monad, i.e. immutability of the 
source and sequential construction of the result. This is the happiest case for 
this approach, and it is also incredibly common.
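
As an illustration of this sequential pattern, a small plain-Java sketch 
follows. SequentialOps, map() and summarize() are hypothetical names, and a 
double[] stands in for the decoded chunk. The source is only read, in order, 
and the result is built in order.

```java
import java.util.function.DoubleUnaryOperator;

// Hypothetical helpers; a double[] stands in for a sequentially decoded chunk.
final class SequentialOps {
    private SequentialOps() {
    }

    // Functor-style map: reads the source in order and never modifies it.
    static double[] map(double[] source, DoubleUnaryOperator f) {
        double[] out = new double[source.length];
        for (int i = 0; i < source.length; i++) {
            out[i] = f.applyAsDouble(source[i]);
        }
        return out;
    }

    // Single-pass condensed summary: returns {count, mean}.
    static double[] summarize(double[] source) {
        double sum = 0.0;
        for (double v : source) {
            sum += v;
        }
        double mean = source.length == 0 ? 0.0 : sum / source.length;
        return new double[] {source.length, mean};
    }
}
```

Neither method needs random access or a write to the source, so neither would 
force an inflate/deflate cycle on the input.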

Nonsequential access may imply both in-place random writes (i.e. writes to the 
source) and out-of-place random writes (i.e. writes to a separate output). This 
happens in some cases as well.

The challenge here is to find a balance between sequential access, random 
access, and writes for a particular backing implementation, or to somehow 
expose their costs to the client algorithm.
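
One possible way to expose such costs, sketched in plain Java. Everything here 
is an assumption made for illustration: the AccessCosts interface, the 
FixedCosts holder, the AccessPlanner class and the relative cost units are all 
hypothetical, and none of it is an existing API.

```java
// Hypothetical cost hints a backing implementation could publish.
interface AccessCosts {
    double sequentialReadCost();
    double randomReadCost();
    double randomWriteCost();
}

// Simple immutable holder for illustrative, made-up cost figures.
final class FixedCosts implements AccessCosts {
    private final double sequentialRead;
    private final double randomRead;
    private final double randomWrite;

    FixedCosts(double sequentialRead, double randomRead, double randomWrite) {
        this.sequentialRead = sequentialRead;
        this.randomRead = randomRead;
        this.randomWrite = randomWrite;
    }

    public double sequentialReadCost() { return sequentialRead; }
    public double randomReadCost() { return randomRead; }
    public double randomWriteCost() { return randomWrite; }
}

final class AccessPlanner {
    enum Plan { IN_PLACE, INFLATE_FIRST }

    private AccessPlanner() {
    }

    // Compares the estimated cost of working on the packed form directly
    // with paying one inflate/deflate cycle and working on the inflated form.
    static Plan choose(AccessCosts packed, AccessCosts inflated,
                       double inflateDeflateCost, long reads, long writes) {
        double inPlace = reads * packed.randomReadCost()
                + writes * packed.randomWriteCost();
        double viaInflate = inflateDeflateCost
                + reads * inflated.randomReadCost()
                + writes * inflated.randomWriteCost();
        return inPlace <= viaInflate ? Plan.IN_PLACE : Plan.INFLATE_FIRST;
    }
}
```

With hints like these, a client algorithm (or the optimizer that fuses map() 
operations) could decide per block whether to stay on the compact form or to 
inflate once up front.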

> Data frame R-like bindings
> --------------------------
>
>                 Key: MAHOUT-1490
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1490
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Saikat Kanjilal
>            Assignee: Dmitriy Lyubimov
>             Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark



