[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003903#comment-14003903
 ] 

Anand Avati commented on MAHOUT-1490:
-------------------------------------

[~dlyubimov], Compression does not make it read-only, certainly not read-only 
like Spark's RDD. Data in a Frame is mutable. Depending on the type of update, 
the write is either cheap (when the new value can replace the old value in 
place) or expensive (inflate, update1, update2, update3 .. deflate), but either 
way it happens transparently behind the scenes; the user just calls set(). 
However, for the DSL backend I intend to _not_ mutate Frames and to treat them 
as read-only, to stay compatible with the Spark RDD model (even though that 
might not be the most efficient choice in certain cases in terms of 
performance).
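
For illustration only, here is a minimal Java sketch of that cheap-vs-expensive 
update path. The class and method names are hypothetical, not the actual 
Frame/Chunk API; it just shows how set() can hide both paths from the caller:

    // Hypothetical sketch of a compressed chunk and the update paths above.
    abstract class Chunk {
      // Cheap path: returns true if the new value fits the current
      // compressed representation and was written in place.
      abstract boolean setInPlace(int row, double value);
      // Expensive path, step 1: decompress into a plain buffer.
      abstract double[] inflate();
      // Expensive path, step 2: re-compress an updated buffer.
      abstract Chunk deflate(double[] values);
    }

    class Frame {
      private final Chunk[] chunks;      // one compressed chunk per block of rows
      private final double[][] pending;  // inflated buffers awaiting deflate

      Frame(Chunk[] chunks) {
        this.chunks = chunks;
        this.pending = new double[chunks.length][];
      }

      // User-facing mutation: the caller never sees which path was taken.
      void set(int chunkIdx, int row, double value) {
        if (pending[chunkIdx] == null && chunks[chunkIdx].setInPlace(row, value))
          return;                                          // cheap, in place
        if (pending[chunkIdx] == null)
          pending[chunkIdx] = chunks[chunkIdx].inflate();  // inflate once
        pending[chunkIdx][row] = value;                    // update1, update2, ...
      }

      // Deflate any inflated chunks back to compressed form.
      void close() {
        for (int i = 0; i < chunks.length; i++)
          if (pending[i] != null) {
            chunks[i] = chunks[i].deflate(pending[i]);
            pending[i] = null;
          }
      }
    }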

Data access is constant time for dense compressed data, with negligible 
decompression overhead (one multiplication and one addition instruction, with 
operands in registers). The chunk header records the scale-down factor of the 
compression, so fetching the compressed value is a deterministic offset lookup 
as well. For sparse data, however, the worst case is a binary search to find 
the physical offset within a Chunk, though there are optimizations that make 
further accesses in the same vicinity happen in constant time.
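
A rough Java sketch of the two access paths, again with hypothetical names and 
layouts rather than the real chunk formats:

    // Dense: fixed-width scaled-down elements behind a small header.
    class DenseScaledChunk {
      byte[] mem;        // header followed by elementBytes-wide values
      int headerSize;
      int elementBytes;  // e.g. 1, 2 or 4 bytes per element after scaling down
      double scale, bias;

      double at(int row) {
        // Deterministic offset lookup from the header's layout info ...
        int off = headerSize + row * elementBytes;
        long raw = 0;
        for (int i = 0; i < elementBytes; i++)
          raw = (raw << 8) | (mem[off + i] & 0xFF);
        // ... then one multiplication and one addition to decompress.
        return raw * scale + bias;
      }
    }

    // Sparse: sorted row ids, binary search in the worst case, with a
    // remembered position so nearby accesses stay constant time.
    class SparseChunk {
      int[] rows;        // sorted row ids of the non-zeros
      double[] values;   // matching values
      int lastIdx;       // last hit, reused for accesses in the same vicinity

      double at(int row) {
        if (lastIdx < rows.length && rows[lastIdx] == row)
          return values[lastIdx];
        if (lastIdx + 1 < rows.length && rows[lastIdx + 1] == row)
          return values[++lastIdx];
        // Worst case: binary search for the physical offset within the chunk.
        int idx = java.util.Arrays.binarySearch(rows, row);
        if (idx < 0) return 0.0;   // row is not stored, i.e. the sparse zero
        lastIdx = idx;
        return values[idx];
      }
    }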

> Data frame R-like bindings
> --------------------------
>
>                 Key: MAHOUT-1490
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1490
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Saikat Kanjilal
>            Assignee: Dmitriy Lyubimov
>             Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)
