[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005049#comment-14005049
 ] 

Dmitriy Lyubimov commented on MAHOUT-1490:
------------------------------------------

OK. Thank you all, it was helpful.

First, a few clarifications on the questions:

bq. Naive question - Are these "Data frame" bindings really for just 
interactive use case? Or do we expect ML algos to be implemented on top of Data 
frames (instead of just DRM/matrix)?

I don't know -- what are matrices vs. data frames in R? Same here. There are 
algorithms that run on Data Frames. There are algorithms that run on matrices. 

I can tell you what I need data frames for. 

I need them for business-rule data manipulation along the lines of the
dplyr/MLTable APIs, since matrices do not support those.

I also need them to represent feature data such as text or category, since 
matrices do not support anything but real values. 

I need DF for so-called standardization (vectorization) of such features.

I need DF to build hashing-trick vectorization.

I probably need DF for outlier detection.
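
To make the hashing-trick item above concrete, here is a minimal sketch in
plain Python. It is not Mahout code: the function name `hash_features`, the
bucket count, the `name=value` token format, and the choice of CRC32 as the
hash are all assumptions made for illustration. It shows how text and category
fields of one data-frame row can be folded into a fixed-length numeric vector
that a matrix can hold.

```python
import zlib


def hash_features(tokens, num_buckets=16):
    """Map string tokens into a fixed-length count vector (hashing trick)."""
    vec = [0.0] * num_buckets
    for tok in tokens:
        # CRC32 is used only because it is deterministic across runs;
        # any stable hash function would serve (assumption, not Mahout's choice).
        idx = zlib.crc32(tok.encode("utf-8")) % num_buckets
        vec[idx] += 1.0
    return vec


# One hypothetical data-frame row with a category field and a text field.
row = {"color": "red", "text": "big red truck"}

# Prefix each token with its column name so equal strings from different
# columns are hashed as different features.
tokens = ["color=" + row["color"]] + ["text=" + w for w in row["text"].split()]
features = hash_features(tokens)
```

The vector length is fixed up front, so no dictionary of feature names has to
be built or shipped around; the price is that distinct tokens can collide in
the same bucket.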

bq. It's just plain good Physics as to why the world works that way. <...>
Compression/Decompression ain't gonna matter here; it'

OK, I think we can agree that if we rewrite the entire vector, compression is
not just not going to matter; it is simply extra work on top of what is already
being done. In the business-rules code I wrote in R over the past week for a
new model feed, I found a more than trivial number of places where I replace a
column with a completely new column.

Here is what I think:

(1) We need both compressed and uncompressed representations of dense
beyond-numeric vectors.
(2) We should use compression whenever I/O serialization is involved.
(3) We should use compression whenever a cached checkpoint is created (as this
almost always implies repeated reads).
(4) Otherwise, the default is a lazy compression policy: we don't compress a
result unless a specific API is invoked that explicitly requests compression of
the given columns (except for the cases mentioned above).
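
The four points above could be sketched roughly as follows. This is plain
Python for illustration only, not a proposed Mahout API: the class
`StringColumn`, its method names, and the use of dictionary encoding as the
"compressed" form are all assumptions.

```python
import pickle


class StringColumn:
    """Dense string column held either raw or dictionary-encoded (point 1)."""

    def __init__(self, values):
        self.values = list(values)  # uncompressed representation
        self.dictionary = None      # distinct values, set when compressed
        self.codes = None           # per-row index into the dictionary

    @property
    def is_compressed(self):
        return self.codes is not None

    def _encode(self):
        """Return (dictionary, codes) without changing this column."""
        if self.is_compressed:
            return self.dictionary, self.codes
        dictionary = sorted(set(self.values))
        index = {v: i for i, v in enumerate(dictionary)}
        return dictionary, [index[v] for v in self.values]

    def compress(self):
        """Explicit compression request for this column (point 4)."""
        self.dictionary, self.codes = self._encode()
        self.values = None
        return self

    def to_list(self):
        if self.is_compressed:
            return [self.dictionary[c] for c in self.codes]
        return list(self.values)

    def replace(self, new_values):
        """Whole-column replacement: the result stays uncompressed (point 4)."""
        return StringColumn(new_values)

    def checkpoint(self):
        """A cached checkpoint implies repeated reads, so compress (point 3)."""
        return self.compress()

    def serialize(self):
        """Always write the compressed form when doing I/O (point 2)."""
        return pickle.dumps(self._encode())


col = StringColumn(["a", "b", "a", "a"])
replaced = col.replace(v.upper() for v in col.to_list())  # still uncompressed
was_lazy = not replaced.is_compressed
cached = replaced.checkpoint()                             # now compressed
```

Under this policy a column that is rewritten wholesale never pays for an
encoding pass unless it is cached, serialized, or explicitly asked to compress.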


> Data frame R-like bindings
> --------------------------
>
>                 Key: MAHOUT-1490
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1490
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Saikat Kanjilal
>            Assignee: Dmitriy Lyubimov
>             Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)
