[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004783#comment-14004783
 ] 

Cliff Click commented on MAHOUT-1490:
-------------------------------------

I didn't see any obvious errors in your responses.

These complaints about inflation/deflation on random access are "true", but 
the worries are generally groundless.

If you are doing truly random access, then performance of ALL algos is gonna 
suck; ALL modern hardware, and certainly every x86, is heavily optimized around 
re-use in space and time, especially the obvious linear-access case.  It's just 
plain good Physics as to why the world works that way.  There's an easy 10x to 
100x or better to be had from going in a straight line over all the data vs 
randomly popping about.  Compression/Decompression ain't gonna matter here; 
it's all about Physics and trading off latency vs bandwidth.

I think physics is gonna dictate that random-access algos are gonna lose out 
to bulk algos, just because you can get so much more work done in the same 
period of time.  Perhaps there's a middle ground where, at the cost of 1 
random access, you grab the 100 nearby neighbors and do work with 100 
elements instead of 1.
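
A rough Python sketch of that middle ground, assuming a flat in-memory list.
The helper names `neighborhood` and `sum_blocks` are made up for illustration,
and the block size of 100 is just the figure from the paragraph above:

```python
import random

BLOCK = 100  # neighbors grabbed per random access (figure from the text above)


def neighborhood(data, i, block=BLOCK):
    """Return the aligned block of up to `block` contiguous elements holding index i."""
    start = (i // block) * block
    return data[start:start + block]


def sum_blocks(data, picks, block=BLOCK):
    """Pay one random access per pick, then do linear work on the whole block."""
    total = 0
    for i in picks:
        total += sum(neighborhood(data, i, block))  # straight-line scan
    return total


if __name__ == "__main__":
    data = list(range(1_000_000))
    rng = random.Random(0)
    picks = [rng.randrange(len(data)) for _ in range(1_000)]
    print(sum_blocks(data, picks))
```

Each pick still costs one jump, but the work done per jump is 100 elements
read in a straight line rather than 1.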

The inflate/deflate cycle only kicks in if you're dramatically changing the 
"shape" of the data.  Nearly always that isn't the case. 
Example: hacking tree or array-indices; the indices are always small integers 
even as they change constantly.  They compress handily back into any of the 
small-integer formats. 
Example: hacking regression values in an iterative algo; the predictors are 
always "floats" or "doubles", and handily get stored back into the standard 
float/double "not really compressed" formats.
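
To make the "shape" point concrete, here is a toy Python sketch.  It is NOT
the actual compression scheme; `SmallIntChunk`, `narrowest_format` and the
three formats are assumptions made up for illustration, and the chunk is
assumed to be non-empty.  Writes that stay inside the current format's range
are plain stores; only a write outside that range forces an inflate:

```python
from array import array

# Signed integer formats, narrowest first: (array typecode, min, max).
FORMATS = [
    ("b", -2**7, 2**7 - 1),    # 1-byte values
    ("h", -2**15, 2**15 - 1),  # 2-byte values
    ("q", -2**63, 2**63 - 1),  # 8-byte values
]


def narrowest_format(lo, hi):
    """Pick the narrowest typecode whose range covers [lo, hi]."""
    for code, fmin, fmax in FORMATS:
        if fmin <= lo and hi <= fmax:
            return code
    raise OverflowError("value range does not fit in 64 bits")


class SmallIntChunk:
    """Hypothetical chunk that keeps its values in the narrowest fitting format."""

    def __init__(self, values):
        values = list(values)  # assumed non-empty
        self.code = narrowest_format(min(values), max(values))
        self.data = array(self.code, values)
        self.inflations = 0  # number of writes that forced a wider format

    def set(self, i, v):
        _, fmin, fmax = next(f for f in FORMATS if f[0] == self.code)
        if fmin <= v <= fmax:
            self.data[i] = v  # same shape: cheap in-place store
            return
        # The write changes the shape of the data: rebuild in a wider format.
        values = list(self.data)
        values[i] = v
        self.code = narrowest_format(min(values), max(values))
        self.data = array(self.code, values)
        self.inflations += 1


if __name__ == "__main__":
    chunk = SmallIntChunk([3, 1, 4, 1, 5])
    for k in range(1000):
        chunk.set(k % 5, k % 100)  # indices keep changing but stay small
    print(chunk.code, chunk.inflations)
```

A thousand updates that stay small never leave the 1-byte format; a single
write of a large value is what triggers the rebuild.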

What IS true, and expensive, is the open/close cycle we use around Chunks to 
track when changes happen (visibility of changes & coherence around the 
cluster).  Random reads don't pay this, but random writes do.  Normally this 
cost is amortized over visiting the entire Chunk, but it's very real if you are 
reading only a few elements and writing at least once.
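
A toy Python model of that amortization, for illustration only.
`TrackedChunk`, `bulk_update` and `scattered_update` are hypothetical names,
not the real API, and closing after every scattered write is an assumed worst
case.  The counter stands in for the expensive visibility/coherence step:

```python
class TrackedChunk:
    """Hypothetical chunk wrapper that counts expensive close cycles."""

    def __init__(self, values):
        self.values = list(values)
        self.closes = 0      # how many times changes had to be published
        self._dirty = False

    def read(self, i):
        return self.values[i]  # reads never mark the chunk dirty

    def write(self, i, v):
        self.values[i] = v
        self._dirty = True

    def close(self):
        # Stand-in for publishing changes around the cluster; only paid
        # if something was actually written since the last close.
        if self._dirty:
            self.closes += 1
            self._dirty = False


def bulk_update(chunk):
    """Visit every element, then close once: cost amortized over the chunk."""
    for i in range(len(chunk.values)):
        chunk.write(i, chunk.read(i) + 1)
    chunk.close()


def scattered_update(chunk, indices):
    """Touch a few elements, closing after each write (assumed worst case)."""
    for i in indices:
        chunk.write(i, chunk.read(i) + 1)
        chunk.close()


if __name__ == "__main__":
    a = TrackedChunk(range(1000))
    bulk_update(a)
    b = TrackedChunk(range(1000))
    scattered_update(b, [7, 512, 903])
    print(a.closes / 1000, b.closes / 3)  # closes paid per element written
```

In this model the bulk pass pays one close for 1000 writes, while the
scattered pass pays one close per write.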

Cliff



> Data frame R-like bindings
> --------------------------
>
>                 Key: MAHOUT-1490
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1490
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Saikat Kanjilal
>            Assignee: Dmitriy Lyubimov
>             Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)
