[
https://issues.apache.org/jira/browse/MAHOUT-376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966333#action_12966333
]
Dmitriy Lyubimov commented on MAHOUT-376:
-----------------------------------------
{quote}
Sounds to me like the reducer could replicate the combiner and thus implement
the second step of your hierarchy which would avoid the second MR pass. You
could have a single reducer which receives all combiner outputs and thus merge
everything.
{quote}
First of all, second level hierarchical merge is done in Bt job mappers -- so
moving this code into reducer wouldn't actually really win anything in terms of
actual IO or # of MR passes.
Secondly, I don't beleive single reducer pass is efficient. All
parallelization is gone, isn't it? But the problem is not even that but
limitation on how many Rs you can preload for the merges into same process. Rs
are preloaded as a side info (and we relinquish one R with every Q we process
in combiner, so initially it loads all but then throws them away one by one.
The catch is that each Q update has to complete the merges of the reminder of
the R sequence. That's described in computeQHatSequence algorithm in my notes.
Finally, there's no need to sort. When we do R merge, as it is shown in the
notes, we have to revisit Q blocks (and Rs) exactly in the same order we just
produced them in. Problem is that when we revisit a Q block, we have to have
the tail of all subsequent R blocks handy. So even if we sent them to reducers
(which was my initial plan), we'd have to sort them by task id they came from
and then by their order inside the task that produced them. And then we might
need to duplicate R traffic to every reducer process. Huge waste, again.
{quote}
Since you can't guarantee that the combiner does any work, this is best
practice anyway. The specification is that the combiner will run zero or more
times.
{quote}
That actually is absolutely valid and i was very concerned about it in the
beginning. I expected it but it did not turn up in tests so far, so that issue
was slowly glimmering in my subconsciousness ever since.
If that actually is a problem then yes I might be forced to extend Q pass to
include reducer. Which IMO would be a humongous exercise in IO bandwidth waste.
there's no need for that. there's only need a for local merge of R sequences
and doing second pass over q block sequence you just generated, in the same
order you generated it. Even a combiner is an overshoot for that, if i were
implementing it in a regular parallel batching way, since there's no need to
reorder q blocks.
I also considered that If i am forced to address that i can still do it in the
mapper only without even a combiner by pushing Q blocks to local FS side files
and doing a second pass over them and that'd be much more efficient than
pushing them to another networked process. Not mentioning all the unneeded
sorting involved in between and all the armtwisting i was struggling with
involved in sending them in same task chunks, same order. There's still
rudimentary remainder of that attempt in the code (the key that carries task id
and order id for Q block that are never subsequently used anywhere). But saving
local side file in mapper and do second pass over it is still way more
attractive than reducer approach if i need to fix this.
{quote}
This also raises the question of whether your combiner can be applied multiple
times. I suspect yes. You will know better than I.
{quote}
Yes, that's the optional hierarchy increment that i was talking about, that we
can additionally implement if billion for m and unbound n is not good enough.
> Implement Map-reduce version of stochastic SVD
> ----------------------------------------------
>
> Key: MAHOUT-376
> URL: https://issues.apache.org/jira/browse/MAHOUT-376
> Project: Mahout
> Issue Type: Improvement
> Components: Math
> Reporter: Ted Dunning
> Assignee: Ted Dunning
> Fix For: 0.5
>
> Attachments: MAHOUT-376.patch, Modified stochastic svd algorithm for
> mapreduce.pdf, QR decomposition for Map.pdf, QR decomposition for Map.pdf, QR
> decomposition for Map.pdf, sd-bib.bib, sd.pdf, sd.pdf, sd.pdf, sd.pdf,
> sd.tex, sd.tex, sd.tex, sd.tex, SSVD working notes.pdf, SSVD working
> notes.pdf, SSVD working notes.pdf, ssvd-CDH3-or-0.21.patch.gz,
> ssvd-CDH3-or-0.21.patch.gz, ssvd-m1.patch.gz, ssvd-m2.patch.gz,
> ssvd-m3.patch.gz, Stochastic SVD using eigensolver trick.pdf
>
>
> See attached pdf for outline of proposed method.
> All comments are welcome.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.