[ https://issues.apache.org/jira/browse/MAHOUT-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157669#comment-13157669 ]
Lance Norskog commented on MAHOUT-884:
--------------------------------------
Map-reduce does not handle this well. There are two ways to implement this in
Hadoop:
# Null Mapper -> Reducer<IntWritable,VectorWritable>
** The Reducer loads iterators for all VectorWritables, then walks forward
monotonically through all iterators.
# Mapper -> Partitioner<1 Reducer per row> -> (Reducer<IntWritable index,
DoubleWritable value>)
** Reducer setup/teardown creates an output VectorWritable. Each reduce()
call receives one vector index and one or more values.
The first requires loading the contents of row X from every matrix into
memory simultaneously. ConcatenateMatrices already has this problem, though it
does not copy the vectors over the network. The second is a "map-increase"
algorithm: it creates a separate key/value pair for each value in the output
matrix. Neither approach scales.
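To make the "map-increase" cost concrete, here is a minimal sketch in plain
Python rather than Hadoop. It assumes a horizontal concatenation in which the
second matrix's column indices are shifted by the first matrix's column count;
map_row, reduce_rows, and the dict-based sparse rows are hypothetical
stand-ins for the Mapper, Reducer, and VectorWritable, not code from the patch.

```python
from collections import defaultdict

def map_row(row, vector, col_offset):
    """Hypothetical mapper: emit one ((row, column), value) record per stored value."""
    for col, value in sorted(vector.items()):
        yield (row, col + col_offset), value

def reduce_rows(records):
    """Hypothetical reducer: regroup the records by row into output vectors."""
    out = defaultdict(dict)
    for (row, col), value in records:
        out[row][col] = value
    return dict(out)

# Two tiny sparse matrices stored as {row: {column: value}}.
# a_cols is the assumed column count of matrix A.
a = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
b = {0: {0: 4.0}, 2: {1: 5.0}}
a_cols = 3

records = []
for row, vec in a.items():
    records.extend(map_row(row, vec, 0))
for row, vec in b.items():
    records.extend(map_row(row, vec, a_cols))

# Every stored value becomes its own shuffle record.
result = reduce_rows(records)
```

The number of shuffled records equals the number of stored values in the
output matrix, not the number of rows, which is where the blow-up comes from.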
The only way to do this is to precondition the input matrices into one file
with ordered rows, and use the above single-threaded concatenator. If you want
multiple files, you can partition the matrices into matching sets of rows:
part-r-00000 holds rows 0-499, part-r-00001 holds rows 500-999, and so on. You
then run ConcatenateMatrices on each matching pair of part files.
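A minimal sketch of that ordered walk, again in plain Python and under stated
assumptions: rows arrive sorted by row number, either input may skip rows,
vectors are dicts of column to value, and the second matrix's columns are
shifted by a_cols. The names concatenate_rows and part_name are hypothetical;
the actual ConcatenateMatrices in the patch may differ.

```python
def concatenate_rows(a_rows, b_rows, a_cols):
    """Merge two row-ordered streams of (row, {col: value}) pairs.

    Only one pending row per input is held in memory at a time.
    """
    a_iter, b_iter = iter(a_rows), iter(b_rows)
    a_cur, b_cur = next(a_iter, None), next(b_iter, None)
    while a_cur is not None or b_cur is not None:
        # Lowest row number still pending on either input.
        row = min(cur[0] for cur in (a_cur, b_cur) if cur is not None)
        merged = {}
        if a_cur is not None and a_cur[0] == row:
            merged.update(a_cur[1])
            a_cur = next(a_iter, None)
        if b_cur is not None and b_cur[0] == row:
            # Shift B's columns past the end of A's columns.
            merged.update({c + a_cols: v for c, v in b_cur[1].items()})
            b_cur = next(b_iter, None)
        yield row, merged

def part_name(row, rows_per_part=500):
    """Name of the part file that would hold this row, assuming 500 rows per part."""
    return f"part-r-{row // rows_per_part:05d}"
```

Because both inputs only move forward, the walk needs a single pass and
constant memory per input, and each matching pair of part files can be
processed independently.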
> Matrix Concatenate utility
> --------------------------
>
> Key: MAHOUT-884
> URL: https://issues.apache.org/jira/browse/MAHOUT-884
> Project: Mahout
> Issue Type: New Feature
> Components: Integration
> Reporter: Lance Norskog
> Priority: Minor
> Attachments: MAHOUT-884.patch, MAHOUT-884.patch
>
>
> Utility to concatenate matrices stored as SequenceFiles of vectors.
> Each pair in the SequenceFile is the IntWritable row number and a
> VectorWritable.
> The input and output files may skip rows.