[ 
https://issues.apache.org/jira/browse/MAHOUT-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157669#comment-13157669
 ] 

Lance Norskog commented on MAHOUT-884:
--------------------------------------

MapReduce does not handle this well. There are two ways to implement it in 
Hadoop:
# Null Mapper -> Reducer<IntWritable,VectorWritable>
** The Reducer loads iterators for all VectorWritables, then walks forward 
monotonically through all of the iterators.
# Mapper -> Partitioner<1 Reducer per row> -> (Reducer<IntWritable index, 
DoubleWritable value>)
** Reducer setup/teardown creates an output VectorWritable; each reduce() 
call receives one vector index and one or more values.

The first requires loading the contents of row X from every matrix into 
memory simultaneously. ConcatenateMatrices already has this problem, and it 
does so without copying the vectors over the network. The second is a 
"map-increase" algorithm: it creates a separate key/value pair for each value 
in the output matrix. Neither of these scales.
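
To make the "walk forward monotonically through all iterators" step concrete, 
here is a minimal sketch in plain Java. It is not the code in the attached 
patch. The class name RowConcatSketch, the use of Map.Entry<Integer,double[]> 
in place of IntWritable/VectorWritable, column-wise concatenation, and 
zero-padding a row that is missing from one input are all assumptions made for 
illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: merges two row-sorted matrix streams side by side. */
final class RowConcatSketch {

    /**
     * Each input yields (row index, dense row) pairs in strictly ascending
     * row order and may skip rows. A row absent from both inputs is skipped
     * in the output; a row absent from one input is zero-padded (assumption).
     */
    static List<Map.Entry<Integer, double[]>> concat(
            Iterator<Map.Entry<Integer, double[]>> left, int leftCols,
            Iterator<Map.Entry<Integer, double[]>> right, int rightCols) {
        List<Map.Entry<Integer, double[]>> out = new ArrayList<>();
        Map.Entry<Integer, double[]> l = left.hasNext() ? left.next() : null;
        Map.Entry<Integer, double[]> r = right.hasNext() ? right.next() : null;

        while (l != null || r != null) {
            // The next output row is the smallest row index still pending.
            int row;
            if (l == null) {
                row = r.getKey();
            } else if (r == null) {
                row = l.getKey();
            } else {
                row = Math.min(l.getKey(), r.getKey());
            }

            // New arrays are zero-filled, which provides the padding.
            double[] joined = new double[leftCols + rightCols];
            if (l != null && l.getKey() == row) {
                System.arraycopy(l.getValue(), 0, joined, 0, leftCols);
                l = left.hasNext() ? left.next() : null;
            }
            if (r != null && r.getKey() == row) {
                System.arraycopy(r.getValue(), 0, joined, leftCols, rightCols);
                r = right.hasNext() ? right.next() : null;
            }
            out.add(new SimpleEntry<>(row, joined));
        }
        return out;
    }
}
```

Only one pending row per input is held at a time, but that still means one 
full row from every matrix is in memory at once, which is the cost described 
above.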

The only way to do this is to precondition each input matrix into a single 
file with ordered rows, and use the above single-threaded concatenator. If you 
want multiple files, you can partition the matrices into matching sets of 
rows: part-r-00000 holds rows 0-499, part-r-00001 holds rows 500-999, and so 
on. You then run ConcatenateMatrices on each matching pair of part files.
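
The row-range partitioning can be sketched as a small helper. This is an 
illustration under assumptions, not part of the patch: the class and method 
names are hypothetical, and the 500-row range size is taken from the example 
above. In a real job the same arithmetic would presumably sit inside a custom 
Hadoop Partitioner's getPartition() method, applied identically to every input 
matrix so that matching rows land in matching part files.

```java
/** Hypothetical sketch: maps a row index to a fixed-size row-range partition. */
final class RowRangePartitionSketch {

    /** Returns the partition holding the given row, e.g. rows 0-499 -> 0. */
    static int partitionFor(int row, int rowsPerPart) {
        if (row < 0 || rowsPerPart <= 0) {
            throw new IllegalArgumentException("row must be >= 0 and rowsPerPart > 0");
        }
        return row / rowsPerPart; // integer division floors for non-negative rows
    }

    /** Formats the partition number in the part-r-NNNNN style used above. */
    static String partFileName(int partition) {
        return String.format("part-r-%05d", partition);
    }
}
```

For example, partFileName(partitionFor(500, 500)) gives "part-r-00001", so row 
500 of each matrix ends up in the second part file of its own output.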



                
> Matrix Concatenate utility
> --------------------------
>
>                 Key: MAHOUT-884
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-884
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Integration
>            Reporter: Lance Norskog
>            Priority: Minor
>         Attachments: MAHOUT-884.patch, MAHOUT-884.patch
>
>
> Utility to concatenate matrices stored as SequenceFiles of vectors.
> Each pair in the SequenceFile is the IntWritable row number and a 
> VectorWritable.
> The input and output files may skip rows. 
