-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3092/
-----------------------------------------------------------

(Updated 2011-12-23 23:14:34.869723)


Review request for mahout, Ted Dunning, lancenorskog, and Grant Ingersoll.


Changes
-------

Replaced IntWritable with WritableComparable so that any key class can be used.
Added instantiation of a Configuration so that the tests pass when
SplitInputJob is invoked from within code.


Summary
-------

Early support for randomizing input in the SplitInput class. This is an early
draft, but I've posted it to check whether I'm on the right track. A couple of
comments:

  - Currently the code scans the file from the start to find the line
corresponding to each random index. This scan is repeated for every line, so
the cost grows quadratically with the number of lines, which is slow and
somewhat ugly.
  - The permutation indices are stored in an array, so memory use grows
linearly with the number of input lines. This could lead to scaling issues if
the input is large. The same problem may also exist with ridx in the existing
code. One option is to use a linear feedback shift register to generate a
permutation sequence on the fly.
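
To make the shift register idea concrete, here is a rough sketch of what I have
in mind. It is not part of the patch: the class name LfsrPermutation and its
methods are made up for illustration, and the feedback masks are copied from
published maximal-length tap tables, so they should be re-checked before anyone
relies on them. The register steps through every nonzero state of a w-bit
Galois LFSR exactly once per cycle; states larger than n are skipped, so each
index in 0..n-1 comes out exactly once while only a few ints are kept in memory.

```java
import java.util.NoSuchElementException;

/** Hypothetical helper (not in Mahout): visits 0..n-1 once each in O(1) memory. */
final class LfsrPermutation {
    // Galois LFSR feedback masks for register widths 2..24 (index = width - 2).
    // Assumption: taken from published maximal-length tap tables; verify before use.
    private static final int[] MASKS = {
        0x3, 0x6, 0xC, 0x14, 0x30, 0x60, 0xB8, 0x110, 0x240, 0x500,
        0x829, 0x100D, 0x2015, 0x6000, 0xD008, 0x12000, 0x20400, 0x40023,
        0x90000, 0x140000, 0x300000, 0x420000, 0xE10000
    };

    private final int n;      // number of indices to emit
    private final int mask;   // feedback mask for the chosen width
    private int state;        // current register state, always nonzero
    private int emitted;      // how many indices have been returned so far

    LfsrPermutation(int n, int seed) {
        if (n < 1 || n >= (1 << 24)) {
            throw new IllegalArgumentException("n must be in [1, 2^24 - 1]");
        }
        // Smallest width whose cycle length (2^width - 1) covers n states.
        int width = 2;
        while ((1 << width) - 1 < n) {
            width++;
        }
        this.n = n;
        this.mask = MASKS[width - 2];
        int period = (1 << width) - 1;
        // Any nonzero start state is valid; the seed only picks where the cycle begins.
        this.state = 1 + Math.floorMod(seed, period);
    }

    boolean hasNext() {
        return emitted < n;
    }

    int next() {
        if (emitted >= n) {
            throw new NoSuchElementException();
        }
        // Step the register until it lands on a state that maps to a valid index.
        do {
            int lsb = state & 1;
            state >>>= 1;
            if (lsb != 0) {
                state ^= mask;
            }
        } while (state > n);
        emitted++;
        return state - 1;
    }

    public static void main(String[] args) {
        LfsrPermutation p = new LfsrPermutation(10, 12345);
        StringBuilder sb = new StringBuilder();
        while (p.hasNext()) {
            sb.append(p.next()).append(' ');
        }
        System.out.println(sb.toString().trim());
    }
}
```

One caveat: the visiting order is a fixed cycle and the seed only changes the
starting point, so this is much weaker than a true random shuffle. It may still
be good enough for splitting input into training and test sets, but that is a
judgment call for reviewers.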

Any suggestions would be very welcome!


This addresses bug MAHOUT-904.
    https://issues.apache.org/jira/browse/MAHOUT-904


Diffs (updated)
-----

  /trunk/core/src/main/java/org/apache/mahout/common/AbstractJob.java 1221886
  /trunk/examples/bin/asf-email-examples.sh 1221886
  /trunk/integration/src/main/java/org/apache/mahout/utils/SplitInput.java 1221886
  /trunk/integration/src/main/java/org/apache/mahout/utils/SplitInputJob.java PRE-CREATION
  /trunk/integration/src/test/java/org/apache/mahout/utils/SplitInputTest.java 1221886

Diff: https://reviews.apache.org/r/3092/diff


Testing
-------


Thanks,

Raphael
