[ 
https://issues.apache.org/jira/browse/MAHOUT-694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036981#comment-13036981
 ] 

Grant Ingersoll commented on MAHOUT-694:
----------------------------------------

Here's the 0.4 code:
{code}
cd examples/bin/
mkdir -p work
if [ ! -e work/reuters-out ]; then
  if [ ! -e work/reuters-sgm ]; then
    if [ ! -f work/reuters21578.tar.gz ]; then
      echo "Downloading Reuters-21578"
      curl http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz  
-o work/reuters21578.tar.gz
    fi
    mkdir -p work/reuters-sgm
    echo "Extracting..."
    cd work/reuters-sgm && tar xzf ../reuters21578.tar.gz && cd .. && cd ..
  fi
fi

cd ../..
./bin/mahout org.apache.lucene.benchmark.utils.ExtractReuters 
./examples/bin/work/reuters-sgm/ ./examples/bin/work/reuters-out/
./bin/mahout seqdirectory -i ./examples/bin/work/reuters-out/ -o 
./examples/bin/work/reuters-out-seqdir -c UTF-8 -chunk 5

# to use k-Means clustering, uncomment the next three lines
./bin/mahout seq2sparse -i ./examples/bin/work/reuters-out-seqdir/ -o 
./examples/bin/work/reuters-out-seqdir-sparse
./bin/mahout kmeans -i 
./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ -c 
./examples/bin/work/clusters -o ./examples/bin/work/reuters-kmeans -x 10 -k 20 
-ow
./bin/mahout clusterdump -s examples/bin/work/reuters-kmeans/clusters-10 -d 
examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 -dt sequencefile 
-b 100 -n 20
{code}

If it worked in 0.4 on a cluster, wouldn't that mean that seqdirectory 
(SEquenceFilesFromDirectory) read them in from the local filesystem and output 
it's results to HDFS?  Is that doable in Hadoop, since we don't even pass in 
URI's when calling this?  If so, what changed for us?  I see there was some 
refactoring in this area between 0.4 and now, but I certainly don't think this 
is simply caused by a missing trailing slash.  I don't have handy access to a 
cluster at the moment.  

> IndexOutOfBoundException using build-reuters.sh
> -----------------------------------------------
>
>                 Key: MAHOUT-694
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-694
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.5
>         Environment: Linux Debian Lenny
> Hadoop 0.20 (Cloudera)
>            Reporter: Allan BLANCHARD
>            Assignee: Drew Farris
>             Fix For: 0.5
>
>         Attachments: MAHOUT-694.patch
>
>
> I run Hadoop-0.20 on distributed mode on 10 VMs (NameNode + JobTracker + 8 
> DataNodes/TaskTrackers) with Mahout trunk.
> I tried to test kmeans example with build-reuters.sh but I got an 
> IndexOutOfBoundException when it starts kmeans.
> I don't know which operation fails ... ExtractReuters, seqdirectory, 
> seq2sparse or kmeans. Maybe I forgot a configuration ? I searched on the web 
> and didn't find solutions ... 
> ------------------------ UPDATE == 05/16 -------------------------
> NameNode:/usr/local/mahout/trunk/examples/bin# ./build-reuters.sh 
> Please select a number to choose the corresponding clustering algorithm
> 1. kmeans clustering
> 2. lda clustering
> Enter your choice : 1
> ok. You chose 1 and we'll use kmeans Clustering
> ./build-reuters.sh: line 39: cd: examples/bin/: No such file or directory
> Downloading Reuters-21578
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  
> Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100 7959k  100 7959k    0     0   121k      0  0:01:05  0:01:05 --:--:--   99k
> Extracting...
> Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
> No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf 
> 11/05/16 09:31:20 WARN driver.MahoutDriver: No 
> org.apache.lucene.benchmark.utils.ExtractReuters.props found on classpath, 
> will use command-line arguments only
> Deleting all files in ./examples/bin/work/reuters-out/-tmp
> 11/05/16 09:31:24 INFO driver.MahoutDriver: Program took 3471 ms
> Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
> No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf 
> 11/05/16 09:31:26 INFO common.AbstractJob: Command line arguments: 
> {--charset=UTF-8, --chunkSize=5, --endPhase=2147483647, 
> --fileFilterClass=org.apache.mahout.text.PrefixAdditionFilter, 
> --input=./examples/bin/work/reuters-out/, --keyPrefix=, 
> --output=./examples/bin/work/reuters-out-seqdir, --startPhase=0, 
> --tempDir=temp}
> 11/05/16 09:31:26 INFO driver.MahoutDriver: Program took 398 ms
> Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
> No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf 
> 11/05/16 09:31:28 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum 
> n-gram size is: 1
> 11/05/16 09:31:28 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR 
> value: 1.0
> 11/05/16 09:31:28 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of 
> reduce tasks: 1
> 11/05/16 09:31:29 INFO input.FileInputFormat: Total input paths to process : 1
> 11/05/16 09:31:29 INFO mapred.JobClient: Running job: job_201105160929_0001
> 11/05/16 09:31:30 INFO mapred.JobClient:  map 0% reduce 0%
> 11/05/16 09:31:40 INFO mapred.JobClient:  map 100% reduce 0%
> 11/05/16 09:31:42 INFO mapred.JobClient: Job complete: job_201105160929_0001
> [...]
> 11/05/16 09:33:58 INFO common.HadoopUtil: Deleting 
> examples/bin/work/reuters-out-seqdir-sparse/partial-vectors-0
> 11/05/16 09:33:58 INFO driver.MahoutDriver: Program took 149846 ms
> Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
> No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf 
> 11/05/16 09:34:00 INFO common.AbstractJob: Command line arguments: 
> {--clusters=./examples/bin/work/clusters, --convergenceDelta=0.5, 
> --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
>  --endPhase=2147483647, 
> --input=./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/, 
> --maxIter=10, --method=mapreduce, --numClusters=20, 
> --output=./examples/bin/work/reuters-kmeans, --overwrite=null, 
> --startPhase=0, --tempDir=temp}
> 11/05/16 09:34:00 INFO util.NativeCodeLoader: Loaded the native-hadoop library
> 11/05/16 09:34:00 INFO zlib.ZlibFactory: Successfully loaded & initialized 
> native-zlib library
> 11/05/16 09:34:00 INFO compress.CodecPool: Got brand-new compressor
> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, 
> Size: 0
>       at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>       at java.util.ArrayList.get(ArrayList.java:322)
>       at 
> org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:108)
>       at 
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>       at 
> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at 
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>       at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>       at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> ------------------------------------------------------------------
> EDIT : I just tried this on Mahout 0.4 and it seems to work (I use the same 
> VM configuration). 
> PS : Sorry for my very bad english :(

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to