[ https://issues.apache.org/jira/browse/MAHOUT-694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Drew Farris updated MAHOUT-694:
-------------------------------
    Attachment: MAHOUT-694.patch

Updated build-reuters.sh to be a bit more sane about its working directory, to be sensitive to whether mahout will execute in local or distributed mode, and to reuse the reuters-reuters-sgm, reuters-out and reuters-out-seqdir directories if they already exist. Work is done in a directory called mahout-work, seqdirectory is always run locally, and, in distributed mode, the results are copied up to HDFS. Also updated the output of lda and kmeans so that their directory names do not overlap.

This could do a better job of cleaning up before and after it executes, but it gets the job done.

Please take a look and give this a try. I've run kmeans in local and distributed mode (forcing local by unsetting HADOOP_HOME). I am running lda in distributed mode now, but my cluster is small and it takes a long time to complete.

> IndexOutOfBoundException using build-reuters.sh
> -----------------------------------------------
>
>                 Key: MAHOUT-694
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-694
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.5
>         Environment: Linux Debian Lenny
>                      Hadoop 0.20 (Cloudera)
>            Reporter: Allan BLANCHARD
>            Assignee: Drew Farris
>             Fix For: 0.5
>
>         Attachments: MAHOUT-694.patch, MAHOUT-694.patch, MAHOUT-694.patch
>
>
> I run Hadoop-0.20 in distributed mode on 10 VMs (NameNode + JobTracker + 8 DataNodes/TaskTrackers) with Mahout trunk.
> I tried to test the kmeans example with build-reuters.sh, but I got an IndexOutOfBoundException when it starts kmeans.
> I don't know which operation fails: ExtractReuters, seqdirectory, seq2sparse or kmeans. Maybe I forgot a configuration? I searched on the web and didn't find solutions ...
> ------------------------ UPDATE == 05/16 -------------------------
> NameNode:/usr/local/mahout/trunk/examples/bin# ./build-reuters.sh
> Please select a number to choose the corresponding clustering algorithm
> 1. kmeans clustering
> 2. lda clustering
> Enter your choice : 1
> ok. You chose 1 and we'll use kmeans Clustering
> ./build-reuters.sh: line 39: cd: examples/bin/: No such file or directory
> Downloading Reuters-21578
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100 7959k  100 7959k    0     0   121k      0  0:01:05  0:01:05 --:--:--   99k
> Extracting...
> Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
> No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf
> 11/05/16 09:31:20 WARN driver.MahoutDriver: No org.apache.lucene.benchmark.utils.ExtractReuters.props found on classpath, will use command-line arguments only
> Deleting all files in ./examples/bin/work/reuters-out/-tmp
> 11/05/16 09:31:24 INFO driver.MahoutDriver: Program took 3471 ms
> Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
> No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf
> 11/05/16 09:31:26 INFO common.AbstractJob: Command line arguments: {--charset=UTF-8, --chunkSize=5, --endPhase=2147483647, --fileFilterClass=org.apache.mahout.text.PrefixAdditionFilter, --input=./examples/bin/work/reuters-out/, --keyPrefix=, --output=./examples/bin/work/reuters-out-seqdir, --startPhase=0, --tempDir=temp}
> 11/05/16 09:31:26 INFO driver.MahoutDriver: Program took 398 ms
> Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
> No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf
> 11/05/16 09:31:28 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
> 11/05/16 09:31:28 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
> 11/05/16 09:31:28 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
> 11/05/16 09:31:29 INFO input.FileInputFormat: Total input paths to process : 1
> 11/05/16 09:31:29 INFO mapred.JobClient: Running job: job_201105160929_0001
> 11/05/16 09:31:30 INFO mapred.JobClient: map 0% reduce 0%
> 11/05/16 09:31:40 INFO mapred.JobClient: map 100% reduce 0%
> 11/05/16 09:31:42 INFO mapred.JobClient: Job complete: job_201105160929_0001
> [...]
> 11/05/16 09:33:58 INFO common.HadoopUtil: Deleting examples/bin/work/reuters-out-seqdir-sparse/partial-vectors-0
> 11/05/16 09:33:58 INFO driver.MahoutDriver: Program took 149846 ms
> Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
> No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf
> 11/05/16 09:34:00 INFO common.AbstractJob: Command line arguments: {--clusters=./examples/bin/work/clusters, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/, --maxIter=10, --method=mapreduce, --numClusters=20, --output=./examples/bin/work/reuters-kmeans, --overwrite=null, --startPhase=0, --tempDir=temp}
> 11/05/16 09:34:00 INFO util.NativeCodeLoader: Loaded the native-hadoop library
> 11/05/16 09:34:00 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 11/05/16 09:34:00 INFO compress.CodecPool: Got brand-new compressor
> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>     at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>     at java.util.ArrayList.get(ArrayList.java:322)
>     at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:108)
>     at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> ------------------------------------------------------------------
> EDIT : I just tried this on Mahout 0.4 and it seems to work (I use the same VM configuration).
> PS : Sorry for my very bad english :(

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
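
For readers who want a picture of the behaviour described in the update comment on this issue (a dedicated working directory, local versus distributed mode, reuse of existing directories, and copying results up to HDFS), here is a rough shell sketch. It is not the contents of MAHOUT-694.patch: the function names run_mode, ensure_dir and publish are hypothetical, the assumption that an unset or empty HADOOP_HOME means local mode comes only from the comment's remark about forcing local mode, and the location of the hadoop launcher under $HADOOP_HOME/bin is likewise assumed.

```shell
#!/bin/sh
# Illustrative sketch only; function and variable names are hypothetical
# and are not taken from MAHOUT-694.patch.

# Report the execution mode. Assumption: an unset or empty HADOOP_HOME
# means mahout runs locally, otherwise it runs on the cluster.
run_mode() {
  if [ -n "${HADOOP_HOME:-}" ]; then
    echo distributed
  else
    echo local
  fi
}

# Create a directory only if it is missing, so that an earlier download
# or extraction can be reused on the next run.
ensure_dir() {
  if [ -d "$1" ]; then
    echo "reusing $1"
  else
    mkdir -p "$1" && echo "created $1"
  fi
}

# In distributed mode, copy locally produced output up to HDFS under the
# same relative path. Assumes the launcher is at $HADOOP_HOME/bin/hadoop.
publish() {
  if [ "$(run_mode)" = distributed ]; then
    "$HADOOP_HOME/bin/hadoop" fs -put "$1" "$1"
  fi
}

# Working directory name as mentioned in the comment; overridable here.
WORK_DIR="${WORK_DIR:-mahout-work}"
echo "mode: $(run_mode), work dir: $WORK_DIR"
```

A script built along these lines would call ensure_dir "$WORK_DIR/reuters-out" before extraction and publish "$WORK_DIR/reuters-out-seqdir" after running seqdirectory locally, so that later steps read their input from the filesystem they actually execute against.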