[jira] [Updated] (MAHOUT-694) IndexOutOfBoundException using build-reuters.sh

Drew Farris (JIRA) Sat, 21 May 2011 08:19:29 -0700

     [ https://issues.apache.org/jira/browse/MAHOUT-694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Drew Farris updated MAHOUT-694:
-------------------------------

    Attachment: MAHOUT-694.patch

Updated build-reuters.sh to be more careful about its working directory, to 
detect whether Mahout will execute in local or distributed mode, and to reuse 
the reuters-reuters-sgm, reuters-out and reuters-out-seqdir directories if 
they already exist.

Work is done in a directory called mahout-work. The seqdirectory step is 
always run locally and, in distributed mode, the results are then copied up 
to HDFS.
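
For readers without the patch at hand, the flow described above might look 
roughly like the sketch below. This is not the content of MAHOUT-694.patch: 
the function names detect_mode and prepare_dir are hypothetical, and using 
HADOOP_HOME as the local/distributed switch is an assumption. The upload 
command is only printed, not executed.

```shell
#!/bin/sh
# Hypothetical sketch; names are illustrative, not taken from MAHOUT-694.patch.

# Print "distributed" when HADOOP_HOME is set and non-empty, else "local".
detect_mode() {
  if [ -n "${HADOOP_HOME:-}" ]; then
    echo distributed
  else
    echo local
  fi
}

# Create a directory only if it is missing, and report what happened.
prepare_dir() {
  if [ -d "$1" ]; then
    echo "reusing $1"
  else
    mkdir -p "$1" && echo "created $1"
  fi
}

WORK_DIR="${WORK_DIR:-mahout-work}"
MODE=$(detect_mode)
echo "mode: $MODE"

prepare_dir "$WORK_DIR/reuters-out"
prepare_dir "$WORK_DIR/reuters-out-seqdir"

if [ "$MODE" = distributed ]; then
  # Assumed upload step; a real script would run this instead of printing it.
  echo "would run: hadoop fs -put $WORK_DIR/reuters-out-seqdir $WORK_DIR/reuters-out-seqdir"
fi
```

Running the sketch twice shows the reuse behaviour: the first run reports 
"created", the second "reusing".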

Also updated output of lda and kmeans so that their directory names do not 
overlap.
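
One way to keep the two outputs apart is to derive the output directory from 
the menu choice. The sketch below is an assumption about the approach, not the 
patch itself: output_dir_for is a hypothetical helper, reuters-kmeans appears 
in the log quoted below, and reuters-lda is a made-up name for illustration.

```shell
#!/bin/sh
# Hypothetical sketch; the helper and the reuters-lda name are assumptions.
WORK_DIR="${WORK_DIR:-mahout-work}"

# Map the menu choice (1 = kmeans, 2 = lda) to its own output directory.
output_dir_for() {
  case "$1" in
    1) echo "$WORK_DIR/reuters-kmeans" ;;
    2) echo "$WORK_DIR/reuters-lda" ;;
    *) echo "unknown choice: $1" >&2; return 1 ;;
  esac
}

output_dir_for 1
output_dir_for 2
```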

This could do a better job of cleaning up before/after it executes, but it gets 
the job done.

Please take a look and give this a try. I've run kmeans in local and 
distributed mode (forcing local by unsetting HADOOP_HOME). I am running lda 
in distributed mode now, but my cluster is small, so it takes a long time to 
complete.
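
For anyone who wants to repeat the local run without editing their 
environment, a small wrapper can drop HADOOP_HOME for a single command. This 
is a sketch only; run_local is a hypothetical helper and is not part of the 
patch.

```shell
#!/bin/sh
# Hypothetical helper: run a command with HADOOP_HOME unset, in a subshell,
# so the caller's own environment is left unchanged.
run_local() {
  ( unset HADOOP_HOME; "$@" )
}

# Demonstration: the child process no longer sees HADOOP_HOME.
run_local sh -c 'echo "HADOOP_HOME=${HADOOP_HOME:-<unset>}"'
```

With this in place, `run_local ./build-reuters.sh` would start the script 
without HADOOP_HOME in its environment.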
 

> IndexOutOfBoundException using build-reuters.sh
> -----------------------------------------------
>
>                 Key: MAHOUT-694
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-694
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.5
>         Environment: Linux Debian Lenny
> Hadoop 0.20 (Cloudera)
>            Reporter: Allan BLANCHARD
>            Assignee: Drew Farris
>             Fix For: 0.5
>
>         Attachments: MAHOUT-694.patch, MAHOUT-694.patch, MAHOUT-694.patch
>
>
> I run Hadoop-0.20 in distributed mode on 10 VMs (NameNode + JobTracker + 8 
> DataNodes/TaskTrackers) with Mahout trunk.
> I tried to test the kmeans example with build-reuters.sh, but I got an 
> IndexOutOfBoundsException when it starts kmeans.
> I don't know which operation fails: ExtractReuters, seqdirectory, 
> seq2sparse or kmeans. Maybe I missed a configuration step? I searched the 
> web and didn't find a solution.
> ------------------------ UPDATE == 05/16 -------------------------
> NameNode:/usr/local/mahout/trunk/examples/bin# ./build-reuters.sh 
> Please select a number to choose the corresponding clustering algorithm
> 1. kmeans clustering
> 2. lda clustering
> Enter your choice : 1
> ok. You chose 1 and we'll use kmeans Clustering
> ./build-reuters.sh: line 39: cd: examples/bin/: No such file or directory
> Downloading Reuters-21578
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100 7959k  100 7959k    0     0   121k      0  0:01:05  0:01:05 --:--:--   99k
> Extracting...
> Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
> No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf 
> 11/05/16 09:31:20 WARN driver.MahoutDriver: No 
> org.apache.lucene.benchmark.utils.ExtractReuters.props found on classpath, 
> will use command-line arguments only
> Deleting all files in ./examples/bin/work/reuters-out/-tmp
> 11/05/16 09:31:24 INFO driver.MahoutDriver: Program took 3471 ms
> Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
> No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf 
> 11/05/16 09:31:26 INFO common.AbstractJob: Command line arguments: 
> {--charset=UTF-8, --chunkSize=5, --endPhase=2147483647, 
> --fileFilterClass=org.apache.mahout.text.PrefixAdditionFilter, 
> --input=./examples/bin/work/reuters-out/, --keyPrefix=, 
> --output=./examples/bin/work/reuters-out-seqdir, --startPhase=0, 
> --tempDir=temp}
> 11/05/16 09:31:26 INFO driver.MahoutDriver: Program took 398 ms
> Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
> No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf 
> 11/05/16 09:31:28 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum 
> n-gram size is: 1
> 11/05/16 09:31:28 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR 
> value: 1.0
> 11/05/16 09:31:28 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of 
> reduce tasks: 1
> 11/05/16 09:31:29 INFO input.FileInputFormat: Total input paths to process : 1
> 11/05/16 09:31:29 INFO mapred.JobClient: Running job: job_201105160929_0001
> 11/05/16 09:31:30 INFO mapred.JobClient:  map 0% reduce 0%
> 11/05/16 09:31:40 INFO mapred.JobClient:  map 100% reduce 0%
> 11/05/16 09:31:42 INFO mapred.JobClient: Job complete: job_201105160929_0001
> [...]
> 11/05/16 09:33:58 INFO common.HadoopUtil: Deleting 
> examples/bin/work/reuters-out-seqdir-sparse/partial-vectors-0
> 11/05/16 09:33:58 INFO driver.MahoutDriver: Program took 149846 ms
> Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
> No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf 
> 11/05/16 09:34:00 INFO common.AbstractJob: Command line arguments: 
> {--clusters=./examples/bin/work/clusters, --convergenceDelta=0.5, 
> --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
>  --endPhase=2147483647, 
> --input=./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/, 
> --maxIter=10, --method=mapreduce, --numClusters=20, 
> --output=./examples/bin/work/reuters-kmeans, --overwrite=null, 
> --startPhase=0, --tempDir=temp}
> 11/05/16 09:34:00 INFO util.NativeCodeLoader: Loaded the native-hadoop library
> 11/05/16 09:34:00 INFO zlib.ZlibFactory: Successfully loaded & initialized 
> native-zlib library
> 11/05/16 09:34:00 INFO compress.CodecPool: Got brand-new compressor
> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, 
> Size: 0
>       at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>       at java.util.ArrayList.get(ArrayList.java:322)
>       at 
> org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:108)
>       at 
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>       at 
> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at 
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>       at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>       at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> ------------------------------------------------------------------
> EDIT: I just tried this on Mahout 0.4 and it seems to work (I use the same 
> VM configuration). 
> PS: Sorry for my very bad English :(

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
