It might be worth thinking about since catching for this over a fair bit of code might be catching other unrelated conditions. Is the issue not that phiW may not have the same cardinality as nextGamma? if that's it that's an easy check but maybe that's not it.
I might hijack this to question the existence of getQuick(). It's for situations where the caller "knows" the dimension is in bounds, but perhaps it's too tempting to just call this even when that's not known. Here calling get() would have just resulted in a different and only slightly-better exception. In another implementation it might have silently failed by returning a zero or something though. One thought is to remove getQuick(). It has performance implications. Maybe many uses of getQuick() could be better constructed as iterations, I don't know. But another smaller changed would be to use get() and "assert" the range checks. Unit tests would enable these; production code wouldn't. They're off by default. Is that compelling to anyone On Sun, May 23, 2010 at 11:35 PM, Jeff Eastman <[email protected]> wrote: > Its caused by a getQuick in the bowels of infer(). Adding the test to every > access inside the loop seems more expensive than just catching the exception > in the mapper. > > On 5/23/10 3:29 PM, Sean Owen wrote: >> >> Switching to dev list. >> >> I don't want to belabor a small point, but I'm wondering whether it's >> just better to check whatever array access causes the problem before >> it's accessed? >> >> Meaning... >> >> if (foo>= array.length) { >> throw new IllegalStateException(); >> } >> ... array[foo] ... >> >> instead of >> >> try { >> ... array[foo] ... >> } catch (ArrayIndexOutOfBoundsException aioobe) { >> throw new IllegalStateException(); >> } >> >> On Sun, May 23, 2010 at 11:26 PM, Jeff Eastman >> <[email protected]> wrote: >> >>> >>> I added a try block in the mapper which catches the exception and >>> outputs: >>> >>> java.lang.IllegalStateException: This is probably because the --numWords >>> argument is set too small. >>> It needs to be>= than the number of words (terms actually) in the >>> corpus >>> and can be >>> larger if some storage inefficiency can be tolerated. >>> at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:49) >>> at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:1) >>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) >>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) >>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) >>> at >>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177) >>> Caused by: java.lang.ArrayIndexOutOfBoundsException: 3816 >>> at org.apache.mahout.math.DenseMatrix.getQuick(DenseMatrix.java:77) >>> at >>> >>> org.apache.mahout.clustering.lda.LDAState.logProbWordGivenTopic(LDAState.java:44) >>> at >>> >>> org.apache.mahout.clustering.lda.LDAInference.eStepForWord(LDAInference.java:205) >>> at >>> >>> org.apache.mahout.clustering.lda.LDAInference.infer(LDAInference.java:103) >>> at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:47) >>> ... 5 more >>> >>> I'll commit that for now while we explore a more elegant solution. >>> >>> >>> On 5/23/10 2:45 PM, Sean Owen wrote: >>> >>>> >>>> Even something as simple as checking that bound and throwing >>>> IllegalStateException with a custom message -- yeah I imagine it's >>>> hard to detect this anytime earlier. Just a thought. >>>> >>>> On Sun, May 23, 2010 at 6:29 PM, Jeff Eastman >>>> <[email protected]> wrote: >>>> >>>> >>>>> >>>>> I agree it is not very friendly. Impossible to tell the correct value >>>>> in >>>>> the >>>>> options section processing. It needs to be>= than the actual number of >>>>> unique terms in the corpus and that is hard to anticipate though I >>>>> think >>>>> it >>>>> is known in seq2sparse. If it turns out to be the dictionary size (I'm >>>>> investigating), then it could be computed by adding a dictionary path >>>>> argument instead of the current option. Trouble with that is the >>>>> dictionary >>>>> is not needed for anything else by LDA. >>>>> >>>>> >>>>> >>>> >>>> >>> >>> >> >> > >
