Switching to dev list.

I don't want to belabor a small point, but I'm wondering whether it's
simply better to check the array index in question before it's
accessed?

Meaning...

if (foo >= array.length) {
  throw new IllegalStateException();
}
... array[foo] ...

instead of

try {
  ... array[foo] ...
} catch (ArrayIndexOutOfBoundsException aioobe) {
  throw new IllegalStateException();
}
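To make the contrast concrete, here's a minimal, self-contained sketch of the fail-fast version with a descriptive message along the lines Jeff committed. The class and method names (WordIndexCheck, checkWordIndex) are hypothetical, not Mahout APIs:

```java
// Hypothetical sketch: validate the index up front instead of catching
// ArrayIndexOutOfBoundsException after the fact.
public class WordIndexCheck {

  /**
   * Fails fast with an actionable message if the word index would fall
   * outside the matrix dimension sized by --numWords.
   */
  static void checkWordIndex(int word, int numWords) {
    if (word < 0 || word >= numWords) {
      throw new IllegalStateException(
          "Word index " + word + " is out of range [0, " + numWords + "). "
              + "This is probably because --numWords is set smaller than the "
              + "number of unique terms in the corpus.");
    }
  }

  public static void main(String[] args) {
    checkWordIndex(10, 100); // in range, no exception
    try {
      checkWordIndex(3816, 100); // out of range, fails fast
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

The precondition check avoids the cost and noise of an exception-as-control-flow catch block, and the message tells the user which option to fix rather than surfacing a bare ArrayIndexOutOfBoundsException.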

On Sun, May 23, 2010 at 11:26 PM, Jeff Eastman
<[email protected]> wrote:
> I added a try block in the mapper which catches the exception and outputs:
>
> java.lang.IllegalStateException: This is probably because the --numWords
> argument is set too small.
>    It needs to be >= than the number of words (terms actually) in the corpus
> and can be
>    larger if some storage inefficiency can be tolerated.
>    at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:49)
>    at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:1)
>    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>    at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 3816
>    at org.apache.mahout.math.DenseMatrix.getQuick(DenseMatrix.java:77)
>    at
> org.apache.mahout.clustering.lda.LDAState.logProbWordGivenTopic(LDAState.java:44)
>    at
> org.apache.mahout.clustering.lda.LDAInference.eStepForWord(LDAInference.java:205)
>    at
> org.apache.mahout.clustering.lda.LDAInference.infer(LDAInference.java:103)
>    at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:47)
>    ... 5 more
>
> I'll commit that for now while we explore a more elegant solution.
>
>
> On 5/23/10 2:45 PM, Sean Owen wrote:
>>
>> Even something as simple as checking that bound and throwing
>> IllegalStateException with a custom message -- yeah I imagine it's
>> hard to detect this anytime earlier. Just a thought.
>>
>> On Sun, May 23, 2010 at 6:29 PM, Jeff Eastman
>> <[email protected]>  wrote:
>>
>>>
>>> I agree it is not very friendly. It's impossible to tell the correct
>>> value during options processing. It needs to be >= the actual number
>>> of unique terms in the corpus, and that is hard to anticipate, though
>>> I think it is known in seq2sparse. If it turns out to be the
>>> dictionary size (I'm investigating), then it could be computed by
>>> adding a dictionary path argument instead of the current option. The
>>> trouble with that is the dictionary is not needed for anything else
>>> by LDA.
