HMM Model error
When I am trying to run the sample from http://mahout.apache.org/users/classification/hidden-markov-models.html, the model runs fine. However, when I give a different sequence like the one below, I see the following error:

    echo "0 3 5 8 11 14 17 20 23 26 29 32 35 38 41 44 47 50 53 56 59 62 65 67 70 73 76 79 82 85 88 91 94 97 100 103 106 54 56 57 59 60 62 63 65" > hmm-input
    mahout baumwelch -i hmm-input -o hmm-model -nh 3 -no 4 -e .0001 -m 1000

    Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 5
        at org.apache.mahout.math.DenseMatrix.getQuick(DenseMatrix.java:78)
        at org.apache.mahout.classifier.sequencelearning.hmm.HmmAlgorithms.forwardAlgorithm(HmmAlgorithms.java:85)
        at org.apache.mahout.classifier.sequencelearning.hmm.HmmTrainer.trainBaumWelch(HmmTrainer.java:315)
        at org.apache.mahout.classifier.sequencelearning.hmm.BaumWelchTrainer.main(BaumWelchTrainer.java:116)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:483)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)

Kindly suggest how I can get rid of this error.

Regards,
Raghuveer
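For context on what triggers this exception: the Baum-Welch trainer indexes the emission matrix by the raw observation value and expects the symbols in the input sequence to be encoded as contiguous integers from 0 to no-1. With -no 4, the first value outside 0..3 (here 5) walks off the end of the matrix. The sketch below is illustrative only — the RemapObservations class is hypothetical, not part of Mahout — and shows one way to remap arbitrary values onto that contiguous encoding and derive the value to pass as -no:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class RemapObservations {

    // Remap raw observation values onto contiguous IDs 0..k-1,
    // where k is the number of distinct values in the sequence.
    public static int[] remap(int[] raw) {
        // A sorted set gives a stable value -> ID assignment.
        SortedSet<Integer> distinct = new TreeSet<>();
        for (int v : raw) {
            distinct.add(v);
        }
        Map<Integer, Integer> id = new HashMap<>();
        int next = 0;
        for (int v : distinct) {
            id.put(v, next++);
        }
        int[] out = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = id.get(raw[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] raw = {0, 3, 5, 8, 106, 54, 56};
        System.out.println(Arrays.toString(remap(raw)));
        // prints [0, 1, 2, 3, 6, 4, 5]

        TreeSet<Integer> distinct = new TreeSet<>();
        for (int v : raw) {
            distinct.add(v);
        }
        System.out.println("-no " + distinct.size());
        // prints -no 7
    }
}
```

The alternative is to keep the raw values and pass -no as the maximum value plus one (107 for the sequence above, since it contains 106); remapping instead keeps the emission matrix from carrying columns for symbols that never occur.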
Re: Replacement for DefaultAnalyzer
Hi Suneel,

Just for context, I've implemented the following:

    @Override
    protected void map(Text key, BehemothDocument value, Context context)
        throws IOException, InterruptedException {
      String sContent = value.getText();
      if (sContent == null) {
        // no text available? skip
        context.getCounter("LuceneTokenizer", "BehemothDocWithoutText").increment(1);
        return;
      }
      analyzer = new StandardAnalyzer(matchVersion); // or any other analyzer
      // The Analyzer class will construct the Tokenizer, TokenFilter(s), and
      // CharFilter(s), and pass the resulting Reader to the Tokenizer.
      TokenStream ts = analyzer.tokenStream(key.toString(),
          new StringReader(sContent.toString()));
      @SuppressWarnings("unused")
      OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
      CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
      StringTuple document = new StringTuple();
      try {
        ts.reset(); // Resets this stream to the beginning. (Required)
        while (ts.incrementToken()) {
          if (termAtt.length() > 0) {
            document.add(new String(termAtt.buffer(), 0, termAtt.length()));
          }
        }
        ts.end(); // Perform end-of-stream operations, e.g. set the final offset.
      } finally {
        ts.close(); // Release resources associated with this stream.
      }
      context.write(key, document);
    }

I'll be testing and will update if anything else comes up.

Thanks
Lewis

On Mon, May 11, 2015 at 2:12 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
> I found Mike's blog post regarding Lucene 4.X from a while ago [0]. [...]

-- 
*Lewis*
Re: Replacement for DefaultAnalyzer
I found Mike's blog post regarding Lucene 4.X from a while ago [0]. In the 'Other Changes' section Mike states: "Analyzers must always provide a reusable token stream, by implementing the Analyzer.createComponents method (reusableTokenStream has been removed and tokenStream is now final, in Analyzer)." This provides a good bit more context, therefore I'm going to continue on the createComponents route with the aim of implementing the newer 4.X Lucene API. In the meantime, if you get any updates or have a code sample it would be very much appreciated.

Thanks
Lewis

[0] http://blog.mikemccandless.com/2012/07/lucene-400-alpha-at-long-last.html

On Mon, May 11, 2015 at 2:03 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
> Hi Suneel, [...]

-- 
*Lewis*
Re: Replacement for DefaultAnalyzer
Hi Suneel,

On Sat, May 9, 2015 at 11:21 AM, Suneel Marthi <smar...@apache.org> wrote:
> Mahout 0.9 and 0.10.0 are using Lucene 4.6.1. There's been a change in the
> TokenStream workflow in Lucene post-Lucene 4.5.

Yes, I know that after looking into the codebase. Thanks for clarifying!

> What exactly are u trying to do and where is it u r stuck now? It would
> help if u posted a code snippet or something.

In particular, I am working on the following implementation [0], which uses this code:

    TokenStream stream = analyzer.reusableTokenStream(key.toString(),
        new StringReader(sContent.toString()));

Of note here is that the analyzer object is instantiated as of type DefaultAnalyzer [1]. The analyzer.reusableTokenStream API is deprecated, as you've noted, so I am just wondering what the suggested API semantics are in order to achieve the desired upgrade.

Thanks in advance again for any input.
Lewis

[0] https://github.com/DigitalPebble/behemoth/blob/master/mahout/src/main/java/com/digitalpebble/behemoth/mahout/LuceneTokenizerMapper.java#L52-L53
[1] http://svn.apache.org/repos/asf/mahout/tags/mahout-0.7/core/src/main/java/org/apache/mahout/vectorizer/DefaultAnalyzer.java
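Since the thread targets Lucene 4.6.1: the replacement for the removed reusableTokenStream is to subclass Analyzer and override createComponents; the now-final tokenStream method then handles reuse internally. The sketch below is illustrative only — the class name and the StandardTokenizer + LowerCaseFilter chain are assumptions, not the actual chain of Mahout's old DefaultAnalyzer:

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical stand-in for the removed DefaultAnalyzer, written
// against the Lucene 4.6 API.
public class SimpleDefaultAnalyzer extends Analyzer {

    private static final Version MATCH_VERSION = Version.LUCENE_46;

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // The Tokenizer is the source of the chain...
        Tokenizer source = new StandardTokenizer(MATCH_VERSION, reader);
        // ...and any TokenFilters wrap it.
        TokenStream filtered = new LowerCaseFilter(MATCH_VERSION, source);
        return new TokenStreamComponents(source, filtered);
    }
}
```

Callers then use analyzer.tokenStream(field, reader) followed by the usual reset()/incrementToken()/end()/close() loop, as in the mapper code elsewhere in the thread; tokenStream reuses the components per thread, so there is no need for an explicit reusable variant.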
Re: HMM Model error
Can you please tell me how it is 107? I have only 64 elements, and if I remove all the spaces it's 90 elements. Can you kindly explain?

On Monday, May 11, 2015 5:21 PM, Max Heimel <mhei...@gmail.com> wrote:
> Hi Raghuveer,
> the crash was caused because you did not provide the correct number of
> observed states (in your case: 107) to the -no argument of the BaumWelch
> trainer. (The trainer expects that the states in the provided sequence
> are encoded as integers from 0 to nr_states-1.)
> Max

2015-05-11 12:25 GMT+02:00 Raghuveer <alwaysra...@yahoo.com.invalid>:
> When I am trying to run the sample from
> http://mahout.apache.org/users/classification/hidden-markov-models.html
> the model is running fine. However when I give a different sequence
> I see an ArrayIndexOutOfBoundsException [...]
Re: HMM Model error
When I run as you suggest, I got the results below:

    Initial probabilities:
    0 1 2
    NaN NaN NaN

    Transition matrix (3 x 3, states 0-2): all entries NaN

    Emission matrix (3 x 107, output symbols 0-106): all entries NaN

    15/05/12 11:05:17 INFO driver.MahoutDriver: Program took 569 ms (Minutes: 0.009483)

So the final result that I see when I run cat hmm-predictions is:

    0 0 0 0 0 0 0 0 0 0

Is this correct, or is my initial data incorrect?

On Tuesday, May 12, 2015 11:16 AM, Raghuveer <alwaysra...@yahoo.com.INVALID> wrote:
> Can you please tell me how it is 107? I have only 64 elements [...]

On Monday, May 11, 2015 5:21 PM, Max Heimel <mhei...@gmail.com> wrote:
> Hi Raghuveer, the crash was caused because you did not provide the
> correct number of observed states (in your case: 107) to the -no
> argument of the BaumWelch trainer. [...]
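Regarding the all-NaN model: in HMM training, NaNs like these usually mean a normalization constant in the forward/backward computation became zero (i.e. the model assigned probability zero to the training sequence) and was then divided by. That is a generic property of scaled HMM implementations, not a claim about exactly where Mahout's trainer fails here. The sketch below — illustrative, not Mahout code — shows a scaled forward pass that reports the zero-probability case explicitly instead of letting NaN propagate:

```java
public class ScaledForward {

    // Scaled forward algorithm: returns the log-likelihood of obs under
    // the HMM (pi, a, b), or Double.NEGATIVE_INFINITY when the model
    // assigns the sequence probability zero (the case that, if divided
    // through, would turn into NaN).
    public static double logLikelihood(double[] pi, double[][] a, double[][] b, int[] obs) {
        int n = pi.length;
        double[] alpha = new double[n];
        double logLik = 0.0;
        // Initialization: alpha_0(i) = pi_i * b_i(obs_0)
        for (int i = 0; i < n; i++) {
            alpha[i] = pi[i] * b[i][obs[0]];
        }
        for (int t = 0; ; t++) {
            // Normalize alpha at every step to avoid underflow; the
            // accumulated log of the scale factors is the log-likelihood.
            double scale = 0.0;
            for (double v : alpha) {
                scale += v;
            }
            if (scale == 0.0) {
                return Double.NEGATIVE_INFINITY; // sequence impossible under model
            }
            logLik += Math.log(scale);
            for (int i = 0; i < n; i++) {
                alpha[i] /= scale;
            }
            if (t == obs.length - 1) {
                break;
            }
            // Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(obs_{t+1})
            double[] next = new double[n];
            for (int j = 0; j < n; j++) {
                double s = 0.0;
                for (int i = 0; i < n; i++) {
                    s += alpha[i] * a[i][j];
                }
                next[j] = s * b[j][obs[t + 1]];
            }
            alpha = next;
        }
        return logLik;
    }

    public static void main(String[] args) {
        double[] pi = {0.5, 0.5};
        double[][] a = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] b = {{0.9, 0.1}, {0.2, 0.8}};
        // A possible sequence: prints a finite negative log-likelihood.
        System.out.println(logLikelihood(pi, a, b, new int[]{0, 1, 0}));
        // Symbol 1 is unemittable here: prints -Infinity rather than NaN.
        double[][] degenerate = {{1.0, 0.0}, {1.0, 0.0}};
        System.out.println(logLikelihood(pi, a, degenerate, new int[]{0, 1}));
    }
}
```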