Re: Naive bayes and character n-grams

2013-10-10 Thread Suneel Marthi
Dean, just a thought. You should be able to create new language models (with LangDetect) if there's Wikipedia content for the specific language; I had to do this in the past for Pashto and Malaysian. On Thursday, October 10, 2013 8:16 AM, Dean Jones wrote: On 10 October 2013 12:46, Ted Du

Re: Naive bayes and character n-grams

2013-10-10 Thread Ted Dunning
Cool. Sounds like you are ahead of the game. On Oct 10, 2013, at 13:15, Dean Jones wrote: > On 10 October 2013 12:46, Ted Dunning wrote: >> For language detection, you are going to have a hard time doing better than >> one of the standard packages for the purpose. See he

Re: Naive bayes and character n-grams

2013-10-10 Thread Dean Jones
On 10 October 2013 12:46, Ted Dunning wrote: > For language detection, you are going to have a hard time doing better than > one of the standard packages for the purpose. See here: > > http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html > Thanks for the pointer Ted. I

Re: Naive bayes and character n-grams

2013-10-10 Thread Ted Dunning
For language detection, you are going to have a hard time doing better than one of the standard packages for the purpose. See here: http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html On Thu, Oct 10, 2013 at 1:01 AM, Dean Jones wrote: > Hi Si, > > On 10 October 201

Re: Naive bayes and character n-grams

2013-10-10 Thread Dean Jones
Hi Si, On 10 October 2013 07:59, wrote: > > what do you mean by character n-grams? If you mean things like "&ab" or "ui2" then given that there are so few characters compared to words is there a problem that can't be solved without a look-up table for n > Or are you looking at y >4 ish because if

Re: Naive bayes and character n-grams

2013-10-10 Thread Dean Jones
Hi Suneel, On 9 October 2013 14:27, Suneel Marthi wrote: > an example of a Naive-Bayes classifier trained on character n-grams is the > LangDetect library. > (see http://code.google.com/p/language-detection/) > > Agree with Ted that it should be relatively easy to build one. > Thanks. Yes, I ne

RE: Naive bayes and character n-grams

2013-10-10 Thread simon.2.thompson
Hey Dean, what do you mean by character n-grams? If you mean things like "&ab" or "ui2" then given that there are so few characters compared to words is there a problem that can't be solved without a look-up table for n4 ish because if so then do you run into the issue of a sudden space explosi
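To put rough numbers on the space concern raised above, here is a small sketch. The 27-symbol alphabet (a-z plus space) and the helper names are assumptions made for illustration, not something from the thread. It compares the number of possible n-grams with the number actually observed in a text, which is what a sparse table would need to store.

```python
# Assumption: a 27-symbol alphabet (a-z plus space); real text has more symbols.
ALPHABET_SIZE = 27

def possible_ngrams(alphabet_size, n):
    """Upper bound on the number of distinct n-grams over an alphabet."""
    return alphabet_size ** n

def observed_ngrams(text, n):
    """Number of distinct n-grams that actually occur in text."""
    return len({text[i:i + n] for i in range(len(text) - n + 1)})

# Possible n-grams grow exponentially in n...
table = {n: possible_ngrams(ALPHABET_SIZE, n) for n in range(1, 6)}

# ...but a text of length L contains at most L - n + 1 of them,
# so a hash map keyed by observed n-grams stays far smaller than the bound.
sample = "the quick brown fox jumps over the lazy dog"
seen = {n: observed_ngrams(sample, n) for n in range(1, 6)}
```

The dense table size grows by a factor of the alphabet size with each step in n, while the observed count is bounded by the amount of training text.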

Re: Naive bayes and character n-grams

2013-10-09 Thread Suneel Marthi
An example of a Naive Bayes classifier trained on character n-grams is the LangDetect library (see http://code.google.com/p/language-detection/). Agree with Ted that it should be relatively easy to build one. On Wednesday, October 9, 2013 6:40 AM, Ted Dunning wrote: Yes. Should work to
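As a rough illustration of how small such a classifier can be, here is a minimal sketch of multinomial Naive Bayes over character trigrams with add-one smoothing. It is not LangDetect's or Mahout's implementation or API; the class name `CharNgramNB`, its parameters, and the toy training sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n=3):
    """Overlapping character n-grams of a lower-cased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class CharNgramNB:
    """Hypothetical toy classifier: multinomial Naive Bayes on character n-grams."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n
        self.alpha = alpha                    # additive smoothing constant
        self.counts = defaultdict(Counter)    # label -> n-gram counts
        self.doc_counts = Counter()           # label -> number of training texts
        self.vocab = set()                    # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, gram_counts in self.counts.items():
            total = sum(gram_counts.values())
            # log prior plus sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            for g in grams:
                score += math.log((gram_counts[g] + self.alpha)
                                  / (total + self.alpha * vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training data, invented for this sketch.
model = CharNgramNB(n=3).fit(
    ["the quick brown fox jumps over the lazy dog",
     "this is a simple english sentence",
     "der schnelle braune fuchs springt ueber den faulen hund",
     "das ist ein einfacher deutscher satz"],
    ["en", "en", "de", "de"],
)
guess = model.predict("der hund ist faul")
```

Working in log space avoids underflow when multiplying many small probabilities, and the smoothing keeps an unseen n-gram from zeroing out a class.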

Re: Naive bayes and character n-grams

2013-10-09 Thread Jens Bonerz
Hi Dean, I might be wrong, but try googling for "shingling"... could be something to start with. Cheers Jens 2013/10/9 Ted Dunning > Yes. Should work to use character n-grams. There are oddities in the > stats because the different n-grams are not independent, but Naive Bayes > methods are

Re: Naive bayes and character n-grams

2013-10-09 Thread Ted Dunning
Yes. Should work to use character n-grams. There are oddities in the stats because the different n-grams are not independent, but Naive Bayes methods are in such a state of sin that it shouldn't hurt any worse. No... I don't think that there is a capability built in to generate the character n-g
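Since generating character n-grams may have to be done by hand, here is a minimal sketch of the extraction step. The function name `char_ngrams` is hypothetical and not part of any library mentioned in the thread.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of length n from text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Count trigram occurrences in a short string.
grams = char_ngrams("banana", 3)
counts = Counter(grams)
```

Adjacent n-grams share n-1 characters (the first two trigrams of "banana" overlap in "an"), which is the dependence between features noted above.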