That would be best, but practically speaking, randomizing once is usually OK. Still, with a tiny data set like this that is in memory anyway, reshuffling costs nothing, so I wouldn't take any chances.
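For anyone following along, here is a minimal sketch of what "many passes, reshuffled each pass" can look like against Mahout's SGD API, assuming the OnlineLogisticRegression class Ted recommends below. The Example holder class and the hyperparameter values are illustrative, not from this thread:

    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.Vector;

    public class ShuffledSgdTrainer {

      /** Hypothetical holder for one encoded training example. */
      static class Example {
        final int target;       // category index, 0 .. numCategories - 1
        final Vector features;  // encoded title + description text

        Example(int target, Vector features) {
          this.target = target;
          this.features = features;
        }
      }

      static OnlineLogisticRegression train(List<Example> examples,
                                            int numCategories,
                                            int numFeatures) {
        // Fixed seed: the only randomness here is the shuffle, so two runs
        // on the same data produce the same model (and the same confusion
        // matrix). AdaptiveLogisticRegression also randomizes internally,
        // so it can stay nondeterministic even with a seeded shuffle.
        Random rand = new Random(42);

        // Illustrative hyperparameters; tune the learning rate and the
        // amount of regularization as Ted suggests.
        OnlineLogisticRegression learner =
            new OnlineLogisticRegression(numCategories, numFeatures, new L1())
                .learningRate(1)
                .lambda(1e-4);

        // Many passes over the tiny data set, reshuffled before each pass.
        for (int pass = 0; pass < 100; pass++) {
          Collections.shuffle(examples, rand);
          for (Example e : examples) {
            learner.train(e.target, e.features);
          }
        }
        return learner;
      }
    }

Reshuffling before every pass keeps the gradient updates from seeing the examples in one fixed cycle, which is what biases SGD on a small data set.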
On Fri, Aug 31, 2012 at 9:08 PM, Lance Norskog <goks...@gmail.com> wrote:
> "Try passing through the data 100 times for a start."
>
> And randomize the order each time?
>
> On Fri, Aug 31, 2012 at 9:04 AM, Salman Mahmood <sal...@influestor.com> wrote:
> > Cheers, Ted. Appreciate the input!
> >
> > Sent from my iPhone
> >
> > On 31 Aug 2012, at 17:53, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >
> >> OK.
> >>
> >> Try passing through the data 100 times for a start. I think that this
> >> is likely to fix your problems.
> >>
> >> Be warned that AdaptiveLogisticRegression has been misbehaving lately
> >> and may converge faster than it should.
> >>
> >> On Fri, Aug 31, 2012 at 9:33 AM, Salman Mahmood <sal...@influestor.com> wrote:
> >>
> >>> Thanks a lot, Ted. Here are the answers:
> >>>
> >>> d) Data (news articles from different feeds):
> >>>
> >>> News Article 1:
> >>> Title: BP Profits Plunge On Massive Asset Write-down
> >>> Description: BP PLC (BP) Tuesday posted a dramatic fall of 96% in
> >>> adjusted profit for the second quarter as it wrote down the value of
> >>> its assets by $5 billion, including some U.S. refineries, a suspended
> >>> Alaskan oil project, and U.S. shale gas resources.
> >>>
> >>> News Article 2:
> >>> Title: Morgan Stanley Missed Big: Why It's Still A Fantastic Short
> >>> Description: By Mike Williams: Though the market responded very
> >>> positively to Citigroup (C) and Bank of America's (BAC) reserve
> >>> release-driven earnings "beats", last week's Morgan Stanley (MS)
> >>> earnings report illustrated what happens when a bank doesn't have
> >>> billions of reserves to release back into earnings. Estimates called
> >>> for the following: $.43 per share in earnings, $.29 per share in
> >>> earnings ex-DVA (debt value adjustment), and $7.7 billion in revenue.
> >>> GAAP results (including the DVA) came in at $.28 per share, while
> >>> ex-DVA earnings were $.16. Revenue was a particular disappointment,
> >>> coming in at $6.95 billion.
> >>>
> >>> c) As you can see, the data is textual. I am using the title and
> >>> description as predictor variables, and the target variable is the
> >>> company name a news article belongs to.
> >>>
> >>> b) I am passing through the data once (at least this is what I
> >>> think). I followed the 20newsgroups example code (in Java) and didn't
> >>> find that the data was passed more than once.
> >>> Yes, I randomize the order every time.
> >>>
> >>> a) I am using AdaptiveLogisticRegression (just like the 20newsgroups
> >>> example).
> >>>
> >>> Thanks!
> >>>
> >>> On Aug 31, 2012, at 2:27 PM, Ted Dunning wrote:
> >>>
> >>>> First, this is a tiny training set. You are well outside the intended
> >>>> application range, so you are likely to find less experience in the
> >>>> community in that range. That said, the algorithm should still produce
> >>>> reasonably stable results.
> >>>>
> >>>> Here are a few questions:
> >>>>
> >>>> a) Which class are you using to train your model? I would start with
> >>>> OnlineLogisticRegression and experiment with training rate schedules
> >>>> and the amount of regularization to find out how to build a good model.
> >>>>
> >>>> b) How many times are you passing through your data? Do you randomize
> >>>> the order each time? These are critical to proper training. Instead of
> >>>> randomizing the order, you could just sample a data point at random and
> >>>> not worry about using a complete permutation of the data. With such a
> >>>> tiny data set, you will need to pass through the data many times,
> >>>> possibly hundreds of times or more.
> >>>>
> >>>> c) What kind of data do you have? Sparse? Dense? How many variables?
> >>>> What kind?
> >>>>
> >>>> d) Can you post your data?
> >>>>
> >>>> On Fri, Aug 31, 2012 at 5:03 AM, Salman Mahmood <sal...@influestor.com> wrote:
> >>>>
> >>>>> Thanks a lot, Lance. Let me elaborate on the problem in case it was
> >>>>> a bit confusing.
> >>>>>
> >>>>> Assume I am building a binary classifier using SGD. I have 50
> >>>>> positive and 50 negative examples to train the classifier. After
> >>>>> training and testing the model, the confusion matrix tells you the
> >>>>> number of correctly and incorrectly classified instances. Let's
> >>>>> assume I got 85% correct and 15% incorrect instances.
> >>>>>
> >>>>> Now, if I run my program again using the same 50 negative and 50
> >>>>> positive examples, then to my knowledge the classifier should yield
> >>>>> the same results as before (because not a single piece of training
> >>>>> or testing data was changed), but this is not the case. I get
> >>>>> different results on different runs. The confusion matrix figures
> >>>>> change each time I generate a model, even keeping the data constant.
> >>>>> What I do is generate a model several times while watching the
> >>>>> accuracy, and if it is above 90%, I stop running the code, and hence
> >>>>> an accurate model is created.
> >>>>>
> >>>>> So what you are saying is to shuffle my data before I use it for
> >>>>> training and testing?
> >>>>> Thanks!
> >>>>>
> >>>>> On Aug 31, 2012, at 10:33 AM, Lance Norskog wrote:
> >>>>>
> >>>>>> Now I remember: SGD wants its data input in random order. You need
> >>>>>> to permute the order of your data.
> >>>>>>
> >>>>>> If that does not help, another trick: for each data point, randomly
> >>>>>> generate 5 or 10 or 20 points which are close. And again, randomly
> >>>>>> permute the entire input set.
> >>>>>>
> >>>>>> On Thu, Aug 30, 2012 at 5:23 PM, Lance Norskog <goks...@gmail.com> wrote:
> >>>>>>> The more data you have, the closer each run will be. How much data
> >>>>>>> do you have?
> >>>>>>>
> >>>>>>> On Thu, Aug 30, 2012 at 2:49 PM, Salman Mahmood <sal...@influestor.com> wrote:
> >>>>>>>> I have noticed that every time I train and test a model using the
> >>>>>>>> same data (with the SGD algorithm), I get a different confusion
> >>>>>>>> matrix. Meaning, if I generate a model and look at the confusion
> >>>>>>>> matrix, it might say 90% correctly classified instances, but if I
> >>>>>>>> generate the model again (with the SAME data for training and
> >>>>>>>> testing as before) and test it, the confusion matrix changes and
> >>>>>>>> it might say 75% correctly classified instances.
> >>>>>>>>
> >>>>>>>> Is this desired behavior?
> >>>>>>>
> >>>>>>> --
> >>>>>>> Lance Norskog
> >>>>>>> goks...@gmail.com
> >>>>>>
> >>>>>> --
> >>>>>> Lance Norskog
> >>>>>> goks...@gmail.com
>
> --
> Lance Norskog
> goks...@gmail.com
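As an aside on Ted's point (b) above: sampling a data point at random instead of building a complete permutation is an even smaller loop. A sketch under the same assumptions as the earlier one (the hypothetical Example holder and an already configured learner; the step count is illustrative):

    // Sketch: sampling with replacement instead of shuffled passes. Each
    // SGD step draws one uniformly random example, so no permutation of
    // the data set is ever built. Example, learner, and rand are the
    // hypothetical pieces from the sketch near the top of the thread.
    static void trainBySampling(OnlineLogisticRegression learner,
                                List<Example> examples,
                                int steps,
                                Random rand) {
      for (int i = 0; i < steps; i++) {
        Example e = examples.get(rand.nextInt(examples.size()));
        learner.train(e.target, e.features);
      }
    }

With 100 examples, steps = 10000 does roughly the work of 100 shuffled passes; the only difference is that sampling is with replacement, so a given example may be seen a few more or fewer times than in full permutations.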