2013/1/24 O. B. <[email protected]>:
> Sorry, I forgot to mention:
>
> Scikit's Logistic Regression is incredibly fast compared to Weka's. Weka's
> implementation (mostly based on this paper) is slow as well as VERY memory
> intensive. Sometimes even a 3 GB heap was not enough. My dataset is very
> small (none of the words above has more than 100 instances) because I run
> LR word by word.
>
> Is this the case because scikit's LR uses the liblinear library?
>
> Thank you
>
> On Thu, Jan 24, 2013 at 5:25 PM, O. B. <[email protected]> wrote:
>>
>> Hello all,
>>
>> I have a problem with my experiments. I used Logistic Regression (LR)
>> to classify word senses. We have gold tags (the target set) for each word
>> instance.
>>
>> I did 10-fold cross-validation. Some words in my dataset have more than
>> two senses, so I wrapped logistic regression with OneVsRestClassifier.

You don't need to wrap LogisticRegression in a OneVsRestClassifier
object: it already handles multiclass problems internally with a
one-vs-rest (OvR, also called one-vs-all or OvA) scheme, as explained
in the doc:

http://scikit-learn.org/dev/modules/multiclass.html
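
To illustrate the point, here is a minimal sketch of fitting
LogisticRegression directly on labels with more than two classes. The
iris dataset is only a stand-in for a word with three senses, and
max_iter=1000 is an arbitrary choice to help the solver converge. Which
multiclass scheme is used internally (OvR or multinomial) depends on the
scikit-learn version and solver, so treat this as an assumption-laden
example rather than a description of your setup.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Iris has three classes, so it stands in for a word with three senses.
X, y = load_iris(return_X_y=True)

# No OneVsRestClassifier wrapper: the estimator accepts multiclass labels.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

n_classes = len(clf.classes_)
predictions = clf.predict(X[:5])
print(n_classes, predictions)
```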

>> The code is here. Accuracy was not impressive, so I suspected there was
>> an error in my code. So I picked five words to classify using LR on Weka.
>> I used the default settings on Weka

You should never compare scores obtained with a classifier's default
settings. Always grid search the most influential hyperparameters
first. For LogisticRegression, that is the regularization parameter,
which is named 'C' (smaller values mean stronger regularization).

Here is the documentation for grid search:

  http://scikit-learn.org/dev/modules/grid_search.html
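
A grid search over 'C' could look roughly like the sketch below. The
import path (sklearn.model_selection) is the one used by recent
releases; older releases exposed GridSearchCV from a different module.
The dataset, the candidate values for C, and max_iter are illustrative
assumptions, not recommendations for your data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Stand-in data; replace with the feature matrix and sense labels of a word.
X, y = load_iris(return_X_y=True)

# Candidate values for C on a log scale; the range is an illustrative guess.
param_grid = {"C": np.logspace(-3, 3, 7)}

# 10-fold cross-validation, scored by accuracy (correct / total).
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=10,
    scoring="accuracy",
)
search.fit(X, y)

best_C = search.best_params_["C"]
best_score = search.best_score_
print(best_C, best_score)
```

With only about 100 instances per word, the cross-validated scores will
be noisy, so small differences between values of C should not be
over-interpreted.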

>> and these are the results:
>>
>> WORD          Scikit    Weka
>>
>> accommodate   0.3       0.667
>> bow           0.05      0.681818
>> display       0.475     0.70
>> haunt         0.575     0.53
>> owe           0.2533    0.4375
>>
>> These are the (correct_label / total_label) scores. Except for haunt, the
>> scores are not consistent, and scikit's are significantly lower than
>> Weka's. I am not saying scikit has a bug; most likely there is a problem
>> in my code, or Weka does some pre-processing instead of using the raw data
>> directly. Could you explain why there is such a huge difference between
>> the Scikit and Weka scores?
>>
>> Every features have sum to 1 and their values are between 0 and 1.

Do you mean that each feature vector (each row) sums to 1?
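
The two readings can be told apart with a quick check like the sketch
below. The small matrix is made-up data, used only to show the
difference between per-instance (row) sums and per-feature (column)
sums.

```python
import numpy as np

# Made-up matrix: 3 instances (rows) by 4 features (columns).
X = np.array([
    [0.10, 0.20, 0.30, 0.40],
    [0.25, 0.25, 0.25, 0.25],
    [0.70, 0.10, 0.10, 0.10],
])

# axis=1 sums across features, giving one total per instance.
rows_sum_to_one = bool(np.allclose(X.sum(axis=1), 1.0))

# axis=0 sums across instances, giving one total per feature.
cols_sum_to_one = bool(np.allclose(X.sum(axis=0), 1.0))

# All values should also lie in [0, 1].
in_unit_range = bool(((X >= 0) & (X <= 1)).all())

print(rows_sum_to_one, cols_sum_to_one, in_unit_range)
```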

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
