Re: [scikit-learn] scikit-learn Digest, Vol 43, Issue 38

Nicolas Hug Fri, 25 Oct 2019 04:37:32 -0700

It's in the making for the new histogram-based GB estimators, but the other GB estimators like GradientBoostingRegressor and GradientBoostingClassifier already support sample_weight. Just pass the weights in the fit method: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier.fit
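
As a minimal sketch of the above, assuming synthetic data and a made-up weighting scheme (the `make_classification` dataset and the 3:1 class weights are illustrative assumptions, not from this thread), the weights go in as the `sample_weight` argument of `fit`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hypothetical weighting: samples of class 1 count three times as much
# as samples of class 0. Any non-negative per-sample array would do.
sample_weight = np.where(y == 1, 3.0, 1.0)

clf = GradientBoostingClassifier(n_estimators=20, random_state=0)

# One weight per training sample, passed at fit time.
clf.fit(X, y, sample_weight=sample_weight)

pred = clf.predict(X)
print(pred[:10])
```

The same keyword works for GradientBoostingRegressor.fit. How to choose the weights depends on the problem; the values here only show the mechanics.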


On 10/25/19 3:39 AM, Adrin wrote:

Hi,

it's in the making: https://github.com/scikit-learn/scikit-learn/pull/14696

On Fri, Oct 25, 2019 at 4:23 AM WONG Wing Mei <[email protected]> wrote:


    Can I ask whether we can use sample weight in gradient boosting?
    And how to do it?

    -----Original Message-----
    From: scikit-learn [mailto:scikit-learn-bounces+wong.wingmei
    <mailto:scikit-learn-bounces%2Bwong.wingmei>[email protected]
    <mailto:[email protected]>] On Behalf Of
    [email protected]
    <mailto:[email protected]>
    Sent: Friday, October 25, 2019 12:00 AM
    To: [email protected] <mailto:[email protected]>
    Subject: scikit-learn Digest, Vol 43, Issue 38


    Today's Topics:

       1. Re: Decision tree results sometimes different with scaled
          data (Alexandre Gramfort)
       2. Reminder: Monday October 28th meeting (Adrin)


    ----------------------------------------------------------------------

    Message: 1
    Date: Thu, 24 Oct 2019 14:09:01 +0200
    From: Alexandre Gramfort <[email protected]
    <mailto:[email protected]>>
    To: Scikit-learn mailing list <[email protected]
    <mailto:[email protected]>>
    Subject: Re: [scikit-learn] Decision tree results sometimes different
            with scaled data
    Message-ID:
    <cadeotzrh_bxhaqv6wdnrout4zxw_+eobj6_vmwma50anahk...@mail.gmail.com>
    Content-Type: text/plain; charset="utf-8"

    another reason is that we take as threshold the midpoint between sample
    values, which is not invariant to arbitrary scaling of the features

    Alex
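
A rough numerical sketch of the midpoint remark, under stated assumptions: this is not scikit-learn's splitter code, the float32 dtype and the standardization constants `mean` and `std` are hypothetical choices for illustration, and the random values only mimic the range of iris-like features. It compares "take the midpoint, then scale" with "scale, then take the midpoint", which agree in exact arithmetic but need not agree bit for bit in floating point:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pairs of neighbouring feature values (illustrative random data).
a = rng.uniform(4.0, 8.0, size=10_000).astype(np.float32)
b = rng.uniform(4.0, 8.0, size=10_000).astype(np.float32)

# Hypothetical standardization constants (assumed, not taken from iris).
mean, std = np.float32(5.8), np.float32(0.83)


def scale(x):
    """Standardize x with the assumed mean and standard deviation."""
    return (x - mean) / std


# Threshold computed on raw data, then mapped to the scaled space.
mid_then_scale = scale(a / 2 + b / 2)

# Threshold computed directly on the scaled data.
scale_then_mid = scale(a) / 2 + scale(b) / 2

n_diff = int(np.sum(mid_then_scale != scale_then_mid))
max_gap = float(np.max(np.abs(mid_then_scale - scale_then_mid)))

print(f"pairs with a different threshold: {n_diff} / {a.size}")
print(f"largest absolute gap: {max_gap:.3e}")
```

Any gap is on the order of rounding error, so it can only change a prediction when a test value falls extremely close to the threshold.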



    On Tue, Oct 22, 2019 at 11:56 AM Guillaume Lemaître
    <[email protected]> wrote:

    > Even with the same random state, it can happen that several features
    > lead to an equally good best split, and the split is then chosen randomly
    > (even with the seed fixed; this is reported as an issue, I think).
    > Therefore, the rest of the tree could be different, leading to different
    > predictions.
    >
    > Another possibility is that we compute the difference between the current
    > threshold and the next one to be tried, and only check the entropy if it is
    > larger than a specific value (I would need to check the source code). After
    > scaling, it could happen that 2 feature values become too close to be
    > considered as a potential split, which will make a difference between
    > scaled and unscaled features. But this diff should be really small.
    >
    > This is what I can think of off the top of my head.
    >
    > Sent from my phone - sorry to be brief and potential misspell.
    > *From:* [email protected]
    <mailto:[email protected]>
    > *Sent:* 22 October 2019 11:34
    > *To:* [email protected] <mailto:[email protected]>
    > *Reply to:* [email protected] <mailto:[email protected]>
    > *Subject:* [scikit-learn] Decision tree results sometimes different with
    > scaled data
    >
    > Hi all,
    >
    > First, let me thank you for the great job you guys are doing developing
    > and maintaining such a popular library!
    >
    > As we all know, decision trees are not impacted by scaled data because
    > splits don't take into account distances between two values within a
    > feature.
    >
    > However, I experienced a strange behavior using the sklearn decision tree
    > algorithm. Sometimes the results of the model differ depending on whether
    > the input data has been scaled or not.
    >
    > To illustrate my point I ran experiments on the iris dataset consisting of:
    >
    >    - perform a train/test split
    >    - fit the training set and predict the test set
    >    - fit and predict again with standardized inputs (removing the mean
    >    and scaling to unit variance)
    >    - compare both model predictions
    >
    > The experiment was run 10,000 times with different random seeds (cf.
    > traceback and code to reproduce it at the end).
    > The results show that a bit more than 10% of the time we find at least
    > one different prediction. Fortunately, when that is the case only a few
    > predictions differ, 1 or 2 most of the time. I checked the inputs causing
    > different predictions, and they are not the same from one run to the next.
    >
    > I'm worried that the rate of different predictions could be larger for
    > other datasets...
    > Do you have an idea where it comes from, maybe floating point errors,
    > or am I doing something wrong?
    >
    > Cheers,
    > Geoffrey
    >
    >
    > ------------------------------------------------------------
    > Traceback:
    > ------------------------------------------------------------
    > Error rate: 12.22%
    >
    > Seed: 241862
    > All pred equal: False
    > Not scale data confusion matrix:
    > [[16  0  0]
    > [ 0 17  0]
    > [ 0  4 13]]
    > Scale data confusion matrix:
    > [[16  0  0]
    > [ 0 15  2]
    > [ 0  4 13]]
    > ------------------------------------------------------------
    > Code:
    > ------------------------------------------------------------
    > import numpy as np
    >
    > from sklearn.datasets import load_iris
    > from sklearn.metrics import confusion_matrix
    > from sklearn.model_selection import train_test_split
    > from sklearn.preprocessing import StandardScaler
    > from sklearn.tree import DecisionTreeClassifier
    >
    >
    > X, y = load_iris(return_X_y=True)
    >
    > def run_experiment(X, y, seed):
    >     X_train, X_test, y_train, y_test = train_test_split(
    >             X,
    >             y,
    >             stratify=y,
    >             test_size=0.33,
    >             random_state=seed
    >         )
    >
    >     scaler = StandardScaler()
    >
    >     X_train_scaled = scaler.fit_transform(X_train)
    >     X_test_scaled = scaler.transform(X_test)
    >
    >     clf = DecisionTreeClassifier(random_state=seed)
    >     clf_scaled = DecisionTreeClassifier(random_state=seed)
    >
    >     clf.fit(X_train, y_train)
    >     clf_scaled.fit(X_train_scaled, y_train)
    >
    >     pred = clf.predict(X_test)
    >     pred_scaled = clf_scaled.predict(X_test_scaled)
    >
    >     err = 0 if all(pred == pred_scaled) else 1
    >
    >     return err, y_test, pred, pred_scaled
    >
    >
    > n_err, n_run, seed_err = 0, 10000, None
    >
    > for _ in range(n_run):
    >     seed = np.random.randint(10000000)
    >     err, _, _, _ = run_experiment(X, y, seed)
    >     n_err += err
    >
    >     # keep aside last seed causing an error
    >     seed_err = seed if err == 1 else seed_err
    >
    >
    > print(f'Error rate: {round(n_err / n_run * 100, 2)}%', end='\n\n')
    >
    > _, y_test, pred, pred_scaled = run_experiment(X, y, seed_err)
    >
    > print(f'Seed: {seed_err}')
    > print(f'All pred equal: {all(pred == pred_scaled)}')
    > print(f'Not scale data confusion matrix:\n{confusion_matrix(y_test, pred)}')
    > print(f'Scale data confusion matrix:\n{confusion_matrix(y_test, pred_scaled)}')
    > _______________________________________________
    > scikit-learn mailing list
    > [email protected] <mailto:[email protected]>
    > https://mail.python.org/mailman/listinfo/scikit-learn
    >

    ------------------------------

    Message: 2
    Date: Thu, 24 Oct 2019 17:10:26 +0200
    From: Adrin <[email protected] <mailto:[email protected]>>
    To: Scikit-learn mailing list <[email protected]
    <mailto:[email protected]>>
    Subject: [scikit-learn] Reminder: Monday October 28th meeting
    Message-ID:
           
    <caeorw48htwpxlwz2daksbas5utepg6kc_xgrwwvtdocvtd7...@mail.gmail.com
    <mailto:caeorw48htwpxlwz2daksbas5utepg6kc_xgrwwvtdocvtd7...@mail.gmail.com>>
    Content-Type: text/plain; charset="utf-8"

    Hi Scikit-learn people,

    This is a reminder that we'll be having our monthly call on Monday.

    Please put your thoughts and important topics you have in mind on
    the project board:
    https://github.com/scikit-learn/scikit-learn/projects/15

    We'll be meeting on https://appear.in/amueller

    As usual, it'd be nice to have them on the board before the weekend :)

    See you on Monday,
    Adrin.

    ------------------------------


    End of scikit-learn Digest, Vol 43, Issue 38
    ********************************************



_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn

