Re: [scikit-learn] Random Forest with Bootstrapping

2016-10-03 Thread Sebastian Raschka
Originally, this technique was used to estimate a sampling distribution. Think of the drawing with replacement as a work-around for generating *new* data from a population that is simulated by this repeated sampling from the given dataset with replacement. For more details, I’d recommend

Re: [scikit-learn] Random Forest with Bootstrapping

2016-10-03 Thread Ibrahim Dalal via scikit-learn
So what is the point of having duplicate entries in your training set? This seems like pure overhead. Sorry, but you will again have to help me here. On Tue, Oct 4, 2016 at 1:29 AM, Sebastian Raschka wrote: > > Hi, > > > > That helped a lot. Thank you very much. I have

Re: [scikit-learn] Random Forest with Bootstrapping

2016-10-03 Thread Sebastian Raschka
> Hi, > > That helped a lot. Thank you very much. I have one more (silly?) doubt though. > > Won't an n-sized bootstrapped sample have repeated entries? Say we have an > original dataset of size 100. A bootstrap sample (say, B) of size 100 is > drawn from this set. Since 32 of the original

Re: [scikit-learn] Random Forest with Bootstrapping

2016-10-03 Thread Ibrahim Dalal via scikit-learn
Hi, That helped a lot. Thank you very much. I have one more (silly?) doubt though. Won't an n-sized bootstrapped sample have repeated entries? Say we have an original dataset of size 100. A bootstrap sample (say, B) of size 100 is drawn from this set. Since 32 of the original samples are left

Re: [scikit-learn] Welcome Raghav to the core-dev team

2016-10-03 Thread Andreas Mueller
Congrats, hope to see lots more ;) On 10/03/2016 12:09 PM, Raghav R V wrote: Thanks everyone! Looking forward to contributing more :D On Mon, Oct 3, 2016 at 5:40 PM, Ronnie Ghose wrote: congrats! :) On Mon, Oct 3, 2016 at

Re: [scikit-learn] Random Forest with Bootstrapping

2016-10-03 Thread Sebastian Raschka
Or maybe more intuitively, you can visualize this asymptotic behavior, e.g., via

import matplotlib.pyplot as plt

vs = []
for n in range(5, 201, 5):
    v = 1 - (1. - 1./n)**n
    vs.append(v)

plt.plot([n for n in range(5, 201, 5)], vs, marker='o', markersize=6, alpha=0.5)

Re: [scikit-learn] Generate data from trained naive bayes

2016-10-03 Thread klo uo
Great. Thanks for your time Manoj Cheers, Klo On Mon, Oct 3, 2016 at 8:20 PM, Manoj Kumar wrote: > Let's say you would like to generate just the first feature of 1000 > samples with label 0. > > The distribution of the first feature conditioned on label 1

Re: [scikit-learn] Random Forest with Bootstrapping

2016-10-03 Thread Sebastian Raschka
Say the probability that a given sample from a dataset of size n is *not* drawn as a bootstrap sample is P(not_chosen) = (1 - 1/n)^n. Since you have a 1/n chance to draw a particular sample (since bootstrapping involves drawing with replacement), which you repeat n times to get an n-sized
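To make the limit behind this formula concrete (a small numeric sketch, not from the original thread), P(not_chosen) = (1 - 1/n)^n approaches 1/e ≈ 0.368 as n grows, so roughly 63.2% of the distinct samples appear in each bootstrap sample:

```python
import math

def p_not_chosen(n):
    # Probability that a particular sample is never drawn
    # in n draws with replacement from a dataset of size n.
    return (1.0 - 1.0 / n) ** n

for n in (10, 100, 1000, 100000):
    print(n, p_not_chosen(n))

# The limit as n -> infinity is 1/e, so about 63.2% of distinct
# samples end up in each bootstrap sample.
print(math.exp(-1))
```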

Re: [scikit-learn] Random Forest with Bootstrapping

2016-10-03 Thread Алексей Драль
Hi, From the docs http://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html : The RandomForestClassifier is trained using bootstrap aggregation, where each new tree is fit from a bootstrap sample of the training observations z_i = (x_i, y_i). The out-of-bag (OOB) error is the
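The OOB mechanism the linked example describes can be sketched like this (dataset and parameter values here are illustrative, not from the thread; each tree is scored on the samples left out of its bootstrap sample):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# bootstrap=True is the default; oob_score=True asks the forest to
# evaluate each tree on the ~36.8% of samples it did not see.
clf = RandomForestClassifier(n_estimators=100, oob_score=True,
                             bootstrap=True, random_state=0)
clf.fit(X, y)
print(clf.oob_score_)  # accuracy estimated on out-of-bag samples
```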

Re: [scikit-learn] Random Forest with Bootstrapping

2016-10-03 Thread Sebastian Raschka
> From whatever little knowledge I gained last night about Random Forests, each > tree is trained with a sub-sample of the original dataset (usually with > replacement)? Yes, that should be correct! > Now, what I am not able to understand is - if the entire dataset is used to train > each of the

Re: [scikit-learn] Generate data from trained naive bayes

2016-10-03 Thread Manoj Kumar
Let's say you would like to generate just the first feature of 1000 samples with label 0. The distribution of the first feature conditioned on label 1 follows a Bernoulli distribution (as suggested by the name) with parameter "exp(feature_log_prob_[0, 0])". You could then generate the first
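Manoj's recipe can be sketched as follows (a minimal illustration with synthetic binary data; the dataset and variable names are mine, not from the thread):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.RandomState(0)
X = rng.binomial(1, 0.3, size=(200, 5))   # synthetic binary features
y = rng.randint(0, 2, size=200)           # two classes

clf = BernoulliNB().fit(X, y)

# P(feature 0 = 1 | class 0), recovered from the fitted model
p = np.exp(clf.feature_log_prob_[0, 0])

# Generate the first feature of 1000 samples with label 0
samples = rng.binomial(1, p, size=1000)
print(samples.mean())  # close to p for large sample counts
```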

Re: [scikit-learn] Generate data from trained naive bayes

2016-10-03 Thread klo uo
Hi Manoj, thanks for your reply. Sorry to say, but I don't understand how to generate a new feature. In this example I have `X` with shape (1000, 64) with 5 unique classes. `feature_log_prob_` has shape (5, 64). I can generate, for example, uniform data with `r = np.random.rand(64)`. Now how can I

Re: [scikit-learn] Welcome Raghav to the core-dev team

2016-10-03 Thread Jacob Schreiber
Congrats Raghav! On Mon, Oct 3, 2016 at 10:06 AM, Sebastian Raschka wrote: > Congrats Raghav! And thanks a lot for all the great work on the > model_selection module! > > > On Oct 3, 2016, at 12:53 PM, Siddharth Gupta < > siddharthgupta...@gmail.com> wrote: > > > >

Re: [scikit-learn] Welcome Raghav to the core-dev team

2016-10-03 Thread Sebastian Raschka
Congrats Raghav! And thanks a lot for all the great work on the model_selection module! > On Oct 3, 2016, at 12:53 PM, Siddharth Gupta > wrote: > > Congrats Raghav! :D > > > On Oct 3, 2016 10:22 PM, "Aakash Agarwal" wrote: > Congrats

Re: [scikit-learn] Welcome Raghav to the core-dev team

2016-10-03 Thread Manoj Kumar
Congratulations! On Mon, Oct 3, 2016 at 12:21 PM, Nelle Varoquaux wrote: > Congratulations, Raghav! > > On 3 October 2016 at 08:40, Ronnie Ghose wrote: > > congrats! :) > > > > On Mon, Oct 3, 2016 at 11:28 AM, lin yenchen

Re: [scikit-learn] Welcome Raghav to the core-dev team

2016-10-03 Thread Nelle Varoquaux
Congratulations, Raghav! On 3 October 2016 at 08:40, Ronnie Ghose wrote: > congrats! :) > > On Mon, Oct 3, 2016 at 11:28 AM, lin yenchen > wrote: >> >> Congrats, Raghav! >> >> Nelson Liu wrote on Mon, Oct 3, 2016 at 11:27 PM: >>> >>> Yay!

Re: [scikit-learn] Welcome Raghav to the core-dev team

2016-10-03 Thread Ronnie Ghose
congrats! :) On Mon, Oct 3, 2016 at 11:28 AM, lin yenchen wrote: > Congrats, Raghav! > > Nelson Liu wrote on Mon, Oct 3, 2016 at 11:27 PM: > >> Yay! Congrats, Raghav! >> >> On Mon, Oct 3, 2016 at 8:14 AM, Gael Varoquaux < >> gael.varoqu...@normalesup.org> wrote: >>

Re: [scikit-learn] Welcome Raghav to the core-dev team

2016-10-03 Thread Krishna Kalyan
Congrats Raghav. :) On Mon, Oct 3, 2016 at 5:28 PM, lin yenchen wrote: > Congrats, Raghav! > > Nelson Liu wrote on Mon, Oct 3, 2016 at 11:27 PM: > >> Yay! Congrats, Raghav! >> >> On Mon, Oct 3, 2016 at 8:14 AM, Gael Varoquaux < >> gael.varoqu...@normalesup.org>

Re: [scikit-learn] Welcome Raghav to the core-dev team

2016-10-03 Thread lin yenchen
Congrats, Raghav! Nelson Liu wrote on Mon, Oct 3, 2016 at 11:27 PM: > Yay! Congrats, Raghav! > > On Mon, Oct 3, 2016 at 8:14 AM, Gael Varoquaux < > gael.varoqu...@normalesup.org> wrote: > > Hi, > > We have the pleasure to welcome Raghav RV to the core-dev team. Raghav > (@raghavrv) has been

Re: [scikit-learn] Welcome Raghav to the core-dev team

2016-10-03 Thread Nelson Liu
Yay! Congrats, Raghav! On Mon, Oct 3, 2016 at 8:14 AM, Gael Varoquaux < gael.varoqu...@normalesup.org> wrote: > Hi, > > We have the pleasure to welcome Raghav RV to the core-dev team. Raghav > (@raghavrv) has been working on scikit-learn for more than a year. In > particular, he implemented the

Re: [scikit-learn] Generate data from trained naive bayes

2016-10-03 Thread Manoj Kumar
Hi, feature_log_prob_ is an array of shape (n_classes, n_features). exp(feature_log_prob_[class_ind, feature_ind]) gives P(X_{feature_ind} = 1 | class_ind). Using the conditional independence assumptions of Naive Bayes, you can use this to sample each feature independently given the class. Hope
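Under that conditional-independence assumption, a whole feature vector for a given class can be sampled feature by feature. A hedged sketch (synthetic data; the helper name sample_class is mine, not a scikit-learn API):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.RandomState(42)
X = rng.binomial(1, 0.4, size=(300, 8))  # synthetic binary features
y = rng.randint(0, 3, size=300)          # three classes

clf = BernoulliNB().fit(X, y)

def sample_class(clf, class_ind, n_samples, rng):
    # Each feature is an independent Bernoulli given the class,
    # with success probability exp(feature_log_prob_[class_ind, j]).
    probs = np.exp(clf.feature_log_prob_[class_ind])
    return rng.binomial(1, probs, size=(n_samples, probs.shape[0]))

generated = sample_class(clf, 0, 100, rng)
print(generated.shape)  # (100, 8)
```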

[scikit-learn] Welcome Raghav to the core-dev team

2016-10-03 Thread Gael Varoquaux
Hi, We have the pleasure to welcome Raghav RV to the core-dev team. Raghav (@raghavrv) has been working on scikit-learn for more than a year. In particular, he implemented the rewrite of the cross-validation utilities, which is quite dear to my heart. Welcome Raghav! Gaël

Re: [scikit-learn] Generate data from trained naive bayes

2016-10-03 Thread klo uo
On Mon, Oct 3, 2016 at 5:08 PM, klo uo wrote: > I can see how can I sample from `feature_log_prob_`... > I meant I cannot see

Re: [scikit-learn] sample_weight for cohen_kappa_score

2016-10-03 Thread Andreas Mueller
Hm, it sounds like "weights" should have been called "weighting", maybe? Not sure if it's worth changing now, as we released it already. And I think passing the weighting to the confusion matrix is correct. There should be tests for weighted metrics to confirm that. PR welcome. On 10/03/2016
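For context on the naming question, the "weights" parameter of cohen_kappa_score selects the weighting scheme applied to disagreements (a small sketch with made-up labels):

```python
from sklearn.metrics import cohen_kappa_score

y1 = [0, 1, 2, 2, 1, 0, 1, 2]
y2 = [0, 2, 2, 1, 1, 0, 1, 2]

# weights=None treats all disagreements equally; 'linear' and
# 'quadratic' penalize disagreements by the distance between the
# labels -- i.e. "weights" names a weighting scheme, not per-sample
# weights.
plain = cohen_kappa_score(y1, y2)
linear = cohen_kappa_score(y1, y2, weights='linear')
print(plain, linear)
```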

Re: [scikit-learn] Generate data from trained naive bayes

2016-10-03 Thread Andreas Mueller
Hi Klo. Yes, you could, but as the model is very simple, that's usually not very interesting. It stores, for each label, an independent Bernoulli distribution for each feature. These are stored in feature_log_prob_. I would suggest you look at this attribute, rather than sample from the