Re: [scikit-learn] How to deal with hierarchical and real-time analysis in machine learning?

2019-02-13 Thread Max Halford
Hey lampahome,

I'm currently working on an online learning library called creme:
https://creme-ml.github.io/. Each estimator and transformer has a
fit_one(x, y) method so that you can learn from a stream of data. I've
only been working on it for a bit less than a month now but it might
be of interest to you nonetheless. Maybe it will give you some ideas.
There's an introductory tutorial on GitHub.
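
If it helps, here is a minimal scikit-learn sketch of the partial_fit
pattern you mention below, with MiniBatchKMeans standing in as an
illustrative online clusterer (the data generator and shapes are made up
for the example):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

model = MiniBatchKMeans(n_clusters=8, random_state=0)

def stream_of_batches():
    # stand-in for your real data source, yielding small chunks
    for _ in range(100):
        yield np.random.rand(256, 4)  # 256 points with 4 features per chunk

for batch in stream_of_batches():
    # update the clusterer incrementally, without keeping old chunks in memory
    model.partial_fit(batch)

labels = model.predict(np.random.rand(10, 4))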

Kind regards.

On 13/02/2019, lampahome  wrote:
> For example, I may have many different regions, and each region has a varying
> number of points.
>
> And I also want to analyze the newest and the older data in real time, but I
> don't want to keep all the data in memory because I don't have enough memory.
>
> What I thought I could use is partial_fit to accept streaming data when new
> data comes in.
>
> But the incoming data is hierarchical, and it's hard to cluster it because I
> don't have the older and newer data together to cluster.
>
> How can I design the system better?
>
> thx
>


-- 
Max Halford
+336 28 25 13 38
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] cross_validate() with HMM

2019-02-13 Thread Anni Bauer
Hi! I want to be able to run each fold of a k-fold cross-validation in parallel, 
using all of my 6 CPUs at once. My model is a hidden Markov model. I want to 
train it on the training portion of the data, then extract the anomaly score 
(negative log-likelihood) of each sequence in the test portion of every fold, 
and use ROC as an evaluation technique for each fold.

I have found the function cross_validate(), which seems to provide the option of 
running things in parallel with n_jobs=-1.
I assume the estimator is then my HMM model.
As of now I'm using pomegranate to train the model and extract the anomaly 
score of the test sequences.
I don't understand how to call the cross_validate function with the right 
arguments for my HMM model. All the examples I've seen haven't used an HMM. I'm 
confused about where to specify the number of hidden states if I'm not calling my 
usual pomegranate function from_samples(), which I've used before.

Also, how can I extract the anomaly scores within each fold using this function?
I'm unsure what exactly is happening inside the cross_validate function and 
how to control it the way I need.

If anyone has an example or explanation or another idea on how to run the folds 
in parallel, I would really appreciate it!

This is my attempt at using cross_validate, which gets stuck or seems to not be 
running through (although I'm quite sure I'm not using it properly):

import pomegranate
from sklearn.model_selection import cross_validate

model = pomegranate.HiddenMarkovModel()

results = cross_validate(model, listToUse, y=None, groups=None, scoring=None,
                         cv=3, n_jobs=-1, verbose=10)

print(results)
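
Would something like the following be the right direction? It's only a
sketch of a small scikit-learn-compatible wrapper around the pomegranate
model, so that cross_validate can clone, fit and score it per fold; the
class name, the n_components value and the NormalDistribution choice are
just my guesses:

import numpy as np
import pomegranate
from sklearn.base import BaseEstimator
from sklearn.model_selection import cross_validate

class HMMScorer(BaseEstimator):
    """Fits a pomegranate HMM and scores held-out sequences by log-likelihood."""

    def __init__(self, n_components=5):
        self.n_components = n_components

    def fit(self, X, y=None):
        # learn the HMM structure and parameters from the training sequences
        self.model_ = pomegranate.HiddenMarkovModel.from_samples(
            pomegranate.NormalDistribution,
            n_components=self.n_components,
            X=X)
        return self

    def score(self, X, y=None):
        # mean log-likelihood of the held-out sequences (higher = less anomalous)
        return np.mean([self.model_.log_probability(seq) for seq in X])

results = cross_validate(HMMScorer(n_components=5), listToUse,
                         cv=3, n_jobs=-1, verbose=10)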


Below is how I've manually set my cross-validation up as of now:

import time
import numpy
from sklearn.model_selection import KFold

listExample = []
kfold = KFold(n_splits=10, shuffle=True)
for train, test in kfold.split(listToUse):
    listExample.append([listToUse[train], listToUse[test]])

scoreList = []

for ex in listExample:

    # hmm.hmm is my own helper that trains a pomegranate HMM on the training split
    hmmModel = hmm.hmm(ex[0])
    scoreListFold = []

    mid = time.time()

    # score every held-out sequence by its log-likelihood under the trained model
    for li in ex[1]:
        prob = hmmModel.log_probability(li)
        scoreListFold.append(prob)

    scoreList.append(numpy.mean(scoreListFold))

avg = numpy.mean(scoreList)

Thanks again!

Anni
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] Sprint discussion points?

2019-02-13 Thread Andreas Mueller

Hey all.

Should we collect some discussion points for the sprint?

There's an unusual number of core devs present and I think we should 
seize the opportunity.

Maybe we should create a page in the wiki or add it to the sprint page?

Things that are high on my list of priorities are:

 * slicing pipelines
 * add get_feature_names to pipelines
 * freezing estimator
 * faster multi-metric scoring
 * fit_transform doing something other than fit.transform
 * imbalanced-learn interface / subsampling in pipelines
 * Specifying search spaces and valid hyper parameters
   (https://github.com/scikit-learn/scikit-learn/issues/13031).
 * allowing EstimatorCV-style speed-up in GridSearches
 * storing pandas column names and using them as feature names
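
To make the first two items a bit more concrete, here is a toy sketch of
the kind of user code they would enable; note that neither the slicing
syntax nor get_feature_names on Pipeline exists yet, this is just what the
proposal could look like:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge

X, y = np.random.rand(20, 3), np.random.rand(20)
pipe = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2), Ridge())
pipe.fit(X, y)

preprocessing = pipe[:-1]                  # "slicing pipelines": everything but the final estimator
names = preprocessing.get_feature_names()  # proposed: names of the derived polynomial features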


Trying to discuss all of these might be too much, but maybe we can 
figure out a subset and make sure we have SLEPs to discuss?
Most of these issues are on the roadmap; issue 13031 is related to #18 
but not directly on the roadmap.


Thanks,
Andy
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Sprint discussion points?

2019-02-13 Thread Joel Nothman
Yes, I was thinking the same. I think there are some other core issues to
solve, such as:

* euclidean_distances numerical issues
* commitment to ARM testing and debugging
* logistic regression stability

We should also nut out OPTICS issues or remove it from 0.21. I'm still keen
on trying to work out sample props (supporting weighted scoring at least),
but perhaps I'm being persuaded this will never be a top-priority
requirement, and the solutions add much complexity.

On Thu, 14 Feb 2019 at 07:39, Andreas Mueller  wrote:

> Hey all.
>
> Should we collect some discussion points for the sprint?
>
> There's an unusual amount of core-devs present and I think we should seize
> the opportunity.
> Maybe we should create a page in the wiki or add it to the sprint page?
>
> Things that are high on my list of priorities are:
>
>- slicing pipelines
>- add get_feature_names to pipelines
>- freezing estimator
>- faster multi-metric scoring
>- fit_transform doing something other than fit.transform
>- imbalance-learn interface / subsampling in pipelines
>- Specifying search spaces and valid hyper parameters (
>https://github.com/scikit-learn/scikit-learn/issues/13031).
>- allowing EstimatorCV-style speed-up in GridSearches
>- storing pandas column names and using them as feature names
>
>
> Trying to discuss all of these might be too much, but maybe we can figure
> out a subset and make sure we have SLEPs to discuss?
> Most of these issues are on the roadmap, issue 13031 is related to #18
> not directly on the roadmap.
>
> Thanks,
> Andy
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Sprint discussion points?

2019-02-13 Thread Andreas Mueller
Do you have a reference for the logistic regression stability? Is it 
convergence warnings?


Happy to discuss the other two issues, though they seem easier 
than most of what's on my list.


I have no idea what's going on with OPTICS tbh, and I'll leave it up to 
you and the others to decide whether that's something we should discuss.
I can try to read up and weigh in but that might not be the most 
effective way to do it.


Sample props are something I left out because I personally don't feel 
they're a priority compared to all the other things;
my students have basically no way to figure out which features the 
coefficients in their linear model correspond to, which seems a bit more 
important to me.


We can put it on the discussion list again, but I'm not super 
enthusiastic about it.


How should we prioritize things?


On 2/13/19 8:08 PM, Joel Nothman wrote:
Yes, I was thinking the same. I think there are some other core issues 
to solve, such as:


* euclidean_distances numerical issues
* commitment to ARM testing and debugging
* logistic regression stability

We should also nut out OPTICS issues or remove it from 0.21. I'm still 
keen on trying to work out sample props (supporting weighted scoring 
at least), but perhaps I'm being persuaded this will never be a 
top-priority requirement, and the solutions add much complexity.


On Thu, 14 Feb 2019 at 07:39, Andreas Mueller wrote:


Hey all.

Should we collect some discussion points for the sprint?

There's an unusual amount of core-devs present and I think we
should seize the opportunity.
Maybe we should create a page in the wiki or add it to the sprint
page?

Things that are high on my list of priorities are:

  * slicing pipelines
  * add get_feature_names to pipelines
  * freezing estimator
  * faster multi-metric scoring
  * fit_transform doing something other than fit.transform
  * imbalance-learn interface / subsampling in pipelines
  * Specifying search spaces and valid hyper parameters
(https://github.com/scikit-learn/scikit-learn/issues/13031).
  * allowing EstimatorCV-style speed-up in GridSearches
  * storing pandas column names and using them as feature names


Trying to discuss all of these might be too much, but maybe we can
figure out a subset and make sure we have SLEPs to discuss?
Most of these issues are on the roadmap, issue 13031 is related to
#18 but not directly on the roadmap.

Thanks,
Andy
___
scikit-learn mailing list
scikit-learn@python.org 
https://mail.python.org/mailman/listinfo/scikit-learn


___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Sprint discussion points?

2019-02-13 Thread Joel Nothman
Convergence in logistic regression (
https://github.com/scikit-learn/scikit-learn/issues/11536) is indeed one
problem (and it presents a general issue of what max_iter means when you
have several solvers, or how good defaults are selected). But I was sure we
had problems with non-determinism on some platforms... but now I can't find them.
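
For reference, a minimal example of the convergence side of it that
typically raises a ConvergenceWarning (the dataset and the iteration
budget are arbitrary, just small enough to trigger the warning):

import warnings
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # a deliberately tight iteration budget; lbfgs usually won't converge here
    LogisticRegression(solver="lbfgs", max_iter=5).fit(X, y)

print([str(w.message) for w in caught])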

> my students have basically no way to figure out what features the
coefficients in their linear model correspond to, that seems a bit more
important to me.

Yes, I agree... Assuming coefficients are helpful, rather than using
permutation-based measures of importance, for instance.

I generally think a review of distances might be a good thing at some
point, given the confusing triplication across sklearn.neighbors,
sklearn.metrics.pairwise, and scipy.spatial... and that minkowski with p=2 is
not implemented the same way as euclidean.
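
For concreteness, the kind of script that can exhibit the numerical
discrepancy (the large offset only amplifies floating-point cancellation;
exact numbers depend on platform, dtype and version):

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.RandomState(0)
X = (rng.rand(100, 10) + 1000).astype(np.float32)  # large offset, float32

d_sklearn = euclidean_distances(X)  # uses the expanded x^2 - 2xy + y^2 formulation
d_scipy = cdist(X, X)               # computes sqrt(sum((x - y)^2)) directly

print(np.abs(d_sklearn - d_scipy).max())  # often visibly non-zero in float32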


On Thu, 14 Feb 2019 at 12:56, Andreas Mueller  wrote:

> Do you have a reference for the logistic regression stability? Is it
> convergence warnings?
>
> Happy to discuss the other two issues, though I feel they seem easier than
> most of what's on my list.
>
> I have no idea what's going on with OPTICS tbh, and I'll leave it up to
> you and the others to decide whether that's something we should discuss.
> I can try to read up and weigh in but that might not be the most effective
> way to do it.
>
> the sample props is something I left out because I personally don't feel
> it's a priority compared to all the other things;
> my students have basically no way to figure out what features the
> coefficients in their linear model correspond to, that seems a bit more
> important to me.
>
> We can put it on the discussion list again, but I'm not super enthusiastic
> about it.
>
> How should we prioritize things?
>
>
> On 2/13/19 8:08 PM, Joel Nothman wrote:
>
> Yes, I was thinking the same. I think there are some other core issues to
> solve, such as:
>
> * euclidean_distances numerical issues
> * commitment to ARM testing and debugging
> * logistic regression stability
>
> We should also nut out OPTICS issues or remove it from 0.21. I'm still
> keen on trying to work out sample props (supporting weighted scoring at
> least), but perhaps I'm being persuaded this will never be a top-priority
> requirement, and the solutions add much complexity.
>
> On Thu, 14 Feb 2019 at 07:39, Andreas Mueller  wrote:
>
>> Hey all.
>>
>> Should we collect some discussion points for the sprint?
>>
>> There's an unusual amount of core-devs present and I think we should
>> seize the opportunity.
>> Maybe we should create a page in the wiki or add it to the sprint page?
>>
>> Things that are high on my list of priorities are:
>>
>>- slicing pipelines
>>- add get_feature_names to pipelines
>>- freezing estimator
>>- faster multi-metric scoring
>>- fit_transform doing something other than fit.transform
>>- imbalance-learn interface / subsampling in pipelines
>>- Specifying search spaces and valid hyper parameters (
>>https://github.com/scikit-learn/scikit-learn/issues/13031).
>>- allowing EstimatorCV-style speed-up in GridSearches
>>- storing pandas column names and using them as feature names
>>
>>
>> Trying to discuss all of these might be too much, but maybe we can figure
>> out a subset and make sure we have SLEPs to discuss?
>> Most of these issues are on the roadmap, issue 13031 is related to #18
>> but not directly on the roadmap.
>>
>> Thanks,
>> Andy
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn