Re: [scikit-learn] Why ridge regression can solve multicollinearity?

2020-01-08 Thread lampahome
Stuart Reynolds wrote on Thu, Jan 9, 2020 at 10:33 AM: > Correlated features typically have the property that they tend to > be similarly predictive of the outcome. > > L1 and L2 are both a preference for low coefficients. > If a coefficient can be reduced yet another coefficient maintains similar > lo

[scikit-learn] Why ridge regression can solve multicollinearity?

2020-01-08 Thread lampahome
I have found that many blogs say L2 regularization solves multicollinearity, but they don't say how it works. I thought LASSO can select features through L1 regularization, so maybe it can also solve this. Can anyone tell me how ridge handles multicollinearity so well? Thanks.
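A small synthetic sketch (not from the thread) may help make the question concrete: two nearly identical features, fitted once with ordinary least squares and once with ridge. The data, the noise levels, and `alpha=1.0` are assumptions chosen for illustration. With near-duplicate columns, least squares can place large offsetting weights on the two features, while the L2 penalty tends to split the weight almost evenly between them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-3, size=n)   # x2 is almost a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# OLS is free to trade weight between x1 and x2; ridge penalizes large
# coefficients, so it shares the total weight (about 3) between them.
print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
```

In this sketch the two ridge coefficients should come out close to each other and sum to roughly 3, whereas the least-squares pair depends heavily on the noise.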

[scikit-learn] Is there possible to combine multiple patterns in one regression model?

2019-10-31 Thread lampahome
I have an idea to predict the usage of every block of a disk, and I found that the access patterns of blocks are related to time. Ex: block indices 0~100 have high access counts at 00:00, 12:00, and 18:00, each for 10 minutes. Other block indices 1000~1100 have high access counts at 05:00, 14:00, and 20:00, each for 10 min

Re: [scikit-learn] Any recommend way to encode IP address?

2019-08-20 Thread lampahome
Chris Aridas wrote on Fri, Aug 16, 2019 at 5:26 PM: > It was just an idea about how you can extract features from IP addresses, > not a direction to use that service. > > If I just encode the IP address, is there any efficient way? What I found reliable is arithmetic encoding and converting the IP string to an integer
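For the string-to-integer idea mentioned above, a minimal sketch using only the Python standard library could look like this. The helper names `ip_to_int` and `ipv4_to_octets` are hypothetical, and whether a single integer or four octet features suits a given model is an open choice, not something the thread settles.

```python
import ipaddress

def ip_to_int(ip: str) -> int:
    """Map an IPv4 or IPv6 address string to a single integer."""
    return int(ipaddress.ip_address(ip))

def ipv4_to_octets(ip: str) -> list:
    """Split an IPv4 address into its four byte values (0-255)."""
    return list(ipaddress.IPv4Address(ip).packed)

print(ip_to_int("192.168.0.1"))       # 3232235521
print(ipv4_to_octets("192.168.0.1"))  # [192, 168, 0, 1]
```

Both encodings keep the data compact compared with one-hot encoding, but they impose an ordering on addresses that a model may or may not find meaningful.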

Re: [scikit-learn] Any recommend way to encode IP address?

2019-08-16 Thread lampahome
Chris Aridas wrote on Fri, Aug 16, 2019 at 3:56 PM: > Hey, > > Apart from encoding you could use feature engineering. Something like this > https://ipgeolocation.io/documentation/ip-geolocation-api.html > Two IPs might have the same country but different city. So, you could mix > and match whatever you want.

[scikit-learn] Any recommend way to encode IP address?

2019-08-16 Thread lampahome
I collect data containing many access logs from different IPs, but I don't know the best way to encode them so that the training data stays small and the different IPs remain independent. 1. One-hot encoding: with too many IPs, the training data will occupy huge disk space. 2. Category encoding: IP

[scikit-learn] Any machine learning used in storage company?

2019-07-19 Thread lampahome
Is machine learning applied at storage companies? Can anyone briefly describe which applications are used at which companies? Thanks. ___ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn

[scikit-learn] Can I pre-calculate parameter threshold of Birch?

2019-07-08 Thread lampahome
The threshold bounds the radius of the sphere that encloses the points of a subcluster. When I tune parameters, I don't know what range of threshold values to search. Can I pre-calculate the threshold?

[scikit-learn] What's the principle of partial_fit?

2019-07-01 Thread lampahome
I work with partial_fit of Birch because the dataset is too large to load into memory, so I cluster the data batch by batch. E.g.: I have 5 samples and every batch contains 1000 samples. I found the clustering result is better if I cluster data which contains part of the last batch than if I cluster data wh

[scikit-learn] Any drawbacks when using partial_fit?

2019-06-27 Thread lampahome
I am trying to use Birch to cluster time-series data incrementally. Because of insufficient memory, I train it batch by batch: 50 batches of 1000 samples each. I found that when I train only on the first batch, it clusters well. After that first training, I train the following batches with the same model and

Re: [scikit-learn] Any way to pre-calculate number of cluster roughly?

2019-06-26 Thread lampahome
Jamie Bull wrote on Wed, Jun 26, 2019 at 11:02 PM: > A common rule of thumb is number of clusters = sqrt(number of items/2) > http://www.ijarcsms.com/docs/paper/volume1/issue6/V1I6-0015.pdf > >> >> If I find that the number is too large, how do I merge those groups? Calculate the silhouette score of each group or el
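The rule of thumb quoted above is simple enough to write down directly. This is only a sketch of that formula; the function name `rough_k` and the rounding are assumptions, and the result is a starting point for a search rather than a final answer.

```python
import math

def rough_k(n_items: int) -> int:
    """Rule-of-thumb cluster count: sqrt(n_items / 2), at least 1."""
    return max(1, round(math.sqrt(n_items / 2)))

print(rough_k(5000))  # 50
print(rough_k(200))   # 10
```

A value obtained this way could then be used as the center of a narrow range of candidate cluster counts instead of scanning every value from 1 to N.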

[scikit-learn] Any way to pre-calculate number of cluster roughly?

2019-06-26 Thread lampahome
I see many methods such as the elbow method and the silhouette score, but they all determine the number of clusters after clustering. With the elbow method in particular, I need to track how the result changes with the number of clusters and find the elbow. But if the dataset is too huge to let me find the elbow and I don't even know how many cluster n

[scikit-learn] Is there any general way to make clustering huge time-series dataset better?

2019-06-20 Thread lampahome
I have a huge time-series dataset that I have to load batch by batch. My procedure is: 1) scale to (0~1); 2) shuffle (because I use Birch, not MiniBatchKMeans); 3) train a Birch model with partial_fit; 4) evaluate with silhouette_score (larger is better). I use Birch because it has partial_fit and no ne

Re: [scikit-learn] How to tune parameters when using partial_fit

2019-06-11 Thread lampahome
I know there's no built-in way to tune parameters batch by batch. I'm curious whether there is any suitable/general way to do it, because the distribution is not easy to know when the dataset is too large to load into memory.

[scikit-learn] How to tune parameters when using partial_fit

2019-06-10 Thread lampahome
As the title says, I am trying to cluster a huge dataset, but I don't know how to tune parameters when clustering. For a small dataset I can use GridSearchCV, but what should I do for a huge one? Thanks.

Re: [scikit-learn] fit before partial_fit ?

2019-06-09 Thread lampahome
federico vaggi wrote on Fri, Jun 7, 2019 at 1:08 AM: > k-means isn't a convex problem, unless you freeze the initialization, you > are going to get very different solutions (depending on the dataset) with > different initializations. > > Nope, I specified random_state=0. You can try it. >>> x = np.array([[1,

[scikit-learn] Tune parameters when I need to load data segment by segment?

2019-06-09 Thread lampahome
As the title says, I have a huge dataset to load, so I need to train on it incrementally. I load the data segment by segment and train segment by segment, e.g. with MiniBatchKMeans. In that situation, how do I tune parameters? On the first part of the data, or on every part?

[scikit-learn] fit before partial_fit ?

2019-06-06 Thread lampahome
I tried MiniBatchKMeans in two orders: fit -> partial_fit, and partial_fit -> partial_fit. The clustering results are different. What is the difference between them?
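The two orders from the question can be compared on a toy case. The sketch below uses a single cluster and two well-separated synthetic batches (both assumptions, chosen so the arithmetic is easy to follow). As far as I understand the implementation, `fit` runs many mini-batch updates and therefore accumulates a large per-cluster count, so a later `partial_fit` moves the center only a little, whereas each `partial_fit` call performs a single update.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(0)
batch_a = rng.normal(loc=0.0, scale=1.0, size=(500, 1))    # points near 0
batch_b = rng.normal(loc=100.0, scale=1.0, size=(500, 1))  # points near 100

# Order 1: fit -> partial_fit
fit_then_partial = MiniBatchKMeans(n_clusters=1, n_init=3, random_state=0)
fit_then_partial.fit(batch_a)          # many internal updates on batch_a
fit_then_partial.partial_fit(batch_b)  # one further update on batch_b

# Order 2: partial_fit -> partial_fit
incremental = MiniBatchKMeans(n_clusters=1, n_init=3, random_state=0)
incremental.partial_fit(batch_a)       # one update on batch_a
incremental.partial_fit(batch_b)       # one update on batch_b

print("fit -> partial_fit center:        ", fit_then_partial.cluster_centers_[0, 0])
print("partial_fit -> partial_fit center:", incremental.cluster_centers_[0, 0])
```

In the second order the center should land near the midpoint of the two batches, because both batches were seen exactly once; in the first order it should stay closer to the first batch.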

[scikit-learn] Any way to tune threshold of Birch rather than GridSearchCV?

2019-06-05 Thread lampahome
I use Birch to cluster my data, which is a kind of time-series data. I don't know the actual number of clusters and need to read large data (online learning), so I chose Birch rather than MiniBatchKMeans. When I read about it, I found the critical parameters might be branching_factor and threshold, and th

[scikit-learn] MemoryError when evaluate clustering with gridsearchcv

2019-05-30 Thread lampahome
I read a large dataset into memory and it costs about 2GB of RAM (I have 4GB of RAM). Size from sys.getsizeof(train_X): *63963248* I evaluate clustering with GridSearchCV as below: def grid_search_clu(X): def cv_scorer(estimator, X): estimator.fit(X) cluster_labels = estimator.labels_ if hasattr(estimato

Re: [scikit-learn] Can I evaluate clustering efficiency incrementally?

2019-05-15 Thread lampahome
Joel Nothman wrote on Wed, May 15, 2019 at 12:16 PM: > Evaluating on large datasets is easy if the sufficient statistics are just > the contingency matrix. > > Sorry, I don't understand. Can you explain in detail? Do you mean we could evaluate on a subset of samples if the subset is contingency(normal dist

Re: [scikit-learn] Can I evaluate clustering efficiency incrementally?

2019-05-13 Thread lampahome
Uri Goren wrote on Fri, May 3, 2019 at 7:29 PM: > I usually use clustering to save costs on labelling. > I like to apply hierarchical clustering, and then label a small sample and > fine-tune the clustering algorithm. > > That way, you can evaluate the effectiveness in terms of cluster purity > (how many clus

[scikit-learn] Can I evaluate clustering efficiency incrementally?

2019-05-03 Thread lampahome
I see that some algorithms, e.g. MiniBatchKMeans and Birch, can cluster incrementally when the dataset is too large. But is there any way to evaluate incrementally? I found the silhouette coefficient and the Calinski-Harabasz index because I don't know the ground-truth labels, but they can't be evaluated incrementally.

[scikit-learn] Any other clustering algo cluster incrementally?

2019-04-30 Thread lampahome
I read this: https://scikit-learn.org/0.15/modules/scaling_strategies.html It lists only one clustering algorithm that works incrementally: mini-batch k-means. Is there any other clustering algorithm that can do this? One on GitHub is okay. Thanks.

[scikit-learn] Can cluster help me to cluster data with length of continuous series?

2019-04-03 Thread lampahome
I have data containing the access duration of each item. Ex: t0~t4 are the access time slots; 1 means the item was accessed in that slot, 0 means not. ID,t0,t1,t2,t3,t4 0,1,0,0,1 1,1,0,0,1 2,0,0,1,1 3,0,1,1,1 What I want to cluster by is the length of the continuous duration. Ex: ID=3 > 2 > 1 =
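One way to turn "length of continuous duration" into a number that a clustering algorithm can use is to compute the longest run of consecutive 1s per item. The sketch below assumes the first field of each row in the post is the ID and the remaining four fields are the 0/1 flags (the header lists five time columns, so this reading is a guess); `longest_run` is a hypothetical helper name.

```python
from itertools import groupby

def longest_run(flags):
    """Length of the longest run of consecutive 1s in a 0/1 sequence."""
    runs = (sum(1 for _ in group) for value, group in groupby(flags) if value == 1)
    return max(runs, default=0)

# Rows from the post, read as ID followed by the access flags.
rows = {0: [1, 0, 0, 1], 1: [1, 0, 0, 1], 2: [0, 0, 1, 1], 3: [0, 1, 1, 1]}
run_lengths = {item_id: longest_run(flags) for item_id, flags in rows.items()}
print(run_lengths)  # {0: 1, 1: 1, 2: 2, 3: 3}
```

Under that reading the ordering matches the one in the post (ID 3, then 2, then 1 and 0 tied), and the resulting single feature could be passed to any clustering algorithm or simply sorted.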

[scikit-learn] Can cluster based on the continuous access duration of an item?

2019-03-29 Thread lampahome
I have data containing the access duration of each item. Ex: t0~t4 are the access time slots; 1 means the item was accessed in that slot, 0 means not. ID,t0,t1,t2,t3,t4 0,1,0,0,1 1,1,0,0,1 2,0,0,1,1 3,0,1,1,1 Can clustering find the group of items that are accessed for a continuous duration? Like

[scikit-learn] How to improve mse when training regression model with month-base data?

2019-03-24 Thread lampahome
I want to predict the number of items sold on every day of a month. But the data is too large, so I train with incremental learning, e.g. sklearn.neural_network.MLPRegressor. I train on 3 months of data at a time: the 1st training uses data from Jan. to Mar., then I train from Apr. to Jun. Then I evaluate with

[scikit-learn] Any model can predict multiple trend from hierarchical data?

2019-03-17 Thread lampahome
My hierarchical data are the monthly sales numbers of 3 hot drinks and 3 cold drinks. Generally, it is better to cluster them into two groups, one containing the hot drinks and the other the cold drinks. But I don't want to cluster. When I studied sklearn.linear_model, I found the models can only predict one tren

[scikit-learn] What theory cause SGDRegressor can partial_fit but RandomForestRegressor can't?

2019-03-13 Thread lampahome
As the title says, I'm confused about why some algorithms support partial_fit and some don't. For regression models, I found SGD does but RF doesn't. Is it due to differences between the algorithms? I thought partial_fit is possible because of gradient descent, or is there another reason? Thanks.
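The gradient-descent intuition in the question can be illustrated with a short sketch: stochastic gradient descent updates the weights one sample at a time, so the same update works whether the samples arrive all at once or in batches. The synthetic stream below (y = 2x + 1 plus noise, 50 batches of 100 samples) is an assumption for illustration. A random forest, as implemented in scikit-learn, grows each tree by searching for splits over the training set it is given, which does not decompose into per-batch updates in the same way.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
model = SGDRegressor(random_state=0)

# Stream 50 batches of 100 samples; each call only nudges the existing weights.
for _ in range(50):
    X_batch = rng.normal(size=(100, 1))
    y_batch = 2.0 * X_batch[:, 0] + 1.0 + rng.normal(scale=0.1, size=100)
    model.partial_fit(X_batch, y_batch)

# The slope and intercept should end up near 2 and 1 respectively.
print(model.coef_, model.intercept_)
```

No batch is kept in memory after its `partial_fit` call, which is the property the question is asking about.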

[scikit-learn] What's are the advantages and disadvantages of incremental learning?

2019-02-26 Thread lampahome
Generally speaking, we all know incremental learning saves space. A question on Stack Overflow says the same. But what are the disadvantages? What I

[scikit-learn] Incremental learning but predict the older data not well?

2019-02-23 Thread lampahome
I tried to use SGDRegressor to train incrementally because I get new data every day. But I found that when I train with 30 days of data and then predict the result for the 1st day, the result is very different. Then I predicted the 1st to 14th days and found the results are all the same value. But res

[scikit-learn] How to deal with hierarchical and real-time analysis in machine learning?

2019-02-12 Thread lampahome
For example, I may have a huge number of different regions, and each region has more or fewer points. I also want to analyze the newest and the older data in real time, but I don't want to load the data into memory because I don't have enough memory. What I thought I could use is partial_fit to accept streaming d

[scikit-learn] How to design system if I have huge items to real time analysis?

2019-02-11 Thread lampahome
Hello, I'm figuring out a way to do real-time regression on disk block access times. But each block has its own pattern. Ex: some blocks are accessed once a month, some are accessed every day. They all have different access patterns. The question is how to predict ac

Re: [scikit-learn] Does model consider about previous training results after reloading model and then training with new data?

2019-01-31 Thread lampahome
> > > > I think the following could work if the estimators_ support partial_fit: > > voter = VotingClassifier(...) > voter.fit(...) > > For further training: > > for i in len(estimators_): > voter.estimators_[i].partial_fit(...) > > ok, maybe using Voting classifier to determine regression

Re: [scikit-learn] Does model consider about previous training results after reloading model and then training with new data?

2019-01-31 Thread lampahome
Sebastian Raschka wrote on Fri, Feb 1, 2019 at 1:48 PM: > Hi there, > > if you call the "fit" method, the learning will essentially start from > scratch. So no, it doesn't consider previous training results. However, certain algorithms are implemented with an additional partial_fit > method that would consid
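The save, reload, and continue-training workflow from this thread can be sketched as follows. The sketch assumes an estimator that offers `partial_fit` (SGDRegressor here) and uses `pickle` on an in-memory byte string in place of a file; the data are synthetic.

```python
import pickle
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
true_weights = np.array([1.0, -2.0, 0.5])
X_old = rng.normal(size=(200, 3))
y_old = X_old @ true_weights
X_new = rng.normal(size=(200, 3))
y_new = X_new @ true_weights

model = SGDRegressor(random_state=0)
model.partial_fit(X_old, y_old)

# Save and reload: the learned weights come back unchanged.
restored = pickle.loads(pickle.dumps(model))
coef_after_reload = restored.coef_.copy()

# partial_fit continues from those weights; calling fit here instead
# would start the learning from scratch on the new data only.
restored.partial_fit(X_new, y_new)
print(coef_after_reload, restored.coef_)
```

Reloading by itself changes nothing about the model; which method is called afterwards decides whether the earlier training is kept.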

[scikit-learn] Does model consider about previous training results after reloading model and then training with new data?

2019-01-31 Thread lampahome
As the title says, I'm confused. If I reload a model and train it with new data, what happens? 1st: train on old data -> save model -> reload -> train with new data. Does the 2nd training take the previous training results into account, or is it just a new result from the new data?

Re: [scikit-learn] Can y of datasets be increasing/decreasing ratio when train regression model?

2019-01-30 Thread lampahome
> but again you need to look at the distribution of y and the assumptions of > the regressor. > > So in the first, Should I plot graph to check y is distribution when X changes? I'm just thinking about how to know if it's distribution. ___ scikit-learn ma

[scikit-learn] Can y of datasets be increasing/decreasing ratio when train regression model?

2019-01-30 Thread lampahome
I found many cases on Kaggle that predict quantities or trends. They all set the real quantity as y. My question is: does anyone set the change ratio as y? Like: X y Day1 0.2 Day2 0.1 Day3 0.15 Day4 -0.1 where y is the change ratio compared with the previous day. Why anybody set
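For reference, the change-ratio target described above can be derived from raw quantities, and the quantities can be recovered from the ratios, with a few lines of plain Python. The quantities below are invented so that they reproduce the ratios 0.2, 0.1, 0.15, -0.1 from the post; the function names are hypothetical.

```python
def change_ratios(quantities):
    """Day-over-day change ratio: (today - yesterday) / yesterday."""
    return [(cur - prev) / prev for prev, cur in zip(quantities, quantities[1:])]

def rebuild_quantities(first, ratios):
    """Invert change_ratios: recover quantities from a start value and ratios."""
    values = [first]
    for ratio in ratios:
        values.append(values[-1] * (1.0 + ratio))
    return values

quantities = [100.0, 120.0, 132.0, 151.8, 136.62]  # invented daily quantities
ratios = change_ratios(quantities)
print([round(r, 4) for r in ratios])
print(rebuild_quantities(quantities[0], ratios))
```

Since the transformation is invertible given the first value, a model trained on ratios can still be used to produce quantity forecasts.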

[scikit-learn] Is there rule to determine X and y when train regression?

2019-01-29 Thread lampahome
I found many examples that predict stock prices, house prices, taxi fares, etc. The y field is almost always like this: y: the price on that day. And X may be the day, parameters that can affect the price, etc. Now I want to predict the sales of multiple items across multiple stores. Is it suitable to let the decrease/increase ratio o

Re: [scikit-learn] How to determine suitable cluster algo

2019-01-24 Thread lampahome
Maybe the suitable way is trial and error? What interests me is that my dataset is very large, and I can't try every number of clusters from 1 to N if I have N samples. That costs too much time. Maybe I should define the initial number of clusters based on execution time? Then analyze the next step

[scikit-learn] How to determine suitable cluster algo

2019-01-24 Thread lampahome
I want to build a customized clustering procedure for my datasets, because I don't want to try every algorithm and all of its hyperparameters. I thought I would just define default ranges for the important hyperparameters, e.g. the number of clusters in K-means. I want to iterate over some candidate clustering algorithms like K-means, DBSCAN, AP.

[scikit-learn] Affinity Propagation is the best algo for without choosing the number of cluster?

2019-01-23 Thread lampahome
I am searching for a clustering algorithm that forms groups without my specifying the number of groups. I found the AP algorithm, which doesn't need the number of clusters to be chosen. In my experiments, AP clusters well without choosing any parameters. But I'm not sure whether there are corner cases that would make the clustering worse. D

Re: [scikit-learn] Any clustering algo to cluster multiple timing series data?

2019-01-21 Thread lampahome
How about scaling the data first with MinMaxScaler and then clustering? My thinking is that scaling maps the values into the 0~1 range, so the magnitude of each series can be ignored. After scaling, the data show the increase/decrease ratio between points. Then clustering them by Euclidean distance should wo
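The scaling idea can be tried on the three series A, B, C from the original question in this thread. One detail worth noting: MinMaxScaler scales each column (feature), so to scale each item's series it would have to be applied to the transposed array; the sketch below does the per-row scaling by hand instead. The choice of KMeans with two clusters is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

names = ["A", "B", "C"]
series = np.array([
    [1, 5, 1, 5, 1],
    [10, 50, 10, 50, 10],
    [4, 70, 30, 10, 50],
], dtype=float)

# Scale each row (one item's series) to [0, 1]. A constant series would
# divide by zero here and would need separate handling.
row_min = series.min(axis=1, keepdims=True)
row_max = series.max(axis=1, keepdims=True)
scaled = (series - row_min) / (row_max - row_min)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
print(scaled)
print(dict(zip(names, labels)))
```

After per-row scaling, A and B become the same vector, so Euclidean distance places them in one cluster and C in the other.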

Re: [scikit-learn] Any clustering algo to cluster multiple timing series data?

2019-01-16 Thread lampahome
Mikkel Haggren Brynildsen wrote on Thu, Jan 17, 2019 at 3:07 PM: > What about dynamic time warping ? > I thought DTW is used for two datasets of different lengths, but mine have the same length. Maybe it doesn't work?

[scikit-learn] Any clustering algo to cluster multiple timing series data?

2019-01-16 Thread lampahome
Clustering algorithms cluster samples by calculating the Euclidean distance. I wonder if any clustering algorithm can cluster time series data? Ex: every item has its number sold each day. Item,Day1,Day2,Day3,Day4,Day5 A,1,5,1,5,1 B,10,50,10,50,10, C,4,70,30,10,50 The difference ratio of A and B a

[scikit-learn] Any clustering algo to cluster by the ratio of series data ?

2019-01-10 Thread lampahome
Clustering algorithms cluster samples by calculating the Euclidean distance. I wonder if any clustering algorithm can cluster series data? Ex: every item has its number sold each day. Item,Day1,Day2,Day3,Day4,Day5 A,1,5,1,5,1 B,10,50,10,50,10, C,4,70,30,10,50 The difference ratio of A and B are 500%

[scikit-learn] Does sklearn contain xgboost?

2019-01-08 Thread lampahome
As the title says: does sklearn include xgboost? Thanks.

[scikit-learn] How GridSearchCV to get best_params?

2019-01-03 Thread lampahome
As the title says. The doc says: best_params_ : dict Parameter setting that gave the best results on the hold out data. My question is: what is the hold-out data? Is it the score on the train data, the test data, or the mean of the train and test scores? Thanks.
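A quick check can make the documentation sentence concrete. In GridSearchCV, each candidate is scored on the held-out fold of every cross-validation split, and `best_params_` is the candidate with the best mean of those held-out scores (`mean_test_score` in `cv_results_`). The dataset, estimator, and grid below are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
search = GridSearchCV(Ridge(), {"alpha": [0.01, 1.0, 100.0]}, cv=5)
search.fit(X, y)

# Pick the best candidate by hand from the mean held-out (validation) scores.
mean_test = search.cv_results_["mean_test_score"]
best_by_hand = search.cv_results_["params"][int(np.argmax(mean_test))]
print(search.best_params_, best_by_hand)
```

Training-set scores are not part of this selection; they are only computed when `return_train_score=True` is requested.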

[scikit-learn] How to grab subsets from train sets when bootstrap=False in RF regressor?

2018-12-26 Thread lampahome
As the title says: an RF regressor builds each tree from a sample of the training data, i.e. bootstrap. If I set bootstrap=False, how does the model draw data? I'm interested because setting it to False lowers the MSE and MAE, which suggests False is better.

[scikit-learn] Any way to tune the parameters better than GridSearchCV?

2018-12-24 Thread lampahome
Take random forest as an example: suppose I give estimator values from 10 to 1(10, 100, 1000, 1) to grid search. Based on the result, estimator=100 is the best, but I don't know whether a value lower or greater than 100 would be better. How should I decide? Brute force, or is there a tool better than GridSearchCV? Thanks.

[scikit-learn] Does random forest work if there are very few features?

2018-12-20 Thread lampahome
I read the docs and know that tree-based models choose splits by entropy or Gini impurity. When the model creates leaf nodes, it decides based on a feature, right? Ex: I have 2 features A, B, and I split on A. So I have left and right nodes based on A. It should have the best shape if I crea

[scikit-learn] time complexity of tree-based model?

2018-12-19 Thread lampahome
I am running some benchmarks in my experiments, and I mostly use ensemble-based regressors. What is the time complexity if I use a random forest regressor? Assume I only set estimators=100 and leave the other parameters at their defaults. Thanks.

[scikit-learn] Difference between linear model and tree-based regressor?

2018-12-12 Thread lampahome
Linear models: linear regression, Lasso, Elastic Net, etc. Tree-based models: Extra-Trees regressor, random forest regressor, etc. What's the difference between them? One point I observed: 1. Linear models can extrapolate? Tree-based models can't?
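The extrapolation point raised above can be checked with a tiny experiment. The sketch trains on x = 0..10 with y = 2x (an assumed toy relationship) and then predicts at x = 20, well outside the training range.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X_train = np.arange(0, 11, dtype=float).reshape(-1, 1)   # x = 0, 1, ..., 10
y_train = 2.0 * X_train[:, 0]                            # y = 2x, largest value 20
X_far = np.array([[20.0]])                               # outside the training range

linear_pred = LinearRegression().fit(X_train, y_train).predict(X_far)[0]
tree_pred = DecisionTreeRegressor(random_state=0).fit(X_train, y_train).predict(X_far)[0]

# The linear model follows the fitted line beyond the data; the tree can
# only return an average of training targets from one of its leaves.
print(linear_pred, tree_pred)
```

The linear model continues the line to about 40, while the tree's prediction cannot exceed the largest training target. The same holds for forests, whose predictions are averages of tree predictions.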

Re: [scikit-learn] Why some regression algo can predict multiple out?

2018-12-11 Thread lampahome
Joel Nothman wrote on Tue, Dec 11, 2018 at 5:56 PM: > Yes, some can use a shared model to predict multiple outputs (ElasticNet, > DecisionTreeRegressor, MLPRegressor), others can't. Those that can't can be > trivially extended to the multiple output case with MultiOutputRegressor, > by learning each output i
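Both cases from the reply above can be seen side by side in a short sketch: DecisionTreeRegressor accepts a 2-d target directly, while SGDRegressor expects a 1-d target and is wrapped in MultiOutputRegressor, which fits one copy of the estimator per output column. The synthetic data are an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
Y = np.column_stack([X[:, 0] + X[:, 1], X[:, 2] - X[:, 0]])  # two targets

# Native multi-output support: one shared model for both targets.
native = DecisionTreeRegressor(random_state=0).fit(X, Y)

# Wrapped: one SGDRegressor is trained per target column.
wrapped = MultiOutputRegressor(SGDRegressor(random_state=0)).fit(X, Y)

print(native.predict(X[:5]).shape, wrapped.predict(X[:5]).shape)
print(len(wrapped.estimators_))
```

The wrapper treats the outputs independently, so it cannot exploit correlations between targets the way a shared model might.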

[scikit-learn] Why some regression algo can predict multiple out?

2018-12-11 Thread lampahome
As the title says: apart from sklearn.multioutput.MultiOutputRegressor, almost all regression algorithms in sklearn can only predict 1-d output. Ex: predict 1-d output sklearn.linear_model.SGDRegressor fit(X, y, coef_init=None, intercept_init=None, sample_weight=None) y : numpy array, shape (n_samples,) Ex: predict

Re: [scikit-learn] Is there regression algo with 3-d input?

2018-12-07 Thread lampahome
Stuart Reynolds wrote on Thu, Dec 6, 2018 at 12:52 PM: > Would the output be different if you simply wrapped the whole process with > reshaping 3D input to 2d? > > Sometimes it changes a lot, sometimes it is similar. Maybe a neural network is what I want?
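The reshaping suggested in the quoted reply can be sketched with NumPy. The X values are the two weeks from the original question; the second feature W and the weekly targets y are invented here purely for illustration, since the original post is cut off before describing them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two weeks, seven days each. X comes from the thread; W is made up.
X_days = np.array([[1, 2, 3, 4, 3, 2, 1], [2, 2, 3, 4, 3, 2, 2]], dtype=float)
W_days = np.array([[0, 0, 1, 1, 1, 0, 0], [0, 1, 1, 1, 1, 1, 0]], dtype=float)

X3d = np.stack([X_days, W_days], axis=-1)   # shape (weeks, days, features) = (2, 7, 2)
X2d = X3d.reshape(X3d.shape[0], -1)         # shape (2, 14): one row per week

y = np.array([10.0, 12.0])                  # invented weekly targets
model = LinearRegression().fit(X2d, y)
print(X3d.shape, X2d.shape, model.predict(X2d).shape)
```

After flattening, each week is a single row whose columns alternate between the X and W values of each day, which is the 2-d layout scikit-learn estimators expect.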

Re: [scikit-learn] Is there regression algo with 3-d input?

2018-12-05 Thread lampahome
Stuart Reynolds wrote on Thu, Dec 6, 2018 at 12:52 PM: > Would the output be different if you simply wrapped the whole process with > reshaping 3D input to 2d? > >> >> I don't know, I have no experience with it.

[scikit-learn] Is there regression algo with 3-d input?

2018-12-05 Thread lampahome
I want to do regression for weekly time series prediction, so the unit of the training data X is the day, e.g. Mon, Tue, Wed, etc. Ex: training data X looks like this: X: [ [1,2,3,4,3,2,1] ,[2,2,3,4,3,2,2] ] Each value in a row corresponds to one day of the week, so each row has 7 values. Now if I have another feature W i