[scikit-learn] help-Renaming features in Scikit-learn's CountVectorizer()

2018-03-05 Thread Ranjana Girish
Hi all,

I have a very large pandas dataframe. Below is a sample:

   Id  description
   1   switvch for air conditioner transformer..
   2   control tfrmr...
   3   coling pad.
   4   DRLG machine
   5   hair smothing kit...

For further processing, I will construct a document-term matrix of the above
data using scikit-learn's CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer()
documenttermmatrix = countvec.fit_transform(dataset['description'])

I have to correct misspelled features in the description column. Replacing
wrongly spelled words with correctly spelled ones across such a large dataset
takes a very long time.

So I thought of correcting the features using the feature list from the
vectorizer, obtained with:

features_names = countvec.get_feature_names()

Is it possible to rename features using the above list and then use them in
the classification process?

Thanks
Ranjana
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] Need help in dealing with large dataset

2018-03-05 Thread CHETHAN MURALI
Dear All,

I am working on building a CNN model for an image classification problem.
As part of it I have converted all my test images to a NumPy array.

Now when I am trying to split the array into training and test sets, I am
getting a memory error.
Details are as below:

X = np.load("./data/X_train.npy", mmap_mode='r')
train_pct_index = int(0.8 * len(X))
X_train, X_test = X[:train_pct_index], X[train_pct_index:]
X_train = X_train.reshape(X_train.shape[0], 256, 256, 3)

X_train = X_train.astype('float32')
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input> in <module>()
      2 print("Normalizing Data")
      3
----> 4 X_train = X_train.astype('float32')

More information:

1. My Python version:

   $ python --version
   Python 3.6.4 :: Anaconda custom (64-bit)

2. I am running the code on Ubuntu 16.04.

3. I have 32 GB RAM.

4. The X_train.npy file that I load into the array is 20 GB.

print("X_train Shape: ", X_train.shape)
X_train Shape:  (85108, 256, 256, 3)
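For context, a quick back-of-the-envelope calculation (assuming the images are stored as uint8, which the ~20 GB file size suggests) shows why the float32 cast fails on a 32 GB machine:

```python
# Rough memory estimate for the array above (uint8 assumption is illustrative).
shape = (85108, 256, 256, 3)
n_elements = 85108 * 256 * 256 * 3          # ~16.7 billion values

uint8_gb = n_elements * 1 / 1e9             # 1 byte per element
float32_gb = n_elements * 4 / 1e9           # 4 bytes per element

print(round(uint8_gb, 1))    # 16.7  -- roughly the file size on disk
print(round(float32_gb, 1))  # 66.9  -- more than twice the 32 GB of RAM
```

So even though the memory-mapped load succeeds, materializing the full float32 copy cannot fit in RAM.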

I would be really glad if you can help me to overcome this problem.

Regards,
-
Chethan


Re: [scikit-learn] Need help in dealing with large dataset

2018-03-05 Thread Guillaume LemaƮtre
If you work with a deep net, you should check the utilities of the
deep-learning library itself.
For instance, in Keras you can create a batch generator if you need to
deal with a large dataset.
In PyTorch you can use the DataLoader together with ImageFolder from
torchvision, which manage the loading for you.
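The batch-generator idea can be sketched framework-agnostically; here is a minimal version over a memory-mapped .npy file (the file name, batch size, and /255 normalization are illustrative assumptions, not any library's API):

```python
import numpy as np

def batch_generator(path, batch_size=32, dtype='float32'):
    """Yield normalized batches from a large .npy file without loading it fully.

    mmap_mode='r' keeps the array on disk; only the slices that are
    accessed get read into memory, one batch at a time.
    """
    X = np.load(path, mmap_mode='r')
    for start in range(0, len(X), batch_size):
        # Copy just this slice into RAM and convert it to float32.
        batch = np.asarray(X[start:start + batch_size], dtype=dtype) / 255.0
        yield batch

# Demonstration with a small synthetic array standing in for X_train.npy:
np.save('demo_X.npy',
        np.random.randint(0, 256, (100, 8, 8, 3), dtype=np.uint8))
n_batches = sum(1 for _ in batch_generator('demo_X.npy', batch_size=32))
print(n_batches)  # 4 batches: 32 + 32 + 32 + 4
```

A generator like this can be wrapped by Keras's fit_generator-style training loops or iterated manually.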



-- 
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/


Re: [scikit-learn] Need help in dealing with large dataset

2018-03-05 Thread Sebastian Raschka
Like Guillaume suggested, you don't want to load the whole array into memory if 
it's that large. There are many different ways to deal with this. The most 
naive would be to break up your NumPy array into smaller arrays and load them 
iteratively, keeping a running accuracy calculation. My suggestion would be to 
create an HDF5 file from the NumPy array where each entry is an image. If it's 
just the test images, you can also save a batch of them per entry, because you 
don't need to shuffle them anyway.
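That naive chunked approach with a running accuracy count can be sketched as follows (the function name, file name, and toy "model" are illustrative, not scikit-learn API):

```python
import numpy as np

def chunked_accuracy(X_path, y, predict_fn, chunk_size=1000):
    """Evaluate in chunks so only `chunk_size` images sit in RAM at once."""
    X = np.load(X_path, mmap_mode='r')
    correct = 0
    for start in range(0, len(X), chunk_size):
        # Convert only this chunk to float32, never the whole array.
        chunk = np.asarray(X[start:start + chunk_size], dtype='float32')
        preds = predict_fn(chunk)
        correct += int(np.sum(preds == y[start:start + chunk_size]))
    return correct / len(X)

# Toy demonstration: a trivial "model" whose predictions match the
# labels by construction, applied to a small synthetic test array.
rng = np.random.default_rng(0)
X_small = rng.integers(0, 256, size=(50, 4, 4, 3), dtype=np.uint8)
np.save('demo_test.npy', X_small)
toy_model = lambda c: (c[:, 0, 0, 0] > 127).astype(int)
y_true = toy_model(X_small.astype('float32'))
acc = chunked_accuracy('demo_test.npy', y_true, toy_model, chunk_size=16)
print(acc)  # 1.0
```

The same loop structure works for writing chunks into an HDF5 file instead of evaluating them.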

Ultimately, the recommendation based on the sweet spot between performance and 
convenience depends on what DL framework you use. Since this is a scikit-learn 
forum, I suppose you are using sklearn objects (although, I am not aware that 
sklearn has CNNs). The DataLoader in PyTorch is universally useful though and 
can come in handy no matter what CNN implementation you use. I have some 
examples here if that helps:

- https://github.com/rasbt/deep-learning-book/blob/master/code/model_zoo/pytorch_ipynb/custom-data-loader-celeba.ipynb
- https://github.com/rasbt/deep-learning-book/blob/master/code/model_zoo/pytorch_ipynb/custom-data-loader-csv.ipynb

Best,
Sebastian




Re: [scikit-learn] help-Renaming features in Scikit-learn's CountVectorizer()

2018-03-05 Thread Joel Nothman
You can effectively merge features through matrix multiplication: multiply
the CountVectorizer output by a sparse matrix of shape (n_features_in,
n_features_out) which has 1 where the output feature corresponds to an
input feature. Your spelling correction then consists of building this
mapping matrix.


Re: [scikit-learn] transfer-learning for random forests

2018-03-05 Thread Andreas Mueller

http://scikit-learn.org/dev/faq.html#what-are-the-inclusion-criteria-for-new-algorithms

On 02/16/2018 04:51 AM, peignier sergio wrote:


Hello,

I recently began a research project on Transfer Learning with some 
colleagues. We would like to contribute to scikit-learn by incorporating 
Transfer Learning functions for Random Forests, as described in this 
recent paper:

https://arxiv.org/abs/1511.01258

Before starting, we would like to ensure that no existing project is 
ongoing.


Thanks!

BR,

Sergio Peignier




