I'm also having trouble with nearest neighbors, although I'm not sure if it's
related. I'm doing a regression, and every time I see that particular
warning, one or more of my predicted values ends up as 'nan'.
I checked my inputs and none of them are nan. If I can create a small sample
that sh
I played around with this a bit: it appears to be related to a memory error.
https://gist.github.com/1666570
This fails after a few iterations. If the print statement is
uncommented, then it no longer fails.
The ball tree code uses a lot of raw memory views for speed... I'll have
a look through
2012/1/24 Blake Visin :
> Seems to have worked. Should I be concerned that it skipped 6 tests?
Nope, they are expected to be skipped.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Seems to have worked. Should I be concerned that it skipped 6 tests?
blake@blake-M4600:~/workspace/scikit-learn$ nosetests sklearn
/home/blake/workspace/scikit-learn/sklearn/cross_val.py:2: UserWarning:
sklearn.cross_val namespace is deprecated in version 0.9 and will be
removed in version 0.11.
2012/1/24 Blake Visin :
> I installed it using: sudo pip install -U scikit-learn
> pip freeze returns:
>
> scikit-learn==0.10
> numpy==1.6.1
> scipy==0.9.0
Alright, can you try to build the master and check whether you can
reproduce the test failure?
git clone https://github.com/scikit-learn/scik
I installed it using: sudo pip install -U scikit-learn
pip freeze returns:
scikit-learn==0.10
numpy==1.6.1
scipy==0.9.0
Thanks,
Blake
On Mon, Jan 23, 2012 at 3:58 PM, Olivier Grisel wrote:
> This looks like a bug in scikit-learn. Which version are you using? A
> released archive or an up to dat
This looks like a bug in scikit-learn. Which version are you using? A
released archive or an up to date clone from the master branch of the
github repo?
Also which version of numpy and scipy are installed on your box?
--
Olivier
--
2012/1/24 Jieyun Fu :
> Hi all,
>
> Is there a way to give observation weights to LogisticRegression module? I
> am referring to the weights for different observations. i.e., if we are
> feeding N samples into the regression, we should give N weights. From the
> APIs, looks like we can only give w
I am trying to get started with scikit-learn and I am following the
tutorial here:
I am running on Ubuntu 11.10, Linux 3.0.0-14-generic x86_64. I have installed
all the necessary packages listed in the tutorial and here is the output
when running nosetests sklearn:
blake@blake-M4600:~/workspace/sci
Hi all,
Is there a way to give observation weights to the LogisticRegression module? I
am referring to the weights for different observations, i.e., if we are
feeding N samples into the regression, we should give N weights. From the
APIs, it looks like we can only give weights based on the classes.
Thanks
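For reference, a minimal sketch of the per-class weighting the API does expose
(the dataset and weight values are illustrative; recent scikit-learn releases
also accept a sample_weight argument in fit()):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

# Per-class weights are passed through the constructor.
clf = LogisticRegression(class_weight={0: 1.0, 1: 2.0})
clf.fit(X, y)

# Newer releases additionally support per-observation weights:
# clf.fit(X, y, sample_weight=w)   # w is an array of N observation weights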
Sorry, I forgot the link. I am following the tutorial here:
http://scikit-learn.github.com/scikit-learn-tutorial/setup.html#install-scikit-learn-build-dependencies
On Mon, Jan 23, 2012 at 3:21 PM, Blake Visin wrote:
> I am trying to get started with scikit-learn and I am following the
> tutorial h
On 01/23/2012 10:38 PM, Andreas wrote:
> Hi everybody.
> I created a gist to illustrate the behavior:
> https://gist.github.com/1665623
> This reproduces a warning that is quite weird.
>
> After trying it out for some time, I found
> the following fun fact:
> It behaves only like it does when the m
Hi everybody.
I created a gist to illustrate the behavior:
https://gist.github.com/1665623
This reproduces a warning that is quite weird.
After trying it out for some time, I found
the following fun fact:
It only behaves this way when the manifold
module is imported.
So if you remove the manif
Hi.
I don't know much about the modules that are involved
here but this looks like a bug to me.
I can reproduce the behavior you observe and am
looking into it.
I think Jake will be able to tell you more about this.
Cheers,
Andy
On 01/23/2012 08:15 PM, Alejandro Weinstein wrote:
> Hi:
>
> When
Hi:
When I run manifold.LocallyLinearEmbedding (using sklearn 0.10), as in
the following code,
###
from sklearn import manifold, datasets
n_points = 1000
n_neighbors = 10
out_dim = 2
X, _ = datasets.samples_generator.ma
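The snippet above is cut off by the archive; it appears to follow the standard
locally linear embedding example, a runnable sketch of which (not necessarily
the poster's exact code; out_dim was later renamed n_components) is:

from sklearn import manifold, datasets

n_points = 1000
n_neighbors = 10
out_dim = 2
# S-curve toy data, as in the standard LLE example of that era.
X, color = datasets.samples_generator.make_s_curve(n_points)
Y = manifold.LocallyLinearEmbedding(n_neighbors, out_dim).fit_transform(X)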
2012/1/23 Dimitrios Pritsos :
>
> However, when I do the same test using partial_fit() for the same
> sub-set of my Data Set (see above) I am getting ~20%.
>
> Any Suggestions?
Do a grid search to find the best alpha on SGDClassifier (and on C for
the LinearSVC classifier). For instance:
>>> from
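The example is cut off in the archive; a minimal sketch of such a grid search
(dataset, parameter range and module path are illustrative, using the current
scikit-learn layout):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in 0.10-era releases

digits = load_digits()
X, y = digits.data, digits.target

# Log-spaced grid over the regularization strength alpha.
param_grid = {'alpha': 10.0 ** np.arange(-7, 0)}
search = GridSearchCV(SGDClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)

The same pattern applies to C for LinearSVC.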
On 01/23/2012 07:01 PM, Dimitrios Pritsos wrote:
> On 01/23/2012 06:59 PM, Andreas wrote:
>> On 01/23/2012 05:54 PM, Dimitrios Pritsos wrote:
>>> Relate to LinearSVC() and SGDClassifier()
>>>
>>> I ran both with a subset of my 33k-samples by 30k-features and I am
>>> getting a huge difference in re
2012/1/23 Mathieu Blondel :
> On Tue, Jan 24, 2012 at 2:47 AM, Olivier Grisel
> wrote:
>
>> Tests are fine on numpy 1.5.1 and scipy 0.10.0:
>>
>> https://jenkins.shiningpanda.com/scikit-learn/job/python-2.7-numpy-1.5.1-scipy-0.10.0/
>>
>> Maybe a 1.6.1 specific issue? If this is a rounding issue t
On Tue, Jan 24, 2012 at 2:47 AM, Olivier Grisel
wrote:
> Tests are fine on numpy 1.5.1 and scipy 0.10.0:
>
> https://jenkins.shiningpanda.com/scikit-learn/job/python-2.7-numpy-1.5.1-scipy-0.10.0/
>
> Maybe a 1.6.1 specific issue? If this is a rounding issue triggering a
> classification switch on
2012/1/23 Alejandro Weinstein :
> Hi:
>
> I am trying to install the latest version of scikit-learn (59db66...).
> I cloned the repository, and typed 'make'. One of the unit tests is
> failing:
>
> ==
> FAIL: sklearn.tests.test_mul
Hi:
I am trying to install the latest version of scikit-learn (59db66...).
I cloned the repository, and typed 'make'. One of the unit tests is
failing:
==
FAIL: sklearn.tests.test_multiclass.test_ovr_fit_predict
-
On 01/23/2012 06:59 PM, Andreas wrote:
> On 01/23/2012 05:54 PM, Dimitrios Pritsos wrote:
>> Relate to LinearSVC() and SGDClassifier()
>>
>> I ran both with a subset of my 33k-samples by 30k-features and I am
>> getting a huge difference in results. Is this expected behaviour?
>>
>> After 10-fold-cr
On 01/23/2012 05:54 PM, Dimitrios Pritsos wrote:
> Relate to LinearSVC() and SGDClassifier()
>
> I ran both with a subset of my 33k-samples by 30k-features and I am
> getting a huge difference in results. Is this expected behaviour?
>
> After 10-fold-cross-validation (using the Defaults as arguments
Related to LinearSVC() and SGDClassifier():
I ran both with a subset of my 33k samples by 30k features and I am
getting a huge difference in results. Is this expected behaviour?
After 10-fold cross-validation (using the defaults as arguments in both
cases) I am getting:
Accuracy = 44% for SGD
Ac
On Mon, Jan 23, 2012 at 05:48:21PM +0100, Olivier Grisel wrote:
> I am not talking about adding a dependency on a redis client library in
> scikit-learn but just about making it possible to pass a "vocabulary"
> argument to the vectorizer that has the same behavior as a Python
> defaultdict but would use a r
2012/1/23 Gael Varoquaux :
> On Mon, Jan 23, 2012 at 05:27:10PM +0100, Olivier Grisel wrote:
>> Alternatively we could make a vocabulary dict implementation
>> based on a redis server.
>
> That's two mails in a row suggesting to bind the scikit with an advanced
> persistence engine: first Dimitrios
On Tue, Jan 24, 2012 at 1:38 AM, Mathieu Blondel wrote:
> Indeed, combined with your hashing text vectorizer, this will allow to
> cache the extracted features and thus make several epochs over the
> dataset (each epoch being broken down into several calls to
> partial_fit).
Actually, one call t
On Tue, Jan 24, 2012 at 1:27 AM, Olivier Grisel
wrote:
> I agree although this would be really useful once I am done with the
> hashing text vectorizer. Otherwise the vocabulary dict will explode in
> memory.
Indeed, combined with your hashing text vectorizer, this will make it
possible to cache the extrac
On Mon, Jan 23, 2012 at 05:27:10PM +0100, Olivier Grisel wrote:
> Alternatively we could make a vocabulary dict implementation
> based on a redis server.
That's two mails in a row suggesting to bind the scikit with an advanced
persistence engine: first Dimitrios suggesting to persist to pytables,
2012/1/23 Mathieu Blondel :
> We need a dump utility to incrementally append data to a mem-mapped
> array or csr matrix. This way, people would be able to do their
> feature extraction in an iterator and create the array / matrix
> incrementally.
I agree although this would be really useful once I
On 01/23/2012 06:14 PM, Mathieu Blondel wrote:
> We need a dump utility to incrementally append data to a mem-mapped
> array or csr matrix. This way, people would be able to do their
> feature extraction in an iterator and create the array / matrix
> incrementally.
>
> Mathieu
>
I will implement a
2012/1/23 Mathieu Blondel :
> On Tue, Jan 24, 2012 at 12:15 AM, Olivier Grisel
> wrote:
>
>> LSH is just using binary thresholded random projections in 32 (or 64
>> or 128...) dim space. That leads to 32bit (or 64bit...) vectors
>> castable as integers and doing Hamming radius queries instead of
We need a dump utility to incrementally append data to a mem-mapped
array or csr matrix. This way, people would be able to do their
feature extraction in an iterator and create the array / matrix
incrementally.
Mathieu
--
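A rough sketch of the dense half of that idea, assuming the chunks come from a
feature-extraction generator (the sparse/CSR case would need the same
treatment applied to the data, indices and indptr arrays):

import numpy as np

def dump_chunks_memmapped(filename, chunk_iter, dtype=np.float64):
    # Append 2D chunks (same number of columns) to a raw file, then memory-map it.
    n_rows, n_cols = 0, None
    with open(filename, "wb") as f:
        for chunk in chunk_iter:
            chunk = np.ascontiguousarray(chunk, dtype=dtype)
            if n_cols is None:
                n_cols = chunk.shape[1]
            chunk.tofile(f)            # raw C-ordered bytes, appended in order
            n_rows += chunk.shape[0]
    # Re-open the file as a read-only memory-mapped array; nothing is loaded in RAM.
    return np.memmap(filename, dtype=dtype, mode="r", shape=(n_rows, n_cols))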
On Tue, Jan 24, 2012 at 12:15 AM, Olivier Grisel
wrote:
> LSH is just using binary thresholded random projections in 32 (or 64
> or 128...) dim space. That leads to 32bit (or 64bit...) vectors
> castable as integers and doing Hamming radius queries instead of
> Euclidean queries in that boolean
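A minimal sketch of that scheme (sizes and data are illustrative): threshold 32
random projections to get a 32-bit signature per sample, pack the bits into one
integer, and query by Hamming distance on the signatures:

import numpy as np

rng = np.random.RandomState(0)
n_features, n_bits = 100, 32
planes = rng.randn(n_features, n_bits)        # random projection directions

def signatures(X):
    # The sign of each projection gives one bit; pack the bits into an integer.
    bits = (np.dot(X, planes) > 0).astype(np.uint64)
    return (bits << np.arange(n_bits, dtype=np.uint64)).sum(axis=1)

def hamming(a, b):
    return bin(int(a) ^ int(b)).count("1")

X = rng.randn(1000, n_features)
query = rng.randn(1, n_features)
sig_X, sig_q = signatures(X), signatures(query)[0]
# Candidate neighbours: samples whose signature is within a small Hamming radius.
candidates = [i for i, s in enumerate(sig_X) if hamming(s, sig_q) <= 4]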
Olivier Grisel wrote:
> +1 for the dense case
>
> But ball tree does not work for high dim sparse data.
>
I'm working on that - I hope to have a pull request within the next few
weeks.
> We would also need some truncated kernels (e.g. cosine similarity for
> positive data or RBF in the gener
On Mon, Jan 23, 2012 at 04:15:36PM +0100, Olivier Grisel wrote:
> 2012/1/23 Gael Varoquaux :
> > On Mon, Jan 23, 2012 at 10:08:45AM +0100, Olivier Grisel wrote:
> >> Once we have random projections (or even just efficient hashing API),
> >> LSH is quite simple to implement on top.
> > I don't unde
2012/1/23 Gael Varoquaux :
> On Mon, Jan 23, 2012 at 10:08:45AM +0100, Olivier Grisel wrote:
>> Once we have random projections (or even just efficient hashing API),
>> LSH is quite simple to implement on top.
>
> I don't understand: they are quite orthogonal, aren't they?
You can implement LSH wi
2012/1/23 Gael Varoquaux :
> On Mon, Jan 23, 2012 at 02:17:21PM +0100, Olivier Grisel wrote:
>> Hehe, that would be nice but I am afraid Gael won't let me do this as
>> part of the main scikit repository: large-scale examples mean
>> large-scale datasets ;)
>
> Why can't we just generate data? The
On 01/23/2012 04:16 PM, Gael Varoquaux wrote:
> On Mon, Jan 23, 2012 at 04:07:16PM +0200, Dimitrios Pritsos wrote:
>> '0.11-git'<- is this the latest?
> We don't know: it depends on the revision number of the git checkout.
>
> Do you have a full git checkout? If so, just do a 'git pull'.
>
> G
On Mon, Jan 23, 2012 at 04:07:16PM +0200, Dimitrios Pritsos wrote:
> '0.11-git' <- is this the latest?
We don't know: it depends on the revision number of the git checkout.
Do you have a full git checkout? If so, just do a 'git pull'.
G
--
On 01/23/2012 03:58 PM, Lars Buitinck wrote:
> 2012/1/23 Dimitrios Pritsos:
>> I guess I misunderstood something here. There is no partial_fit(). Plus
>> I haven't managed to figure out how to do the partial fit.
>>
>> I have the latest sklearn I retrieved by git. Am I missing something?
> Are you
2012/1/23 Dimitrios Pritsos :
> I guess I misunderstood something here. There is no partial_fit(). Plus
> I haven't managed to figure out how to do the partial fit.
>
> I have the latest sklearn I retrieved by git. Am I missing something?
Are you sure? SGDClassifier.partial_fit was implemented some
On 01/23/2012 03:46 PM, Dimitrios Pritsos wrote:
> On 01/23/2012 03:20 PM, Olivier Grisel wrote:
>> 2012/1/23 Dimitrios Pritsos:
>>> On 01/23/2012 02:20 PM, Lars Buitinck wrote:
2012/1/23 Dimitrios Pritsos:
> On 01/23/2012 12:24 PM, Olivier Grisel wrote:
>> BTW: what is the structure o
On 01/23/2012 03:20 PM, Olivier Grisel wrote:
> 2012/1/23 Dimitrios Pritsos:
>> On 01/23/2012 02:20 PM, Lars Buitinck wrote:
>>> 2012/1/23 Dimitrios Pritsos:
On 01/23/2012 12:24 PM, Olivier Grisel wrote:
> BTW: what is the structure of your data in PyTables? Is it mapped to a
> scipy.sp
On Mon, Jan 23, 2012 at 02:17:21PM +0100, Olivier Grisel wrote:
> Hehe, that would be nice but I am afraid Gael won't let me do this as
> part of the main scikit repository: large-scale examples mean
> large-scale datasets ;)
Why can't we just generate data? The goal is to get the idea through, no
On 01/23/2012 03:07 PM, Olivier Grisel wrote:
> 2012/1/23 Lars Buitinck:
>> 2012/1/23 Dimitrios Pritsos:
>>> I will give it a try however in some of my tests had a memory management
>>> problem. As I can recall it was mostly because of numpy function that
>>> might ask from pyTable to load every th
On Mon, Jan 23, 2012 at 11:37:16AM +0200, Dimitrios Pritsos wrote:
> So, is there any tip for me to fit() the model in stages, i.e. not to
> bring the whole data set into memory during the learning process? As I can
> see in my code, when I am giving an EArray as an argument to fit() it
> seem to l
On 01/23/2012 02:46 PM, Lars Buitinck wrote:
> 2012/1/23 Dimitrios Pritsos:
>> I will give it a try however in some of my tests had a memory management
>> problem. As I can recall it was mostly because of numpy function that
>> might ask from pyTable to load every thing in main men. I guess some
>>
On Mon, Jan 23, 2012 at 10:08:45AM +0100, Olivier Grisel wrote:
> Once we have random projections (or even just efficient hashing API),
> LSH is quite simple to implement on top.
I don't understand: they are quite orthogonal, aren't they?
Gael
2012/1/23 Dimitrios Pritsos :
> On 01/23/2012 02:20 PM, Lars Buitinck wrote:
>> 2012/1/23 Dimitrios Pritsos:
>>> On 01/23/2012 12:24 PM, Olivier Grisel wrote:
BTW: what is the structure of your data in PyTables? Is it mapped to a
scipy.sparse Compressed Sparse Row datastructure? How many f
2012/1/23 Mathieu Blondel :
> On Mon, Jan 23, 2012 at 7:24 PM, Olivier Grisel
> wrote:
>> Have a look at `sklearn.linear_model.SGDClassifier` that supports a
>> partial_fit method in master that you can call several times with
>> slices of data.
>>
>> BTW: what is the structure of your data in PyTa
On Mon, Jan 23, 2012 at 10:07 PM, Olivier Grisel
wrote:
> Indeed SVC will not scale to 50k samples, only LinearSVC will. In any
> case I found SGDClassifier (with the fit method) to be much faster
> than LinearSVC or LogisticRegression (i.e. any liblinear based
> models). And discrete naive Bayes
2012/1/23 Lars Buitinck :
> 2012/1/23 Dimitrios Pritsos :
>> I will give it a try however in some of my tests had a memory management
>> problem. As I can recall it was mostly because of numpy function that
>> might ask from pyTable to load every thing in main men. I guess some
>> loops and some sl
2012/1/23 Dimitrios Pritsos :
> I will give it a try, however in some of my tests I had a memory management
> problem. As I recall, it was mostly because of a numpy function that
> might ask PyTables to load everything into main memory. I guess some
> loops and some slicing might solve the problem.
On 01/23/2012 02:20 PM, Lars Buitinck wrote:
> 2012/1/23 Dimitrios Pritsos:
>> On 01/23/2012 12:24 PM, Olivier Grisel wrote:
>>> BTW: what is the structure of your data in PyTables? Is it mapped to a
>>> scipy.sparse Compressed Sparse Row datastructure? How many features do
>>> you have in your data
On Mon, Jan 23, 2012 at 7:24 PM, Olivier Grisel
wrote:
> Have a look at `sklearn.linear_model.SGDClassifier` that supports a
> partial_fit method in master that you can call several times with
> slices of data.
>
> BTW: what is the structure of your data in PyTables? Is it mapped to a
> scipy.spars
2012/1/23 Dimitrios Pritsos :
> On 01/23/2012 12:24 PM, Olivier Grisel wrote:
>> BTW: what is the structure of your data in PyTables? Is it mapped to a
>> scipy.sparse Compressed Sparse Row datastructure? How many features do
>> you have in your dataset?
>
> The training data are in an EArray (Compre
On 01/23/2012 12:24 PM, Olivier Grisel wrote:
> Have a look at `sklearn.linear_model.SGDClassifier` that supports a
> partial_fit method in master that you can call several times with
> slices of data.
Thanks for the ref, I will have a look right now
> BTW: what is the structure of your data in PyTa
Thanks Adrian and Andreas. My kindle is packed :)
--
Olivier
On 23/01/2012 11:34, Olivier Grisel wrote:
> 2012/1/23 Andreas:
>> On 01/23/2012 11:28 AM, Adrien wrote:
>>> Hello everyone,
>>>
>>> A quick question: why not use Nystrom instead?
>>>
>> That was on my GSoC wish list ;)
>> The application I had in mind at the moment was
>> the label propagation
On 01/23/2012 11:34 AM, Olivier Grisel wrote:
2012/1/23 Andreas:
On 01/23/2012 11:28 AM, Adrien wrote:
Hello everyone,
A quick question: why not use Nystrom instead?
That was on my GSoC wish list ;)
The application I had in mind at the moment was
the label propagation and m
2012/1/23 Andreas :
> On 01/23/2012 11:28 AM, Adrien wrote:
>> Hello everyone,
>>
>> A quick question: why not use Nystrom instead?
>>
> That was on my GSoC wish list ;)
> The application I had in mind at the moment was
> the label propagation and maybe the spectral clustering.
> In general, I thin
On 01/23/2012 11:28 AM, Adrien wrote:
> Hello everyone,
>
> A quick question: why not use Nystrom instead?
>
That was on my GSoC wish list ;)
The application I had in mind at the moment was
the label propagation and maybe the spectral clustering.
In general, I think the thresholding would work
Hello everyone,
A quick question: why not use Nystrom instead?
The effect of thresholding the kernel matrix is not very well
understood, and it makes you lose positive-definiteness (i.e. it's not a
kernel matrix anymore). It's OK for spectral clustering as the Laplacian
is always positive sem
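Nystrom later landed in scikit-learn as sklearn.kernel_approximation.Nystroem;
a minimal sketch of the approximation being suggested (data and parameters are
illustrative):

from sklearn.datasets import make_blobs
from sklearn.kernel_approximation import Nystroem

X, _ = make_blobs(n_samples=500, random_state=0)

# Low-rank RBF kernel approximation built from 50 sampled landmark points;
# the full kernel is approximated by inner products of the mapped features.
feature_map = Nystroem(kernel='rbf', gamma=0.5, n_components=50, random_state=0)
X_mapped = feature_map.fit_transform(X)       # shape (500, 50), K ~ X_mapped.dot(X_mapped.T)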
Have a look at `sklearn.linear_model.SGDClassifier` that supports a
partial_fit method in master that you can call several times with
slices of data.
BTW: what is the structure of your data in PyTables? Is it mapped to a
scipy.sparse Compressed Sparse Row datastructure? How many features do
you hav
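A minimal sketch of that out-of-core pattern (the dummy data and chunk size are
illustrative; in current releases the full set of classes has to be passed to
partial_fit):

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X_all = rng.randn(5000, 20)                   # stand-in for data streamed from PyTables
y_all = (X_all[:, 0] > 0).astype(int)
classes = np.unique(y_all)

clf = SGDClassifier(alpha=1e-4)
chunk_size = 1000
for epoch in range(5):                        # several passes over the data
    for start in range(0, X_all.shape[0], chunk_size):
        stop = start + chunk_size
        clf.partial_fit(X_all[start:stop], y_all[start:stop], classes=classes)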
I am sending it again with the correct Subject line, I am sorry
about that
Hello,
I am using sklearn in combination with PyTables for Automated Genre
Identification of Web Pages.
The reason I am using PyTables is to run a very high-scale evaluation
of SVM using 50,00
Hello,
I am using sklearn in combination with PyTables for Automated Genre
Identification of Web Pages.
The reason I am using PyTables is to run a very high-scale evaluation
of SVM using 50,000 samples for training. I know that this will
probably not have so much impact on my
On Mon, Jan 23, 2012 at 6:06 PM, Andreas wrote:
> It might be as easy as that.
> I guess I should try to see if this speeds up things.
If you use algorithm="brute", there should be no speed-up (it computes
all the distances and finds those within the given radius...). If you
use ball-tree, it sho
2012/1/23 Gael Varoquaux :
> On Mon, Jan 23, 2012 at 09:46:41AM +0100, Olivier Grisel wrote:
>> But ball tree does not work for high dim sparse data.
>
> In this case, I think that the LSH option is a good one. There is an LSH
> in pybrain that can be adapted.
Once we have random projections (or e
On 01/23/2012 08:52 AM, Alexandre Gramfort wrote:
> I am not sure it is what you want but you could use:
>
> K = radius_neighbors_graph(X, radius, mode='distance')
> K.data **= 2
> K.data *= -gamma
> np.exp(K.data, out=K.data)
>
> no?
>
> Alex
>
It might be as easy as that.
I guess I should try
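For reference, a self-contained version of the snippet quoted above (data,
radius and gamma are illustrative); it builds a sparse distance graph and turns
the stored distances in place into RBF affinities exp(-gamma * d**2):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import radius_neighbors_graph

X, _ = make_blobs(n_samples=200, random_state=0)
gamma, radius = 0.5, 3.0

K = radius_neighbors_graph(X, radius, mode='distance')   # sparse distances within radius
K.data **= 2
K.data *= -gamma
np.exp(K.data, out=K.data)                                # now holds RBF affinities

Pairs beyond the radius stay as implicit zeros, which is exactly the truncation
being discussed.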
On Mon, Jan 23, 2012 at 09:46:41AM +0100, Olivier Grisel wrote:
> But ball tree does not work for high dim sparse data.
In this case, I think that the LSH option is a good one. There is an LSH
in pybrain that can be adapted.
Gael
--
2012/1/23 Alexandre Gramfort :
> I am not sure it is what you want but you could use:
>
> K = radius_neighbors_graph(X, radius, mode='distance')
> K.data **= 2
> K.data *= -gamma
> np.exp(K.data, out=K.data)
>
> no?
+1 for the dense case
But ball tree does not work for high dim sparse data.
We w