Hello,
I am dumping the dataset vectorized with TfidfVectorizer, the target array, and
the classifier OneVsRestClassifier(SGDClassifier(loss='log', n_iter=50,
alpha=0.1)), since I want to add it to a package. I use the joblib library from
sklearn.externals to dump the vectors. The max memory used wh
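A hedged sketch of that dump/load round-trip: in current scikit-learn, joblib is a standalone package (`import joblib`) rather than `sklearn.externals.joblib`, and SGDClassifier's `loss='log'`/`n_iter` arguments have since been renamed, so the defaults are used here to stay version-neutral. The file names and toy documents are placeholders, not from the original message.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier

# Toy corpus standing in for the poster's dataset.
docs = ["spam spam spam", "ham and eggs", "more spam here", "just ham"]
y = [1, 0, 1, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

clf = OneVsRestClassifier(SGDClassifier(alpha=0.1, random_state=0))
clf.fit(X, y)

# Dump both the fitted vectorizer and the fitted classifier.
outdir = tempfile.mkdtemp()
joblib.dump(vectorizer, os.path.join(outdir, "vectorizer.pkl"))
joblib.dump(clf, os.path.join(outdir, "clf.pkl"))

# Reload and check the round-trip works end to end.
vec2 = joblib.load(os.path.join(outdir, "vectorizer.pkl"))
clf2 = joblib.load(os.path.join(outdir, "clf.pkl"))
pred = clf2.predict(vec2.transform(["spam again"]))
print(pred.shape)
```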
Hi all,
I'm trying to write my own code for an NB classifier, just so I
could use prior distributions other than, for example, Gaussian. To
start with, I scripted something similar to the GaussianNB class in
scikit-learn (see the code below), but the two approaches give me
different results (means
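For context, a from-scratch Gaussian NB along these lines might look like the sketch below; this is a reconstruction, not the poster's code. One common source of "different results" is that scikit-learn's GaussianNB adds a small smoothing term to the per-feature variances, which a naive reimplementation like this omits.

```python
import numpy as np

class SimpleGaussianNB:
    """Minimal Gaussian naive Bayes, fit by per-class means/variances."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # Per-class feature means, variances, and class priors.
        self.theta_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.var_ = np.array([X[y == c].var(axis=0) for c in self.classes_])
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])
        return self

    def predict(self, X):
        # Joint log-likelihood: log P(c) + sum_f log N(x_f | theta, var).
        jll = []
        for i in range(len(self.classes_)):
            log_prior = np.log(self.priors_[i])
            log_lik = -0.5 * np.sum(
                np.log(2.0 * np.pi * self.var_[i])
                + (X - self.theta_[i]) ** 2 / self.var_[i],
                axis=1,
            )
            jll.append(log_prior + log_lik)
        return self.classes_[np.argmax(np.array(jll), axis=0)]

# Tiny sanity check on well-separated clusters.
X_demo = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.1]])
y_demo = np.array([0, 0, 1, 1])
model = SimpleGaussianNB().fit(X_demo, y_demo)
print(model.predict(np.array([[0.05, 0.05], [5.05, 5.05]])))
```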
Indeed, when I tried to re-run it on my Windows PC at home it also found
NaN.
The problem appears to arise when I download the data using the script, since
I tried it with the data I downloaded from the Linux server and it ran fine.
Best
On Sat, Nov 17, 2012 at 11:44 AM, Leon Palafox wrote:
> He
Hey guys,
I think I figured out the problem, well, sorta. I downloaded everything on
my Ubuntu server, and everything worked fine and dandy; the problem seems to
be when I was running on my Windows machine.
That's odd
It's on win32 and Python 2.7
Greets
On Sat, Nov 17, 2012 at 3:43 AM, Jake Van
:/ darnit. I wanted to run CARTs and Neural Nets on it >_<. though it was a
mystery to me how that would work.
On 16 November 2012 19:06, Olivier Grisel wrote:
> You can also have a look at this answer on stackoverflow for more details:
>
>
> http://stackoverflow.com/questions/12460077/possibil
You can also have a look at this answer on stackoverflow for more details:
http://stackoverflow.com/questions/12460077/possibility-to-apply-online-algorithms-on-big-data-files-with-sklearn
Read your data from the hard drive or database in chunks of ~1000
samples, for instance, and use the partial_fit method of the models that
support it, typically online linear models such as Perceptron,
SGDClassifier (or PassiveAggressiveClassifier in master).
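That chunked partial_fit loop might look like the sketch below; the chunk generator and the synthetic data/target are stand-ins for reading ~1000-sample blocks from disk or a database, not anything from the thread.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)

def iter_chunks(n_chunks=20, chunk_size=1000):
    # Stand-in for streaming chunks from the hard drive or a database.
    for _ in range(n_chunks):
        X = rng.randn(chunk_size, 10)
        y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic linear target
        yield X, y

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all classes must be declared on the first call
for X_chunk, y_chunk in iter_chunks():
    clf.partial_fit(X_chunk, y_chunk, classes=classes)

# Held-out evaluation on fresh synthetic data.
X_test = rng.randn(500, 10)
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)
print("accuracy:", clf.score(X_test, y_test))
```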
-
You can retry by replacing the sklearn/externals/joblib folder with
the joblib folder of this branch:
https://github.com/joblib/joblib/pull/44
This would also be consistent with the evaluation done here:
http://wise.io/wiserf.html
cheers,
satra
On Fri, Nov 16, 2012 at 2:25 PM, Peter Prettenhofer <peter.prettenho...@gmail.com> wrote:
> Hi,
>
> I did a quick benchmark to compare sklearn's RandomForestClassifier
> against R's randomFo
On 15 November 2012 23:20, Andreas Mueller wrote:
> [...]
> You can give GridSearchCV not only a grid but also a list of grids.
> I would go with that.
> (is that sufficiently documented?)
>
This doesn't appear to be documented (at least not at
http://scikit-learn.org/dev/modules/generated/sklearn
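For reference, passing a list of grids rather than a single grid might look like the sketch below. It uses the modern sklearn.model_selection import path (at the time of the thread this lived elsewhere) and SVC purely as an illustrative estimator; each dict in the list is searched independently, which avoids incompatible parameter combinations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# A list of grids: gamma is only searched for the rbf kernel.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1.0]},
    {"kernel": ["rbf"], "C": [0.1, 1.0], "gamma": [0.01, 0.1]},
]

search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```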
So I have ~20 GB and growing of data that I want to run some algorithms
on... how should I do so, as this is... a giant amount of data.
Besides online techniques such as partial_fit, is there any way to modify the
train method so it works on all of the data but queries ... as a stream? or
in chunks or t
Ahh.. sorry >_<. I thought I made a new thread... sigh.
On 16 November 2012 15:33, Fred Mailhot wrote:
> Check out SGDClassifier and partial_fit()...I've used these to good effect.
>
> Also, PROTIP: if you want decent help, don't piggy-back on threads that
> have nothing to do with your questio
Check out SGDClassifier and partial_fit()...I've used these to good effect.
Also, PROTIP: if you want decent help, don't piggy-back on threads that
have nothing to do with your question.
Just sayin'.
On 16 November 2012 12:23, Ronnie Ghose wrote:
> Any ideas for online learning with Scikit? I
Any ideas for online learning with Scikit? I have a data set that is > 20 GB
that I want to train on. I don't think I can do that easily, so what
should I do?
Thanks,
Shomiron Ghose
On 15 November 2012 15:45, Fred Mailhot wrote:
> Dear list,
>
> I'm using GridSearchCV to do some simple model
This is a really weird low-level error. Maybe a Python bug. I don't
have time to investigate, but if someone else can reproduce it, it would be
interesting to try to make a minimalistic reproduction script that
just uses the Python multiprocessing API.
--
If npy files do in fact work cross-platform, then I'm baffled. Any
ideas about what could be causing these NaNs in Leon's script? The
files on the website haven't been modified since they were put online.
Here's a more compact version of the NaN checking:
>>> import numpy as np
>>> data = np
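Since that snippet is cut off, here is a hedged reconstruction of what a compact NaN check might look like; the array below is a stand-in, as the actual .npy filename is not shown in the thread.

```python
import numpy as np

# Stand-in for the downloaded array (in the thread this would be np.load(...)).
data = np.array([[1.0, 2.0], [np.nan, 4.0]])

n_nan = np.isnan(data).sum()
bad_rows = np.where(np.isnan(data).any(axis=1))[0]
print("NaN count:", n_nan)
print("rows with NaN:", bad_rows)
```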
Thanks to all for the tips on GridSearch with FeatureUnion, I'll be trying
those out today. And @amueller I've been following the development of your
PR for the random sampling of param space with great interest.
But back to the initial problem...it seems that an empty input is the
cause. My raw d
On 16 November 2012 17:14, Robert Kern wrote:
> On Fri, Nov 16, 2012 at 4:03 PM, Nelle Varoquaux wrote:
> >> Hi Leon,
> >> When I run your script, I get no instances of NaN in the data.
> >>
> >> I wonder if it's a problem with storing the data as a npy file. I asked
> >> around last spring a
On Fri, Nov 16, 2012 at 4:03 PM, Nelle Varoquaux wrote:
>> Hi Leon,
>> When I run your script, I get no instances of NaN in the data.
>>
>> I wonder if it's a problem with storing the data as a npy file. I asked
>> around last spring and everybody seemed to think that the format is
>> compatible
On Fri, Nov 16, 2012 at 05:03:14PM +0100, Nelle Varoquaux wrote:
> I think numpy relies on pickle for those.
If you store only one array per file it doesn't. It uses a stable
cross-platform format.
> Saving as txt is more reliable.
But completely inefficient.
G
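A minimal sketch of the single-array round-trip through the .npy binary format (the temp-directory location is an assumption for illustration); for a plain numeric array, no pickling is involved.

```python
import os
import tempfile

import numpy as np

arr = np.arange(6, dtype=np.float64).reshape(2, 3)

# One array per file: np.save writes the portable .npy binary format.
path = os.path.join(tempfile.mkdtemp(), "arr.npy")
np.save(path, arr)

loaded = np.load(path)
print(loaded.dtype, loaded.shape)
```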
---
>
> Hi Leon,
> When I run your script, I get no instances of NaN in the data.
>
> I wonder if it's a problem with storing the data as a npy file. I asked
> around last spring and everybody seemed to think that the format is
> compatible across platforms and numpy versions, but I may be wrong. Doe
Hi Leon,
When I run your script, I get no instances of NaN in the data.
I wonder if it's a problem with storing the data as a npy file. I asked
around last spring and everybody seemed to think that the format is
compatible across platforms and numpy versions, but I may be wrong.
Does anybody