Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-28 Thread Olivier Grisel
2014/1/28 Arnaud Joly : > > The code looks very simple and we can reuse our sparse random > projection matrices to spare some memory and speed up projections. > > > I don’t know annoy, but could it be random projections trees > as in http://cseweb.ucsd.edu/~dasgupta/papers/rptree-stoc.pdf? This is

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-28 Thread Gilles Louppe
Given our intent to release 1.0 in the near future, I think we should also make it clear on the wiki page that adding more and more algorithms is not exactly the direction in which we are heading. Maybe this is the opportunity to remove some of the old subjects from 2013 and instead add topics foc

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-28 Thread Arnaud Joly
On 28 Jan 2014, at 15:31, Olivier Grisel wrote: > 2014/1/28 Mathieu Blondel : >> >> >> >> On Tue, Jan 28, 2014 at 9:25 PM, Olivier Grisel >> wrote: >>> >>> While vanilla LSH is an interesting baseline for Approximate Nearest >>> Neighbors search, it is often too error-prone to be practicall

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-28 Thread Olivier Grisel
2014/1/28 Olivier Grisel : > >> As always, I think the rule of thumb for inclusion in scikit-learn should be >> that the algorithm is standard in the field and has a fairly high citation >> count. Is this the case for the algorithms you mention? What are the 2 or 3 >> most famous LSH algorithms? >

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-28 Thread Olivier Grisel
2014/1/28 Mathieu Blondel : > > > > On Tue, Jan 28, 2014 at 9:25 PM, Olivier Grisel > wrote: >> >> While vanilla LSH is an interesting baseline for Approximate Nearest >> Neighbors search, it is often too error-prone to be practically >> useful. There exist alternative data-driven ANN methods tha

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-28 Thread Mathieu Blondel
On Tue, Jan 28, 2014 at 9:25 PM, Olivier Grisel wrote: > While vanilla LSH is an interesting baseline for Approximate Nearest > Neighbors search, it is often too error-prone to be practically > useful. There exist alternative data-driven ANN methods that can have > much lower error rates (depen

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-28 Thread Olivier Grisel
There is also this very interesting paper that I read a long time ago comparing vanilla LSH with k-means based hashing schemes for ANN: http://hal.inria.fr/docs/00/56/71/91/PDF/paper.pdf -- Olivier

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-28 Thread Olivier Grisel
While vanilla LSH is an interesting baseline for Approximate Nearest Neighbors search, it is often too error-prone to be practically useful. There exist alternative data-driven ANN methods that can have much lower error rates (depending on the data). Among the top implementations there are FLANN
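For readers unfamiliar with the baseline under discussion, here is a minimal sketch of "vanilla" LSH for cosine similarity using random hyperplanes. It is an illustration only, not code from the thread or from scikit-learn. The function names (`make_hyperplanes`, `hash_vector`, `build_index`, `candidates`), the 8-bit code length, and the toy vectors are all assumptions made for the example.

```python
import random


def make_hyperplanes(n_bits, n_features, seed=0):
    """Draw n_bits random Gaussian hyperplanes (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for _ in range(n_bits)]


def hash_vector(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def build_index(vectors, hyperplanes):
    """Group vector indices into buckets keyed by their hash code."""
    buckets = {}
    for i, vector in enumerate(vectors):
        buckets.setdefault(hash_vector(vector, hyperplanes), []).append(i)
    return buckets


def candidates(query, buckets, hyperplanes):
    """Return indices that share the query's bucket (may miss true neighbours)."""
    return buckets.get(hash_vector(query, hyperplanes), [])


# Toy data, made up for the example.
planes = make_hyperplanes(n_bits=8, n_features=3, seed=0)
data = [[1.0, 2.0, 3.0], [-1.0, -2.0, -3.0], [2.0, 4.0, 6.0]]
index = build_index(data, planes)
print(candidates([1.0, 2.0, 3.0], index, planes))
```

Only the query's own bucket is inspected, which is one reason the vanilla scheme is error-prone: a true neighbour that falls on the other side of a single hyperplane is missed. Using several independent hash tables is the usual way to trade memory for recall.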

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-28 Thread Arnaud Joly
You can also reduce the dimensionality using random projections. Arnaud On 28 Jan 2014, at 11:39, Nick Pentreath wrote: > Another important and related use case is to reduce the search space, for > example, in recommendation systems one often has to do the dot product, or > cosine similari
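As a rough illustration of the random-projection idea, the sketch below builds a sparse random matrix and uses it to map vectors to a lower dimension. It is written in plain Python for readability and is not scikit-learn's implementation; the helper names (`sparse_projection_matrix`, `project`), the default density of 1/3, and the toy input are assumptions for the example.

```python
import math
import random


def sparse_projection_matrix(n_components, n_features, density=1.0 / 3, seed=0):
    """Entries are +scale or -scale with probability density/2 each, else 0."""
    rng = random.Random(seed)
    # Scaling keeps squared norms unchanged in expectation.
    scale = math.sqrt(1.0 / (density * n_components))
    matrix = []
    for _ in range(n_components):
        row = []
        for _ in range(n_features):
            u = rng.random()
            if u < density / 2:
                row.append(scale)
            elif u < density:
                row.append(-scale)
            else:
                row.append(0.0)
        matrix.append(row)
    return matrix


def project(vector, matrix):
    """Map a length-n_features vector to n_components dimensions."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Toy usage: reduce a 10-dimensional vector to 4 dimensions.
matrix = sparse_projection_matrix(n_components=4, n_features=10, seed=1)
reduced = project([float(i) for i in range(10)], matrix)
print(len(reduced))
```

Because most entries are zero, a real implementation would store the matrix in a sparse format and skip the zero terms, which is where the memory and speed savings come from.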

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-28 Thread Nick Pentreath
Another important and related use case is to reduce the search space, for example, in recommendation systems one often has to do the dot product, or cosine similarity, between two vectors of moderate dimension. But you have to do this in real-time across potentially millions of candidate items. In
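To make the cost concrete, here is a sketch of the exhaustive baseline that approximate methods try to avoid: score every candidate item against the query by cosine similarity and keep the best `k`. The function names (`cosine_similarity`, `top_k`) and the toy item vectors are assumptions for illustration, not something proposed in the thread.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, items, k):
    """Brute force: one similarity per item, so the cost grows linearly with len(items)."""
    scored = ((cosine_similarity(query, vector), i)
              for i, vector in enumerate(items))
    return heapq.nlargest(k, scored)


# Toy item vectors, made up for the example.
items = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.0], items, k=2))
```

With millions of items this loop runs once per request, which is why restricting the scan to a hashed bucket of candidates is attractive.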

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-28 Thread Robert Layton
Yes, but it doesn't suffer so much at high dimensions, as compared to something like the Euclidean distance. On 28 January 2014 20:59, Joel Nothman wrote: > I have previously seen that there is interest in LSH in scikit-learn, but > don't know much about its application to machine learning. Is

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-28 Thread Joel Nothman
I have previously seen that there is interest in LSH in scikit-learn, but don't know much about its application to machine learning. Is it basically used for nearest neighbour methods? On 28 January 2014 20:48, Robert Layton wrote: > In principle, I'm happy to be a mentor for LSH, as I've used

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-28 Thread Robert Layton
In principle, I'm happy to be a mentor for LSH, as I've used it quite a bit and implemented nilsimsa in python and javascript, as well as tested a number of other algorithms. I don't know much about GSOC though. What would I need to do? On 28 January 2014 20:23, Alexandre Gramfort < alexandre.gra

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-28 Thread Alexandre Gramfort
> I like the locality-sensitive hashing idea! +1 we need to clean up the GSOC idea wiki page... Alex

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-28 Thread Vlad Niculae
I like the locality-sensitive hashing idea! Vlad On Tue Jan 28 10:04:36 2014, Nick Pentreath wrote: > This would be a great addition. > > Some ideas/code perhaps: http://nearpy.io/ > > > On Tue, Jan 28, 2014 at 10:59 AM, Mathieu Blondel wrote: > > If we have a

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-28 Thread Nick Pentreath
This would be a great addition. Some ideas/code perhaps: http://nearpy.io/ On Tue, Jan 28, 2014 at 10:59 AM, Mathieu Blondel wrote: > If we have a suitable mentor for it, locality-sensitive hashing (LSH) > would be a great GSOC subject: > http://en.wikipedia.org/wiki/Locality-sensitive_hashing

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-28 Thread Mathieu Blondel
If we have a suitable mentor for it, locality-sensitive hashing (LSH) would be a great GSOC subject: http://en.wikipedia.org/wiki/Locality-sensitive_hashing Mathieu

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-20 Thread Mathieu Blondel
On Mon, Jan 20, 2014 at 2:49 AM, Gael Varoquaux < gael.varoqu...@normalesup.org> wrote: > > In terms of setting a GSOC proposal, a few pieces of advice for you or any student > interested (this is very general, do not take it as something that > specifically applies to you): > > * Keep in mind that scikit-l

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-19 Thread Şükrü Bezen
First of all, hi everyone. As Manoj mentioned, last year I applied with my collaborative filtering idea and was not accepted, mainly because I did not commit to the project. This year I will apply again and I have a few project ideas (I won't be avoiding the commits this time). I am writing my thesis

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-19 Thread Manoj Kumar
Hi Gael, Thanks for the reply. I had posted on the list about the Gaussian Mixture Model project over here http://sourceforge.net/mailarchive/message.php?msg_id=31860906 too (your name was listed as a potential mentor). I understand that you are incredibly busy, but it would be great if you or

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-19 Thread Gael Varoquaux
Hi Manoj, Thanks a lot for your contributions to scikit-learn, and for stepping up to propose a GSOC. Let me give some high-level answers, as I am now too busy to get in the details, and we have a fantastic team that does it very well. As you have seen from the answers that you got to your email,

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-16 Thread Manoj Kumar
Thanks everyone for your quick responses. 1. Could someone point me to a list of GSoC ideas this year? 2. Is it okay if I take up projects related to ideas that have not yet been implemented? For example, a quick search tells me "Improving GMM" has not been implemented. Thanks.

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-16 Thread Olivier Grisel
2014/1/16 Joel Nothman : > There are still issues of whether this is in scikit-learn scope. For > example, does it make sense with sklearn's cross validation? Or will you > want to cross validate on both axes? Given that there is plenty of work to > be done that is well within scikit-learn's scope

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-16 Thread Joel Nothman
n be done using >>>> MRJob a hadoop-streaming wrapper for python. This is also a current field >>>> of research and I'm sure if you look into it you will find quite a lot of >>>> literature on the topic. >>>> >>>> 3> I am currently

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-16 Thread Kyle Kastner
cikit-crab which was started based upon a similar plan but I heard the >>> developers are rewriting the library currently and it might not be open to >>> the community for active development at present (not sure about this >>> though). But I just mentioned it thinking maybe if you too

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-16 Thread Manoj Kumar
and it might not be open to >> the community for active development at present (not sure about this >> though). But I just mentioned it thinking maybe if you took a look at the >> code, you would get some more ideas about what improvements could be made. >> https://

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-16 Thread Kyle Kastner
; code, you would get some more ideas about what improvements could be made. > https://github.com/muricoca/crab > > -- > *From:* Kyle Kastner [kastnerk...@gmail.com] > *Sent:* Wednesday, January 15, 2014 1:42 PM > *To:* scikit-learn-general@lists.sourceforge.net

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-16 Thread nmura...@masonlive.gmu.edu
me more ideas about what improvements could be made. https://github.com/muricoca/crab From: Kyle Kastner [kastnerk...@gmail.com] Sent: Wednesday, January 15, 2014 1:42 PM To: scikit-learn-general@lists.sourceforge.net Subject: Re: [Scikit-learn-general] Google Summe

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-16 Thread Manoj Kumar
I'm extremely sorry, that message got sent half way through. (I pressed Ctrl + Enter by mistake) X = [["ham", "spam"], ["ram", "bam", "tam"]], and y = [[2, 3], [1, -3, 4]] and we do clf.fit(X, y) Suppose we would like to predict, what we would recommend the user x who has already rated "ram" as 1
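One way to read the ragged `X` and `y` above is as a list of (user, item, rating) triples, one per known rating. The sketch below does that conversion. The helper name `to_triples` and the use of the row position as the user id are assumptions made for illustration; the message does not say how users would be identified.

```python
# X and y exactly as given in the message: one row per user.
X = [["ham", "spam"], ["ram", "bam", "tam"]]
y = [[2, 3], [1, -3, 4]]


def to_triples(X, y):
    """Flatten ragged per-user lists into (user, item, rating) triples.

    The user id is simply the row position, which is an assumption here.
    """
    triples = []
    for user, (items, ratings) in enumerate(zip(X, y)):
        if len(items) != len(ratings):
            raise ValueError("each user needs one rating per item")
        for item, rating in zip(items, ratings):
            triples.append((user, item, rating))
    return triples


for triple in to_triples(X, y):
    print(triple)
```

The triple form sidesteps the fact that the rows of `X` have different lengths, which the usual rectangular `fit(X, y)` input does not allow.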

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-16 Thread Manoj Kumar
Well, y can be 2-D too; there are estimators like MultiTaskElasticNet especially meant for multi-task y. I was thinking something along these lines. Let's say ["ham", "spam", "ram", "bam", "tam"] are the five items, and if the first user gives "ham" - 2, "spam" - 3, the second user gives "ram" - 1, "bam"

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-16 Thread Kyle Kastner
So X is the array of existing ratings; would y be a 2D array then? If not, how do you map the ratings given back to a single user (since y is typically, to my knowledge, 1D in sklearn)? I am still a little confused, but your example helped. Could you go into a little more detail on X, x, and y

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-16 Thread Manoj Kumar
Thanks for your responses. @Kyle: At the risk of sounding really naive, I'd like to make the following comments. I'm referring to this paper that Sukru had posted, http://www.stat.osu.edu/~dmsl/Sarwar_2001.pdf which is item based collaborative filtering. I don't think there is really any need for

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-15 Thread Alex Companioni
Not sure how to handle the data representation (masked arrays make sense), but you probably want to look into matrix completion. In particular, a visitor at Knewton recently discussed his experience implementing singular value projection

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-15 Thread Nick Pentreath
While I think collaborative filtering / recommendations may have a place in sklearn, it is true that the problem setting is a little different from most of the sklearn models. You may want to take a look into mrec (https://github.com/mendeley/mrec) where many well established CF approaches are imp

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-15 Thread Kyle Kastner
I looked into this once upon a time, and one of the key problems (from talking to Jake IIRC) is how to handle the "missing values" in the input array. You would either need a mask, or some kind of indexing system for describing which value goes where in the input matrix. Either way, this extra argu
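The mask option mentioned above can be sketched as a dense ratings matrix paired with a boolean matrix of the same shape that records which entries were actually observed. This is only an illustration in plain Python; the function name `to_dense_with_mask`, the 0.0 placeholder for missing ratings, and the example ratings (borrowed from the ham/spam example elsewhere in this thread) are assumptions.

```python
def to_dense_with_mask(triples, n_users, items):
    """Build (values, observed) matrices from (user, item, rating) triples.

    observed[u][j] is True only where a rating exists; the 0.0 stored in
    values for unobserved cells is a placeholder, not a rating.
    """
    column = {item: j for j, item in enumerate(items)}
    values = [[0.0] * len(items) for _ in range(n_users)]
    observed = [[False] * len(items) for _ in range(n_users)]
    for user, item, rating in triples:
        j = column[item]
        values[user][j] = float(rating)
        observed[user][j] = True
    return values, observed


items = ["ham", "spam", "ram", "bam", "tam"]
triples = [(0, "ham", 2), (0, "spam", 3),
           (1, "ram", 1), (1, "bam", -3), (1, "tam", 4)]
values, observed = to_dense_with_mask(triples, n_users=2, items=items)
print(values)
print(observed)
```

Any estimator consuming this representation would need the second matrix as an extra input, which illustrates the extra-argument concern raised in the message.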

[Scikit-learn-general] Google Summer of Code 2014

2014-01-15 Thread Manoj Kumar
Hello, First of all, thanks to the scikit-learn community for guiding new developers. I'm thankful for all the help that I've got with my Pull Requests till now. I hope that this is the right place to discuss GSoC related ideas (I've idled on the scikit-learn IRC channel on quite a few occasions