Re: Google Summer of Code: Bring out your projects
On Mar 12, 2010, at 1:22 AM, Robin Anil wrote:

Shall I go and put some of the ideas up? I will do it as a whole for the project. Later we can re-assign things maybe? How does that sound? Unlike other projects we can't really go and put up a proposal like "Implement back-propagation" and expect a student to take it up and reduce things to map/reduce.

Some of the ideas (I am going to be really ambitious/vague here, but will write clear expectations or guidelines on what is an ideal proposal):

1) Implement a cool classifier over map/reduce
2) Implement a cool clustering algorithm on map/reduce
3) Implement a meta-learner to plug in to various classifiers in Mahout and have bagging, boosting support.
4) Continuous performance benchmarking/dashboard, maybe wrappers over EC2
5) Create matrix implementations with MySQL and NoSQL (HBase, Cassandra) access for all the algorithms to use.
6) Implement some of the ideas from the Netflix top 5 to boost the recommendations package
7) Visualization tool for clustering, classification or recommendation. Ability to explain (optional)
8) Improve the mahout-math package

9. Implement M/R Tika integration to take rich documents on HDFS and output Vectors. Likely not a full Summer of Work there, but could be part of some larger Utils capabilities focused on making it easier to consume Mahout. Also included: Finish ARFF compatibility.
10. Benchmark. Break the record?

I think we should still solicit ideas on list here that we can put up on JIRA.

Who is free to mentor this year? I.e. giving 5-6 hours weekly to a student, hearing them crib (sorry Ian and Isabel :P) and giving words of encouragement. And yes, code reviews.

I'm in.
Re: Google Summer of Code Proposal Submission
Grant, Thank you very much for the feedback! I'll make those changes and elaborations to my proposal very soon. Our thinking with the bi-grams is that, if we can maintain a relatively low error-rate in computing sets of similar words at a grassroots level, then we can have a powerful base case for inferring grammars on n-gram strings. But we haven't started thinking in great detail about how we'll graduate to a top-down parse. A lot of our current work is building off of research done by Lillian Lee over at Cornell. Their research, in a restricted test space of transitive verb-object noun pairs, looked at the benefits and limitations of a number of different similarity/distance measures. Based on their results, we've generalized the test space to an entire raw text corpus, and have been tweaking their measures to get better scores and be optimal over a cluster. Again, thanks for the feedback, Philip On Thu, Mar 26, 2009 at 1:11 AM, Grant Ingersoll gsing...@apache.orgwrote: Hi Philip, Thanks for the proposal. Sounds interesting. For the proposal that you submit, you should make sure to add references, details on how you plan to implement, etc. Of course, no need to do that in great depth on the wiki. Also, have you looked at going beyond just bi-grams? Not sure if it makes sense or not, but was just curious. Also, you should have a look at the Watchmaker stuff that is in Mahout already and maybe be able to address how what you are proposing relates. -Grant On Mar 25, 2009, at 7:37 PM, Philip Ramsey wrote: Hello Folks, I'm a student at The Evergreen State College and yesterday I submitted a proposal for the GSoC project to the wiki. I'm sending a link to my submission, with hopes that some of you might have feedback or questions or advice: http://wiki.apache.org/general/SoC2009/PhilipRamsey-Mahout-AlgorithmsProposal Thanks a lot, Philip Ramsey goal.oriented.des...@gmail.com
Re: Google Summer of Code Proposal Submission
Hi Philip, Thanks for the proposal. Sounds interesting. For the proposal that you submit, you should make sure to add references, details on how you plan to implement, etc. Of course, no need to do that in great depth on the wiki. Also, have you looked at going beyond just bi-grams? Not sure if it makes sense or not, but was just curious. Also, you should have a look at the Watchmaker stuff that is in Mahout already and maybe be able to address how what you are proposing relates. -Grant On Mar 25, 2009, at 7:37 PM, Philip Ramsey wrote: Hello Folks, I'm a student at The Evergreen State College and yesterday I submitted a proposal for the GSoC project to the wiki. I'm sending a link to my submission, with hopes that some of you might have feedback or questions or advice: http://wiki.apache.org/general/SoC2009/PhilipRamsey-Mahout-AlgorithmsProposal Thanks a lot, Philip Ramsey goal.oriented.des...@gmail.com
Re: Google Summer of Code
On Tuesday 22 April 2008, deneche abdelhakim wrote: So we are four students, that's cool. I wish us good work and great fun in this summer. I am really happy, we received a few slots more than expected. Welcome to the Mahout project to both of you and congratulations to the successful GSoC application. I wish you a lot of fun, working on your proposed topics and hope that all students can finish their work successfully. I think not only your individual mentors will help you, but as usual in Apache land the whole community will be happy to work with you on the mailing list. There were quite a few applications that unfortunately were not accepted. I would like to invite those who did not get selected to stick around and contribute. As Mahout is still young, it is especially easy to make a difference. So if you are interested in machine learning, we would be happy to welcome you here, even if Google does not sponsor your summer. Isabel -- Imbalance of power corrupts and monopoly of power corrupts absolutely. -- Genji |\ _,,,---,,_ Web: http://www.isabel-drost.de /,`.-'`'-. ;-;;,_ |,4- ) )-,_..;\ ( `'-' '---''(_/--' `-'\_) (fL) IM: xmpp://[EMAIL PROTECTED] signature.asc Description: This is a digitally signed message part.
Re: Google Summer of Code
Also, have a look at: http://www.apache.org/dev/ for more info. It would be helpful if all people (esp. GSOCers) who plan on contributing code file a CLA (http://www.apache.org/licenses/#clas) although it is not explicitly required, just makes things a bit nicer for us on the legal side. -Grant On Apr 22, 2008, at 7:58 AM, Grant Ingersoll wrote: Welcome aboard! We had a lot of very nice proposals, including a couple that were, unfortunately just below the cutoff. We (the ASF) had originally hoped to get more slots from Google, but they had an even bigger response from other projects as well. As it were, Mahout alone had something like 15 applicants, most of which were high quality and well-thought out. For those who didn't get selected, please do feel welcome here with the rest of us volunteers :-). To those accepted, do try to keep in mind that we should keep project discussions on the list. I think it is fine to ask mentors questions in private related to administrative stuff, but if you have questions about how to code something, etc. those are best handled on this list, as it creates a history and allows others to understand design decisions, etc. Cheers, Grant On Apr 21, 2008, at 10:28 PM, Robin Anil wrote: Hi Everyone, This is one of those days where I wake up and see that I have got accepted to GSoc with Mahout (:32-all-out:) . I am really excited to kick start the work. I know I have a lot to understand in terms of coding practices, the whole workflow/process. And i would like to congratulate and say hi to my fellow Gsoc'ers Farid, Yun and Abdel, Hi to my mentor Ian Holsman and to rest of the community. I am usually online of google talk: if you use it do add me: [EMAIL PROTECTED] Cheers and Good Day Robin
RE : Google Summer of Code
Hi Robin,

I am very happy that I've been accepted, thanks to the Mahout Community that kindly commented on my draft. So we are four students, that's cool. I wish us good work and great fun in this summer.

Hakim

Robin Anil [EMAIL PROTECTED] wrote:

Hi Everyone,

This is one of those days where I wake up and see that I have got accepted to GSoC with Mahout. I am really excited to kick start the work. I know I have a lot to understand in terms of coding practices and the whole workflow/process. And I would like to congratulate and say hi to my fellow GSoC'ers Farid, Yun and Abdel, hi to my mentor Ian Holsman and to the rest of the community. I am usually online on Google Talk: if you use it do add me: [EMAIL PROTECTED]

Cheers and Good Day
Robin
Re: Google Summer of Code
On Tuesday 25 March 2008, Josh Harguess wrote:

I have completed an application for Google Summer of Code for the implementation of the PCA algorithm in Mahout. My research is directly related to the use of PCA, so I am very familiar with that algorithm.

Great!

However, since I work in the area of pattern recognition and machine learning, I am also familiar with the other algorithms listed on your site. Since there was not a ranking of desired algorithms, I chose PCA, but if there is a more immediate need for a different algorithm, I can most likely help with that instead / as well.

I think it is fine to choose the algorithm you are most familiar with. Currently we are happy to have someone who takes care of any of the algorithms. As you have experience with any of the algorithms, you could also contribute by taking part in the discussions on the mailing lists.

Isabel

-- Only God can make random selections.
Re: Google Summer of Code
On Tuesday 25 March 2008, Marko Novakovic wrote:

Other components will be the classifier, crawler and indexer.

So it will be the typical setup: crawl web pages, classify them as positive or negative and in the end index them correctly? I would be especially interested in how the classifier will be built - as far as you can share any such knowledge on a public mailing list before September '08.

I have an idea for an architecture in which all components run on each machine.

I think the system architecture was pretty clear from the slides you sent. It would be nice if you could briefly sketch them on list, as the slides have not survived being sent to a mailing list :)

My idea for clustering would be to derive relevance from properties, like repetition of keywords on the page, relevant tags, keywords in the subject etc. Each property is allocated one axis, and from the resulting n-dimensional space the clustering machine will group pages by a suitable algorithm, in my case k-Means.

If I understood the task correctly the goal is to build a system that is capable of separating posts that express some opinion from objective ones and afterwards to group positive vs. negative postings, right? I do not yet see how the clustering algorithm k-means helps you achieve this task.

If you want I will be able to describe detailed relevance for clustering with proper examples tomorrow.

Sounds good.

Isabel

-- Life sucks, but death doesn't put out at all -- Thomas J. Kopp
Re: Google Summer of Code
On Tuesday 25 March 2008, Marko Novakovic wrote:

I attached a beta version of the presentation. I must consult with the mentor from my college to examine exactly what the role of clustering is in this system.

Hmm, one of the slides talks about using the clustering algorithm to identify new topics. I guess I still do not get the full picture. Did you happen to have a chance to look at the k-Means code in the repository yet?

Isabel

-- I don't mind arguing with myself. It's when I lose that it bothers me. -- Richard Powers
Re: Google summer of code mahout-machine-learning
On Wednesday 19 March 2008, Frédéric wrote:

Hello, I am a French student, currently studying distributed systems in Finland.

Sounds interesting. What are you working on?

To be honest I don't know all the algorithms listed in the paper.

I think it is sufficient to know at least one of them well enough to work on a scalable, at best parallel, version of it. Another option that I consider interesting is to look for some real world problem one would like to solve with machine learning and to work on the solution.

Unfortunately, I have some exams to take this week and I'm sorry for not having enough time to give you more details. But I will give you more information about my ideas and my skills related to this project as soon as possible.

Looking forward to reading more about your ideas.

Isabel

-- The two things that can get you into trouble quicker than anything else are fast women and slow horses.
Re: Google Summer of Code
The cluster will be one component of the search engine. Other components will be the classifier, crawler and indexer. I have an idea for an architecture in which all components run on each machine. Web pages will be sent to CPUs by a hash function, which will vary as working CPUs are added, removed or fail. Between the crawler and the rest of the system there will be a queue, from which pages will be scheduled by hash.

My idea for clustering would be to derive relevance from properties, like repetition of keywords on the page, relevant tags, keywords in the subject etc. Each property is allocated one axis, and from the resulting n-dimensional space the clustering machine will group pages by a suitable algorithm, in my case k-Means. If you want I will be able to describe detailed relevance for clustering with proper examples tomorrow.

Greetings

--- Isabel Drost [EMAIL PROTECTED] wrote:

On Monday 24 March 2008, Marko Novakovic wrote:

and I am interested in implementing this clustering algorithm on the Hadoop platform.

So you would like to get a distributed clustering algorithm for grouping search results? It would be nice to hear more about your approach to this problem. There are a few guys here who have been working on clustering search results already. I guess they might be able to provide some help as well. We already have a k-Means implementation, but so far it has not been integrated into a search result clustering context.

Isabel

-- Science is what happens when preconception meets verification.
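The hash-based routing of pages to machines mentioned at the top of this message could be sketched roughly as follows. This is only an illustration, not Marko's design: it assumes rendezvous (highest-random-weight) hashing as the "variable" hash function, and the worker names and page URLs are hypothetical.

```python
import hashlib
from typing import Sequence


def score(worker: str, page_url: str) -> int:
    """Deterministic pseudo-random score for a (worker, page) pair."""
    digest = hashlib.sha256(f"{worker}|{page_url}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def assign(page_url: str, workers: Sequence[str]) -> str:
    """Route a page to the live worker with the highest score (assumed scheme)."""
    if not workers:
        raise ValueError("no live workers")
    return max(workers, key=lambda w: score(w, page_url))


if __name__ == "__main__":
    workers = ["node-a", "node-b", "node-c"]  # hypothetical machine names
    pages = [f"http://example.org/page/{i}" for i in range(1000)]
    before = {p: assign(p, workers) for p in pages}
    # Simulate node-b failing: only its pages should need a new home.
    survivors = [w for w in workers if w != "node-b"]
    after = {p: assign(p, survivors) for p in pages}
    moved = sum(1 for p in pages if before[p] != after[p])
    print(f"{moved} of {len(pages)} pages were re-routed")
```

With this choice, adding or losing a machine only re-routes the pages that the affected machine wins or held, which is one way to keep the crawler queue stable while the set of working machines changes.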
RE: Google Summer of Code[esp. More Clustering]
Hi Matthew,

I've implemented a minimal, non-MR version of the algorithm below to see how it would behave. The operative code is in TestMeanShift.testMeanShift() and MeanShiftCanopy.mergeCanopy(). The rest of the MR classes are stuff I copied from Canopy, so you can ignore them.

The TestMeanShift.setUp() method builds a 100x3 point matrix that represents a 10x10 image with the diagonal intensified (i.e. a '\' character mask). Then testMeanShift() creates an initial set of 100 canopies from it and iterates over the canopies, merging their centroids into a new set of canopies until the canopy list size does not change any more. Finally it prints out the canopies that were found for each cell in the original image.

Every time two canopies come within T2 distance of each other they merge, reducing the number of canopies. The original points that were bound to each canopy are also merged so that, at the end of the iteration, the original points are available in their respective canopies.

Depending upon the values chosen for T1 and T2, the process either converges quickly or slowly. The loop terminates before actual convergence is achieved, but it does seem to cluster the input coherently. I hesitate to call this MeanShift, but it is something similar that follows the same general algorithm, as I understand it at least.

I hope you find it interesting.

Jeff

-Original Message-
From: Jeff Eastman [mailto:[EMAIL PROTECTED]
Sent: Monday, March 10, 2008 9:09 PM
To: mahout-dev@lucene.apache.org
Subject: RE: Google Summer of Code[esp. More Clustering]

Hi Matthew,

I'd like to pursue that canopy thought a little further and mix it in with your sub sampling idea. Optimizing can come later, once we figure out how to do mean-shift in M/R at all. How about this?

1. Each mean-shift iteration consists of a canopy clustering of all the points, with T1 set to the desired sampling resolution (h?) and T2 set to 1.
This will create one canopy centered on each point in the input set which contains all of its neighbors that are close enough to influence its next position in its trajectory (the window?). 2. We then calculate the centroid of each canopy (that's actually done already by the canopy cluster reducer). Is this centroid not also the weighted average you desire for the next location of the point at its center? 3. As the computation proceeds, the canopies will collapse together as their various centroids move inside the T2=1 radius. At the point when all points have converged, the remaining canopies will be the mean-shift clusters (modes?) of the dataset and their contents will be the migrated points in each cluster. 4. If each original point is duplicated as its own payload, then the iterations will produce clusters of migrated points whose payloads are the final contents of each cluster. Can you wrap your mind around this enough to validate my assumptions? Jeff -Original Message- From: Matthew Riley [mailto:[EMAIL PROTECTED] Sent: Monday, March 10, 2008 5:58 PM To: mahout-dev@lucene.apache.org Subject: Re: Google Summer of Code[esp. More Clustering] Hi Jeff- I think your basin of attraction understanding is right on. I also like your ideas for distributing the mean-shift iterations by following a canopy-style method. My intuition was a little different, and I would like to hear your ideas on it: Just to make sure we're on the same page Say we have 1 million point in our original dataset, and we want to cluster by mean-shift. At each iteration of mean-shift we subsample (say) 10,000 points from the original dataset and follow the gradient of those points to the region of highest density (and as we saw from the paper, rather than calculate the gradient itself we can equivalently compute the weighted average of our subsampled points and move the centroid to that point). 
This part seems fairly straightforward to distribute - we just send a different subsampled set to each processor and each processor returns the final centroid for that set. The problem I see is that 10,000 points (or whatever value we choose), may be too much for a single processor if we have to compute the distance to every single point when we compute the weighted mean. My thought here was to exploit the fact that we're using a kernel function (gaussian, uniform, etc.) in the weighted mean calculation and that kernel will have a set radius. Because the radius is static, it may be easy to (quickly) identify the points that we must consider in the calculation (i.e. those within the radius) by using a locality sensitive hashing scheme, tuned to that particular radius. Of course, the degree of advantage we get from this method will depend on the data itself, but intuitively I think we will usually see a dramatic improvement. Honestly, I should do more background work developing this idea, and possibly try
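The canopy-merging iteration Jeff describes at the top of this message (one canopy per point, shift each centre using its T1 neighbourhood, merge canopies that come within T2, stop when the canopy count no longer changes) might look roughly like the single-machine sketch below. It is not the code in TestMeanShift or MeanShiftCanopy: the function names, the dictionary layout, and the choice to weight each neighbour by its number of bound points are all assumptions made for illustration.

```python
import math


def weighted_mean(canopies):
    """Centre of a window of canopies, weighted by bound-point count (assumed)."""
    total = sum(len(c["members"]) for c in canopies)
    dims = len(canopies[0]["center"])
    return tuple(
        sum(len(c["members"]) * c["center"][i] for c in canopies) / total
        for i in range(dims)
    )


def shift(canopies, t1):
    """Move every canopy centre to the weighted mean of centres within T1."""
    shifted = []
    for c in canopies:
        window = [o for o in canopies
                  if math.dist(c["center"], o["center"]) <= t1]
        shifted.append({"center": weighted_mean(window),
                        "members": c["members"]})
    return shifted


def merge(canopies, t2):
    """Greedily fold each canopy into the first kept canopy within T2."""
    merged = []
    for c in canopies:
        for m in merged:
            if math.dist(c["center"], m["center"]) <= t2:
                # Keep the original points bound to the surviving canopy.
                m["members"] = m["members"] + c["members"]
                break
        else:
            merged.append({"center": c["center"],
                           "members": list(c["members"])})
    return merged


def cluster(points, t1, t2, max_iter=100):
    """Iterate shift+merge until the number of canopies stops changing."""
    canopies = [{"center": tuple(p), "members": [tuple(p)]} for p in points]
    for _ in range(max_iter):
        new = merge(shift(canopies, t1), t2)
        if len(new) == len(canopies):
            return new
        canopies = new
    return canopies


if __name__ == "__main__":
    pts = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
    for canopy in cluster(pts, t1=1.0, t2=0.5):
        print(canopy["center"], len(canopy["members"]))
```

As in the description, stopping on an unchanged canopy count can end the loop before the centres have fully converged; a stricter variant would also check how far the centres moved.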
Re: Google Summer of Code
On Mon, Mar 10, 2008 at 4:50 PM, Grant Ingersoll [EMAIL PROTECTED] wrote: Wow, maybe w/ all of our mentors we could get 2 students... neat ++ :) -- ((Anush Shetty)) ((mail AT anushshetty DOT com))
RE: Google Summer of Code[esp. More Clustering]
Hi Matthew, I'd like to pursue that canopy thought a little further and mix it in with your sub sampling idea. Optimizing can come later, once we figure out how to do mean-shift in M/R at all. How about this? 1. Each mean-shift iteration consists of a canopy clustering of all the points, with T1 set to the desired sampling resolution (h?) and T2 set to 1. This will create one canopy centered on each point in the input set which contains all of its neighbors that are close enough to influence its next position in its trajectory (the window?). 2. We then calculate the centroid of each canopy (that's actually done already by the canopy cluster reducer). Is this centroid not also the weighted average you desire for the next location of the point at its center? 3. As the computation proceeds, the canopies will collapse together as their various centroids move inside the T2=1 radius. At the point when all points have converged, the remaining canopies will be the mean-shift clusters (modes?) of the dataset and their contents will be the migrated points in each cluster. 4. If each original point is duplicated as its own payload, then the iterations will produce clusters of migrated points whose payloads are the final contents of each cluster. Can you wrap your mind around this enough to validate my assumptions? Jeff -Original Message- From: Matthew Riley [mailto:[EMAIL PROTECTED] Sent: Monday, March 10, 2008 5:58 PM To: mahout-dev@lucene.apache.org Subject: Re: Google Summer of Code[esp. More Clustering] Hi Jeff- I think your basin of attraction understanding is right on. I also like your ideas for distributing the mean-shift iterations by following a canopy-style method. My intuition was a little different, and I would like to hear your ideas on it: Just to make sure we're on the same page Say we have 1 million point in our original dataset, and we want to cluster by mean-shift. 
At each iteration of mean-shift we subsample (say) 10,000 points from the original dataset and follow the gradient of those points to the region of highest density (and as we saw from the paper, rather than calculate the gradient itself we can equivalently compute the weighted average of our subsampled points and move the centroid to that point). This part seems fairly straightforward to distribute - we just send a different subsampled set to each processor and each processor returns the final centroid for that set. The problem I see is that 10,000 points (or whatever value we choose), may be too much for a single processor if we have to compute the distance to every single point when we compute the weighted mean. My thought here was to exploit the fact that we're using a kernel function (gaussian, uniform, etc.) in the weighted mean calculation and that kernel will have a set radius. Because the radius is static, it may be easy to (quickly) identify the points that we must consider in the calculation (i.e. those within the radius) by using a locality sensitive hashing scheme, tuned to that particular radius. Of course, the degree of advantage we get from this method will depend on the data itself, but intuitively I think we will usually see a dramatic improvement. Honestly, I should do more background work developing this idea, and possibly try a matlab implementation to test the feasibility. This sounds more like a research paper than something we should dive into immediately, but I wanted to share the idea and get some feedback if anyone has thoughts... Matt On Mon, Mar 10, 2008 at 11:29 AM, Jeff Eastman [EMAIL PROTECTED] wrote: Hi Matthew, I've been looking over the mean-shift papers for the last several days. While the details of the math are still sinking in, it looks like the basic algorithm might be summarized thusly: Points in an n-d feature space are migrated iteratively in the direction of maxima in their local density functions. 
Points within a basin of attraction all converge to the same maxima and thus belong to the same cluster. A physical analogy might be(?): Gas particles in 3-space, operating with gravitational attraction but without momentum, would tend to cluster similarly. The algorithm seems to require that each point be compared with every other point. This might be taken to require each mapper to see all of the points, thus frustrating scalability. OTOH, Canopy clustering avoids this by clustering the clusters produced by the subsets of points seen by each mapper. K-means has the requirement that each point needs to be compared with all of the cluster centers, not points. It has a similar iterative structure over clusters (a much smaller, constant number) that might be employed. There is a lot of locality in the local density function window, and this could perhaps be exploited. If points could be pre-clustered (as canopy is often used to prime the k-means iterations
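The weighted-average update Matt describes above (move a point to the kernel-weighted mean of the sampled points inside a fixed radius, instead of computing the gradient) can be sketched as below. This is a minimal single-machine illustration under stated assumptions: a Gaussian kernel truncated at `radius`, a brute-force neighbour scan where the thread suggests locality sensitive hashing, and hypothetical function and parameter names.

```python
import math


def gaussian_weight(dist, bandwidth):
    """Gaussian kernel weight for a point at distance `dist` (assumed kernel)."""
    return math.exp(-0.5 * (dist / bandwidth) ** 2)


def shift_point(x, sample, bandwidth, radius):
    """One mean-shift update: kernel-weighted mean of sample points near x."""
    numerator = [0.0] * len(x)
    denominator = 0.0
    for p in sample:
        d = math.dist(x, p)
        if d > radius:
            continue  # outside the kernel window, no influence
        w = gaussian_weight(d, bandwidth)
        denominator += w
        for i, value in enumerate(p):
            numerator[i] += w * value
    if denominator == 0.0:
        return tuple(x)  # no neighbours in the window: stay put
    return tuple(v / denominator for v in numerator)


def find_mode(x, sample, bandwidth, radius, tol=1e-6, max_iter=200):
    """Repeat the update until the point moves less than `tol`."""
    for _ in range(max_iter):
        nxt = shift_point(x, sample, bandwidth, radius)
        if math.dist(x, nxt) < tol:
            return nxt
        x = nxt
    return tuple(x)


if __name__ == "__main__":
    sample = [(0, 1), (2, 1), (1, 0), (1, 2), (1, 1)]
    print(find_mode((1.2, 0.9), sample, bandwidth=1.0, radius=5.0))
```

Points whose trajectories end at the same mode would be assigned to the same cluster; the radius test is the place where a spatial index or hashing scheme would replace the linear scan.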
Re: Google Summer of Code
Hi Grant. I'll be happy to mentor someone for this project. regards Ian | A person or group responsible for review and ranking of student | applications, I'd be happy to help out here. Anyone else? Cool
Re: Google Summer of Code
Note, the deadline for project proposals is March 12. I put an item up for us at: http://wiki.apache.org/general/SummerOfCode2008 I think it is probably general enough to cover all of the bases discussed here. Please feel free to add your name to the list of mentors if you can. Perhaps we can share duties. -Grant On Mar 7, 2008, at 1:43 PM, Isabel Drost wrote: On Friday 07 March 2008, Grant Ingersoll wrote: Sounds good. I should also note that all mentoring should (barring personal conversation) should take place on the dev list. That is, decisions, discussions on what to do should be done on the list so that we all benefit from the understanding. Not that you were suggesting otherwise! Sure, after all, GSoC is about integrating students into free software projects - and making decisions offline certainly is not the way, Apache projects work. Thanks for pointing that out. Isabel
Re: Google Summer of Code
On Thursday 06 March 2008, Matthew Riley wrote: I would basically be interested in doing anything that fits in well with the overall goals of the Mahout project. Whether that is implementing well known algorithms within the Hadoop framework or working on some novel idea is up to the mentors, I presume. I would be happy with both options: Working on well known algorithms within the Hadoop framework certainly is one of our main goals. But at least me personally am also interested in providing space for novel ideas. I consider it really important for researchers to not only publish the data they experimented on but also the implementation used. If working on the latter within Mahout helps to maybe focus a little more than usual on scalability and maintainability - great. So if you have an idea that fits well with your day to day work as well as with the overall goals of Mahout that would be fine. I would guess, this makes it easier to find some spare time to work on the project ;) Isabel -- Each new user of a new system uncovers a new class of bugs. -- Kernighan |\ _,,,---,,_ Web: http://www.isabel-drost.de /,`.-'`'-. ;-;;,_ |,4- ) )-,_..;\ ( `'-' '---''(_/--' `-'\_) (fL) IM: xmpp://[EMAIL PROTECTED] signature.asc Description: This is a digitally signed message part.
Re: Google Summer of Code
What about encouraging your students to submit their work at Mahout? Just a naive thought of mine. Those students I'm in charge of have their area of interest defined already -- too late to change it. Good idea for the future, I have been thinking about it, actually. D.
Re: Google Summer of Code
On Thursday 06 March 2008, Grant Ingersoll wrote:

I think we can split the duties a bit, too.

I think the Apache FAQ also said that - according with the usual Apache way of doing things - it would be ok if the GSoC students would receive help from all community members. So the actual time spent for one mentor could very well drop to about 3h per week. Still I would not rely on that when accepting the duty to become a mentor - after all, at least officially it is the mentor who is responsible for encouraging the student.

Isabel

-- The bug stops here.
Re: Google Summer of Code
On Mar 7, 2008, at 3:08 AM, Isabel Drost wrote:

On Thursday 06 March 2008, Grant Ingersoll wrote:

I think we can split the duties a bit, too.

I think the Apache FAQ also said that - according with the usual Apache way of doing things - it would be ok if the GSoC students would receive help from all community members. So the actual time spent for one mentor could very well drop to about 3h per week. Still I would not rely on that when accepting the duty to become a mentor - after all, at least officially it is the mentor who is responsible for encouraging the student.

Sounds good. I should also note that all mentoring (barring personal conversation) should take place on the dev list. That is, decisions and discussions on what to do should be done on the list so that we all benefit from the understanding. Not that you were suggesting otherwise!

-Grant
Re: Google Summer of Code
On Friday 07 March 2008, Grant Ingersoll wrote:

Sounds good. I should also note that all mentoring (barring personal conversation) should take place on the dev list. That is, decisions and discussions on what to do should be done on the list so that we all benefit from the understanding. Not that you were suggesting otherwise!

Sure, after all, GSoC is about integrating students into free software projects - and making decisions offline certainly is not the way Apache projects work. Thanks for pointing that out.

Isabel

-- Never pay a compliment as if expecting a receipt.
Re: Google Summer of Code
I think the Mentoring Org is already setup. After March 3, mentors can register. See http://wiki.apache.org/general/SummerOfCode2008. I'm willing to mentor, but would like to share the load a bit too. -Grant On Mar 6, 2008, at 1:56 AM, Isabel Drost wrote: On Saturday 01 March 2008, Grant Ingersoll wrote: Also, any thoughts on what we might want someone to do? I think it would be great to have someone implement one of the algorithms on our wiki. Just as a general note, the deadline for applications: March 12: Mentoring organization application deadline (12 noon PDT/ 19:00 UTC). I suppose we should identify interesing tasks until that deadline. As a general guideline for mentors and for project proposals: http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors Isabel -- Better late than never. -- Titus Livius (Livy) |\ _,,,---,,_ Web: http://www.isabel-drost.de /,`.-'`'-. ;-;;,_ |,4- ) )-,_..;\ ( `'-' '---''(_/--' `-'\_) (fL) IM: xmpp://[EMAIL PROTECTED]
Re: Google Summer of Code[esp. More Clustering]
I haven't read the papers, but the big question is: do you think they can scale using M/R or some other distributed techniques? If so, feel free to write up a bit of a proposal using the info at: http://wiki.apache.org/general/SummerOfCode2008

If you are unsure, that is fine too. We could start with a simpler implementation, and then look to distribute it.

On Mar 6, 2008, at 2:45 PM, Matthew Riley wrote:

Hey Jeff-

I'm certainly willing to put some energy into developing implementations of these algorithms, and it's good to hear that you may be interested in guiding us in the right direction. Here are the references I learned the algorithms from - some are more detailed than others:

Mean-Shift clustering was introduced here and this paper is a thorough reference:
Mean-Shift: A Robust Approach to Feature Space Analysis
http://courses.csail.mit.edu/6.869/handouts/PAMIMeanshift.pdf

And here's a PDF with just guts of the algorithm outlined:
homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf

It looks like there isn't a definitive reference for the k-means approximation with randomized k-d trees, but there are promising results introduced here:
Object retrieval with large vocabularies and fast spatial matching:
http://www.robots.ox.ac.uk/~vgg/publications/papers/philbin07.pdf

And a deeper explanation of the technique here:
Randomized KD-Trees for Real-Time Keypoint Detection:
ieeexplore.ieee.org/iel5/9901/31473/01467521.pdf?arnumber=1467521

Let me know what you think.

Matt

On Thu, Mar 6, 2008 at 11:45 AM, Jeff Eastman [EMAIL PROTECTED] wrote:

Hi Matthew,

As with most open source projects, interest is mainly a function of the willingness of somebody to contribute their energy. Clustering is certainly within the scope of the project. I'd be interested in exploring additional clustering algorithms with you and your colleague. I'm a complete noob in this area and it is always enlightening to work with students who have more current theoretical exposures.
Do you have some links on these approaches that you find particularly helpful? Jeff -Original Message- From: Matthew Riley [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 05, 2008 11:11 PM To: mahout-dev@lucene.apache.org; [EMAIL PROTECTED] Subject: Re: Google Summer of Code Hey everyone- I've been watching the mailing list for a little while now, hoping to contribute once I became more familiar, but I wanted to jump in here now and express my interest in the Summer of Code project. I'm currently a graduate student in electrical engineering at UT-Austin working in computer vision, which is closely tied to many of the problems Mahout is addressing (especially in my area of content-based retrieval). What can I do to help out? I've discussed some potential Mahout projects with another student recently- mostly focused around approximate k-means algorithms (since that's a problem I've been working on lately). It sounds like you guys are already implementing canopy clustering for k-means- Is there any interest in developing another approximation algorithm based on randomized kd- trees for high dimensional data? What about mean-shift clustering? Again, I would be glad to help in any way I can. Matt On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost [EMAIL PROTECTED] drost.de wrote: On Saturday 01 March 2008, Grant Ingersoll wrote: Also, any thoughts on what we might want someone to do? I think it would be great to have someone implement one of the algorithms on our wiki. Just as a general note, the deadline for applications: March 12: Mentoring organization application deadline (12 noon PDT/19:00 UTC). I suppose we should identify interesing tasks until that deadline. As a general guideline for mentors and for project proposals: http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors Isabel -- Better late than never. -- Titus Livius (Livy) |\ _,,,---,,_ Web: http://www.isabel-drost.de /,`.-'`'-. 
;-;;,_ |,4- ) )-,_..;\ ( `'-' '---''(_/--' `-'\_) (fL) IM: xmpp://[EMAIL PROTECTED] -- Grant Ingersoll http://www.lucenebootcamp.com Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ
Re: Google Summer of Code
Hey Dawid,

> Is it information retrieval from visual data you're working on? We have recently had a presentation about a guy who implemented motion detection on GPUs with very impressive speedups (orders of magnitude compared to normal CPUs). I'm wondering if your expertise here could be used to implement map-reduce distributed jobs for running multiple GPUs in parallel. I know this sounds a bit crazy, but I've heard of bio-engineering companies doing just that -- running a cluster of GPUs to speed up their computations. Just a wild thought. Back to your proposal though.

Yes, it is basically information retrieval that I'm performing on sets of images. In fact, a lot of the best algorithms employed today for object detection, object retrieval, etc. are adaptations of basic text-retrieval approaches (e.g. tf-idf-weighted vector space models). I've personally never worked with GPUs for image processing, but I imagine the vector processing abilities would be useful at almost every stage of the indexing and retrieval processes. I would be interested in looking into those possibilities in more detail.

>> mostly focused around approximate k-means algorithms (since that's a problem I've been working on lately). It sounds like you guys are already implementing canopy clustering for k-means- Is there any interest in developing another approximation algorithm based on randomized kd-trees for high dimensional data? What about mean-shift clustering?

> From my experience the largest challenge in data clustering is not figuring out a new clustering methodology, but finding the right existing one to tackle a particular problem. Isabel mentioned the web spam detection challenge -- this is a good example of a multi-feature classification problem and I know people have tried clustering the host graph to come up with more coarse-grained features for hosts. From my own interest, a very interesting challenge is doing something like Google News does (event aggregation).
>
> This is less trivial than you might think at first -- most news stories are very similar to each other (copy/paste and editing changes), so it's trivial to find small clusters of near-clones. Then the problem becomes more difficult because all the news is about pretty much the same people/events (take the presidential election in the U.S.). I think the problems you could state here are:
>
> 1) approximating optimal clustering granularity (call it the number of clusters if you wish, although I think clustering should be driven by other factors rather than just the number of clusters),
> 2) coming up with clusters of news items based on something _other_ than keyword-based similarity. One example here is grouping news by region (geolocation), sentiment (positive/negative news), people-related news, etc.
> 3) multilingual news matching and clustering.
>
> All the above issues are on the border of different domains -- NLP, clustering, classification. The tricky part is being able to put them together. What would be of interest to you?

These are all interesting problems, actually. I've done some research into sentiment analysis, as you mentioned in (2), and I think it's still a wide open problem. Oren Etzioni at UWash does some interesting related work: www.cs.washington.edu/homes/etzioni/.

I would basically be interested in doing anything that fits in well with the overall goals of the Mahout project. Whether that is implementing well known algorithms within the Hadoop framework or working on some novel idea is up to the mentors, I presume. Personally, if I'm going to be working on something novel, I would like to relate it to my current research work... and I'm happy to discuss that with anyone on the list who is interested.

Matt

> D.

>> Again, I would be glad to help in any way I can.
>>
>> Matt
>>
>> On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost [EMAIL PROTECTED] wrote:
>>> On Saturday 01 March 2008, Grant Ingersoll wrote:
>>>> Also, any thoughts on what we might want someone to do? I think it would be great to have someone implement one of the algorithms on our wiki.
>>>
>>> Just as a general note, the deadline for applications:
>>> March 12: Mentoring organization application deadline (12 noon PDT/19:00 UTC).
>>>
>>> I suppose we should identify interesting tasks before that deadline. As a general guideline for mentors and for project proposals:
>>> http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors
>>>
>>> Isabel
>>>
>>> --
>>> Better late than never. -- Titus Livius (Livy)
>>> Web: http://www.isabel-drost.de
>>> IM: xmpp://[EMAIL PROTECTED]
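
The tf-idf-weighted vector space model Matt mentions in this message can be sketched as follows. This is a minimal illustration, not code from Mahout or from anyone on the thread. It assumes each "document" is simply a list of tokens (words for text, or quantized visual-word ids for images), uses relative term frequency times log(N/df) as the weight, and compares documents by cosine similarity. The function names `tfidf_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per document.

    Assumption: weight = (term count / document length) * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    docs = [["cat", "dog", "cat"], ["cat", "dog"], ["fish", "bird"]]
    vecs = tfidf_vectors(docs)
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Documents that share weighted terms score above zero, while documents with no terms in common score exactly zero. A term that occurs in every document gets weight zero under this particular idf formula, which is one reason real systems often smooth it.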
Re: Google Summer of Code[esp. More Clustering]
Hey Grant-

I believe scaling Mean-Shift clustering using M/R will be pretty straightforward. I'm not as sure about K-Means using KD-Trees, since I haven't personally implemented that algorithm, but since it follows K-Means fairly closely I imagine it is possible.

I'll get to work on a proposal with some of my ideas, and hopefully get some feedback from you guys during the process. Thanks for all the responses so far.

Matt

On Thu, Mar 6, 2008 at 3:25 PM, Grant Ingersoll [EMAIL PROTECTED] wrote:

I haven't read the papers, but the big question is: do you think they can scale using M/R or some other distributed technique? If so, feel free to write up a bit of a proposal using the info at: http://wiki.apache.org/general/SummerOfCode2008

If you are unsure, that is fine too. We could start with a simpler implementation, and then look to distribute it.

On Mar 6, 2008, at 2:45 PM, Matthew Riley wrote:

Hey Jeff-

I'm certainly willing to put some energy into developing implementations of these algorithms, and it's good to hear that you may be interested in guiding us in the right direction.

Here are the references I learned the algorithms from; some are more detailed than others.

Mean-Shift clustering was introduced here, and this paper is a thorough reference:
Mean-Shift: A Robust Approach to Feature Space Analysis
http://courses.csail.mit.edu/6.869/handouts/PAMIMeanshift.pdf

And here's a PDF with just the guts of the algorithm outlined:
homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf

It looks like there isn't a definitive reference for the k-means approximation with randomized k-d trees, but there are promising results introduced here:
Object retrieval with large vocabularies and fast spatial matching:
http://www.robots.ox.ac.uk/~vgg/publications/papers/philbin07.pdf

And a deeper explanation of the technique here:
Randomized KD-Trees for Real-Time Keypoint Detection:
ieeexplore.ieee.org/iel5/9901/31473/01467521.pdf?arnumber=1467521

Let me know what you think.

Matt

On Thu, Mar 6, 2008 at 11:45 AM, Jeff Eastman [EMAIL PROTECTED] wrote:

Hi Matthew,

As with most open source projects, interest is mainly a function of the willingness of somebody to contribute their energy. Clustering is certainly within the scope of the project. I'd be interested in exploring additional clustering algorithms with you and your colleague. I'm a complete noob in this area and it is always enlightening to work with students who have more current theoretical exposure.

Do you have some links on these approaches that you find particularly helpful?

Jeff

-----Original Message-----
From: Matthew Riley [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 05, 2008 11:11 PM
To: mahout-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: Google Summer of Code

Hey everyone-

I've been watching the mailing list for a little while now, hoping to contribute once I became more familiar, but I wanted to jump in here now and express my interest in the Summer of Code project. I'm currently a graduate student in electrical engineering at UT-Austin working in computer vision, which is closely tied to many of the problems Mahout is addressing (especially in my area of content-based retrieval).

What can I do to help out? I've discussed some potential Mahout projects with another student recently, mostly focused around approximate k-means algorithms (since that's a problem I've been working on lately). It sounds like you guys are already implementing canopy clustering for k-means. Is there any interest in developing another approximation algorithm based on randomized kd-trees for high dimensional data? What about mean-shift clustering?

Again, I would be glad to help in any way I can.

Matt

On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost [EMAIL PROTECTED] wrote:

On Saturday 01 March 2008, Grant Ingersoll wrote:
> Also, any thoughts on what we might want someone to do? I think it would be great to have someone implement one of the algorithms on our wiki.

Just as a general note, the deadline for applications:
March 12: Mentoring organization application deadline (12 noon PDT/19:00 UTC).

I suppose we should identify interesting tasks before that deadline. As a general guideline for mentors and for project proposals:
http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors

Isabel

--
Better late than never. -- Titus Livius (Livy)
Web: http://www.isabel-drost.de
IM: xmpp://[EMAIL PROTECTED]

--
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008
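
To make the scaling question in this message concrete, here is a minimal single-machine sketch of mean-shift written so that the per-point work is independent, which is the property that would let it be split across mappers. It is an illustration under stated assumptions, not Matt's proposal or Mahout code: it assumes a flat (uniform) kernel, Euclidean distance, and data small enough that every "mapper" can see the whole dataset. The names `shift_point`, `mean_shift`, and `group_modes` are hypothetical.

```python
import math


def shift_point(point, data, bandwidth):
    """'Map' step: move one point to the mean of the data within `bandwidth` of it."""
    neighbours = [q for q in data if math.dist(point, q) <= bandwidth]
    if not neighbours:  # guard: leave the point where it is
        return point
    dims = len(point)
    return tuple(sum(q[d] for q in neighbours) / len(neighbours) for d in range(dims))


def mean_shift(data, bandwidth, max_iter=50, tol=1e-6):
    """Shift every point toward a density mode until no point moves more than `tol`."""
    data = [tuple(p) for p in data]
    points = list(data)
    for _ in range(max_iter):
        # Each point is shifted independently of the others, so this
        # comprehension is the part that could run as parallel map tasks.
        shifted = [shift_point(p, data, bandwidth) for p in points]
        moved = max(math.dist(a, b) for a, b in zip(points, shifted))
        points = shifted
        if moved < tol:
            break
    return points


def group_modes(points, merge_radius):
    """'Reduce' step: merge converged points that lie within `merge_radius` of a mode."""
    modes, labels = [], []
    for p in points:
        for i, mode in enumerate(modes):
            if math.dist(p, mode) <= merge_radius:
                labels.append(i)
                break
        else:
            modes.append(p)
            labels.append(len(modes) - 1)
    return modes, labels


if __name__ == "__main__":
    data = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
    modes, labels = group_modes(mean_shift(data, bandwidth=1.0), merge_radius=0.5)
    print(len(modes), labels)
```

The sketch sidesteps the hard part of a distributed version: each shift needs the neighbours of a point, so a real M/R job would have to partition the data spatially or broadcast it, and that choice is not addressed here.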
Re: Google Summer of Code
On Saturday 01 March 2008, Grant Ingersoll wrote:
> Any of the other committers willing to mentor?

Could you please clarify, or point to a page that does so, what it means to become a mentor? Does anyone have any experience being a mentor? I would be happy to help, but I would rather learn a bit more about the mentor side of GSoC first.

> Also, any thoughts on what we might want someone to do? I think it would be great to have someone implement one of the algorithms on our wiki.

I think just implementing one of the algorithms might help Mahout, but it might be a bit hard to attract students to do that without some real task at hand. What about putting up tasks that solve problems, e.g. from this year's KDD cup or the web spam challenge? Then the benefit for participants would be twofold: first they would help Mahout, and second they could compete with others in the field.

Isabel

--
A man's best friend is his dogma.
Web: http://www.isabel-drost.de
IM: xmpp://[EMAIL PROTECTED]
Re: Google Summer of Code
On Wed, Mar 5, 2008 at 8:52 PM, Isabel Drost [EMAIL PROTECTED] wrote:
> On Saturday 01 March 2008, Grant Ingersoll wrote:
>> Any of the other committers willing to mentor?
>
> Could you please clarify - or point to a page that does so - about what it means to become a Mentor? Anyone have any experience being a mentor? I would be happy to help - but I would rather learn a bit more about the mentor side of GSoC

You could have a look at the GSoC pages or the FAQ, http://code.google.com/soc/2008/ and http://code.google.com/soc/2008/faqs.html respectively. Or join the #gsoc IRC channel on freenode. You could also contact some of the Google folks; they are very helpful if you have questions beyond the FAQ (watch out for lh in the IRC channel).

best regards,
simon

>> Also, any thoughts on what we might want someone to do? I think it would be great to have someone implement one of the algorithms on our wiki.
>
> I think just implementing one of the algorithms might help Mahout but it might be a bit hard to attract students to do that without some real task at hand. What about putting up tasks that solve problems e.g. from this year's KDD cup or the web spam challenge? Then the benefit for participants would be twofold - first they would help Mahout and second they could compete with others in the field.
>
> Isabel
>
> --
> A man's best friend is his dogma.
> Web: http://www.isabel-drost.de
> IM: xmpp://[EMAIL PROTECTED]
Re: Google Summer of Code
On Wednesday 05 March 2008, Simon Willnauer wrote:
> You could have a look at the FAQ or the GSoC pages http://code.google.com/soc/2008/ and http://code.google.com/soc/2008/faqs.html respectively.

Hmm, there is little about what is expected of mentors apart from the following rather general question, is there?

| 2. What is the role of a mentoring organization?

If we want to take part in GSoC, from that question, I guess we need a little more than only mentors:

| A pool of project ideas for students to choose from.

Grant already asked for ideas.

| An organization administrator to act as the project's main point of contact
| for Google;

Any volunteers?

| A person or group responsible for review and ranking of student
| applications,

I'd be happy to help out here. Anyone else?

| A person or group of people responsible for monitoring the progress of each
| accepted student and to mentor her/him as the project progresses;

+ backup

That would be the mentors Grant already mentioned.

| A written evaluation of each student participant, including how s/he worked
| with the group, whether s/he should be invited back should we do another
| Google Summer of Code, etc.

I guess this could be done by each mentor but should be reviewed by more than one person, as it looks like the evaluations are going to be highly subjective.

> Or join the #gsoc IRC channel on freenode.

Sorry, but since I work on Mahout only after work, I usually do not have the time to follow IRC channels :(

Is there anyone here who has already taken part in GSoC and could give us a little summary of her experiences? Is it possible to do the mentoring job in one's free time after work, or should one plan for more time than that?

Isabel

--
Remember kids, if there's a loaded gun in the room, be sure that you're the one holding it -- Captain Combat
Web: http://www.isabel-drost.de
IM: xmpp://[EMAIL PROTECTED]
Re: Google Summer of Code
Hey everyone-

I've been watching the mailing list for a little while now, hoping to contribute once I became more familiar, but I wanted to jump in here now and express my interest in the Summer of Code project. I'm currently a graduate student in electrical engineering at UT-Austin working in computer vision, which is closely tied to many of the problems Mahout is addressing (especially in my area of content-based retrieval).

What can I do to help out? I've discussed some potential Mahout projects with another student recently, mostly focused around approximate k-means algorithms (since that's a problem I've been working on lately). It sounds like you guys are already implementing canopy clustering for k-means. Is there any interest in developing another approximation algorithm based on randomized kd-trees for high dimensional data? What about mean-shift clustering?

Again, I would be glad to help in any way I can.

Matt

On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost [EMAIL PROTECTED] wrote:

On Saturday 01 March 2008, Grant Ingersoll wrote:
> Also, any thoughts on what we might want someone to do? I think it would be great to have someone implement one of the algorithms on our wiki.

Just as a general note, the deadline for applications:
March 12: Mentoring organization application deadline (12 noon PDT/19:00 UTC).

I suppose we should identify interesting tasks before that deadline. As a general guideline for mentors and for project proposals:
http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors

Isabel

--
Better late than never. -- Titus Livius (Livy)
Web: http://www.isabel-drost.de
IM: xmpp://[EMAIL PROTECTED]
Re: Google Summer of Code
Well, here's your chance. Make a proposal of something you would like to work on that fits with what we are doing and we'll discuss it and possibly put it up as a project. I think it would be great if anyone took on something like an M/R SVM implementation, or one of the other ones that is not already under way.

-Grant

On Mar 1, 2008, at 2:08 AM, [EMAIL PROTECTED] wrote:

>> Hi Gang,
>>
>> I think we should put in for this: http://wiki.apache.org/general/SummerOfCode2008
>>
>> I would bet there are some students interested in doing ML on Hadoop.
>
> Yes. I would be happy to work :) Didn't know that Mahout is also participating in SoC.
>
>> Any of the other committers willing to mentor? I am, but would also like some others to help out if you have the time. See http://wiki.apache.org/general/SummerOfCodeMentor .
>>
>> Thanks,
>> Grant
Re: Google Summer of Code
Also, any thoughts on what we might want someone to do? I think it would be great to have someone implement one of the algorithms on our wiki.

-Grant

On Feb 29, 2008, at 9:33 PM, Grant Ingersoll wrote:

Hi Gang,

I think we should put in for this: http://wiki.apache.org/general/SummerOfCode2008

I would bet there are some students interested in doing ML on Hadoop.

Any of the other committers willing to mentor? I am, but would also like some others to help out if you have the time. See http://wiki.apache.org/general/SummerOfCodeMentor .

Thanks,
Grant
Re: Google Summer of Code
> Hi Gang,
>
> I think we should put in for this: http://wiki.apache.org/general/SummerOfCode2008
>
> I would bet there are some students interested in doing ML on Hadoop.

Yes. I would be happy to work :) Didn't know that Mahout is also participating in SoC.

> Any of the other committers willing to mentor? I am, but would also like some others to help out if you have the time. See http://wiki.apache.org/general/SummerOfCodeMentor .
>
> Thanks,
> Grant