On Mar 28, 2010, at 12:52 AM, 杨杰 wrote:

> Dear Mahout Developers,
> 
> I'm Yang Jie, an MSc student in Computer Science from China. I am eager to
> apply for the project "Implement Integration of Mahout Clustering or
> Classification with Apache Solr" [MAHOUT-343].
> 
> I am very interested in large-scale machine learning, which is also one of my
> group's research directions, and in indexing for information retrieval. That
> is why I chose large-scale topical partitioned indexing as the topic of my
> graduate dissertation. So when I found this project, I was immediately drawn
> to it. It is closely related to my own work, so I will be able to devote
> enough time to it. If I am given this opportunity, I will do my best to make
> the result as good as I can.
> 
> My main goal for this project, if I have understood the description
> correctly, is to add a classification algorithm to Solr's indexing module. My
> plan therefore focuses on that module, which means my plugin will be tested
> there first. I have read the Lucene code, run Mahout and Lucene indexing on a
> Map/Reduce cluster, and gained a preliminary understanding of Solr. What I am
> doing now is gathering information about Solr's data structures and plugin
> mechanism.
> 
> Currently, there are still a few questions on my mind:
> 
>   1. Should I implement a plugin for Solr that could handle any of the
>   classification algorithms in Mahout based on the data schema, or a plugin
>   for only one of the classification algorithms? This is what I couldn't tell
>   from the name of the project (sorry).

I think a general solution is better, but it will likely make sense in your 
planning to pick one algorithm for the first phase of your project, get it 
working, and then work with the Solr/Mahout community to generalize it.


>   2. I have now run some of Mahout's algorithms on a Map/Reduce cluster and
>   tried Solr, but I still lack further information about this project. How
>   should I get started with it?


Here are the rough details I have in my head:

1. Training of the classifier is handled offline.
2. Once trained, an UpdateProcessor is hooked into Solr such that, as each 
document comes in, the update processor takes the relevant field from the 
document, comes up with the appropriate classification(s), and adds those 
labels to a new, configurable field to be indexed. A rough sketch of such a 
processor follows this list.
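
To make step 2 concrete, here is a minimal sketch, not a finished design. The 
Solr extension points (UpdateRequestProcessorFactory, UpdateRequestProcessor, 
AddUpdateCommand, SolrInputDocument) are real, though package locations shift 
slightly between Solr versions; the DocumentClassifier interface, the field 
names, and the stub label are hypothetical placeholders for whatever 
offline-trained Mahout model ends up being used.

import java.io.IOException;
import java.util.Collections;
import java.util.List;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class ClassificationUpdateProcessorFactory extends UpdateRequestProcessorFactory {

  // Hypothetical stand-in for a classifier backed by an offline-trained Mahout model.
  public interface DocumentClassifier {
    List<String> classify(String text);
  }

  private String inputField;   // e.g. "body": the field whose text gets classified
  private String outputField;  // e.g. "category": the field that receives the labels
  private DocumentClassifier classifier;

  @Override
  public void init(NamedList args) {
    inputField = (String) args.get("inputField");
    outputField = (String) args.get("outputField");
    // A real implementation would load the trained Mahout model here (e.g. from a
    // path configured in solrconfig.xml); this stub just returns a fixed label.
    classifier = text -> Collections.singletonList("unclassified");
  }

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object text = doc.getFieldValue(inputField);
        if (text != null) {
          // Classify the document text and add each predicted label to the output field.
          for (String label : classifier.classify(text.toString())) {
            doc.addField(outputField, label);
          }
        }
        super.processAdd(cmd);  // hand the augmented document on down the chain
      }
    };
  }
}

The factory would then be registered in solrconfig.xml as part of an 
updateRequestProcessorChain attached to the update handler, with inputField, 
outputField, and the model location supplied as init args.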

Adding a request handler that could kick off training, manage the model, etc. 
would also be useful (a rough skeleton is sketched below). Think about what you 
can get done in the time frame given.
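
As a very rough, clearly hypothetical skeleton of that idea: only 
RequestHandlerBase and its handleRequestBody/getDescription hooks are Solr's 
(older Solr releases require a few more overrides such as getSource/getVersion); 
the handler's name, its action parameter, and the training/reload steps are 
placeholders for whatever the project ends up needing.

import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;

// Hypothetical handler an admin could hit (e.g. at /classifier) to retrain or
// reload the model; the actual training would still run offline (e.g. on Hadoop),
// this just kicks it off and reports status.
public class ClassifierAdminHandler extends RequestHandlerBase {

  @Override
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
    String action = req.getParams().get("action", "status");  // e.g. train | reload | status
    if ("train".equals(action)) {
      // submit the offline Mahout training job here (details omitted)
      rsp.add("status", "training started");
    } else if ("reload".equals(action)) {
      // reload the trained model used by the update processor (details omitted)
      rsp.add("status", "model reloaded");
    } else {
      rsp.add("status", "ok");
    }
  }

  @Override
  public String getDescription() {
    return "Kicks off Mahout classifier training and manages the trained model";
  }
}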



> 
> I am now continuing to study Solr's plugin documentation. Your help would be
> very encouraging. This project will be a meaningful experience for me, and I
> am motivated to put my energy into it. I will do my best to complete it.
> 
> Best wishes!
> 
> 
> -- 
> Yang Jie(杨杰)
> hi.baidu.com/thinkdifferent
> 
> Group of CLOUD, Xi'an Jiaotong University
> Department of Computer Science and Technology, Xi’an Jiaotong University
> 
> PHONE: 86 1346888 3723
> TEL: 86 29 82665263 EXT. 608
> MSN: xtyangjie2...@yahoo.com.cn
> 
> once i didn't know software is not free; then i knew it days later; now i
> find it indeed free.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search
