Hi, here is my proposal. I hope you can give me some advice. Thanks a lot!

*Overview* Among the ten machine learning algorithms discussed by Cheng-Tao Chu et al.[2], I am particularly interested in Logistic Regression (LR). I would like to implement an LR program on Hadoop that can handle both binary and multiclass classification. LR is an effective and robust algorithm, as well known as Naive Bayes; its principle is clear and its formulas are simple, so it is widely used. However, its time complexity is higher than that of many other algorithms, and we must handle all the numerical errors that can occur while inverting matrices. If LR were integrated into Mahout, taking advantage of MapReduce's speed, many machine learning researchers could be spared building a time-consuming program just for a benchmark.
*Project* Logistic Regression is widely used for binary classification, and many brilliant new learning algorithms build on it. So I will not only implement the basic version of LR; I will also design the architecture carefully, so that the core function can be swapped flexibly, cost-sensitive data (samples carrying a 'cost' weight) can be accepted, and the code can be adapted easily to other frameworks. [1] can serve as a good reference for this. In more detail, my implementation of LR will support the following features:

- *Basic binary classification.* This is essentially the algorithm described in [2]. It reads data in an input format like SVMlight's[3], outputs a model trained on that data, and then uses the model to predict the test data.

- *Simplified binary classification.* In practice, many experiments use LR only as a benchmark and do not need to compute the Hessian matrix in every iteration. Instead, we often substitute a small constant float *h*, regarded as a step length, for the Hessian. Computing the Hessian (which requires several multiplications of large matrices, each with high complexity) is the most time-consuming part of the whole algorithm. Moreover, because of precision problems, the matrix inversion can go out of control over thousands of iterations, whereas a small constant step is safer and often more accurate in those cases. How to choose *h* is itself a problem: usually we estimate it so that the change to the weight vector is not too large, and then try several values manually. We can leave this choice to the user by exposing a parameter, but we can also estimate a proper range for *h* automatically in, say, the first five iterations and pick a value from that range.

- *Automatic stop.* Like the previous feature, this is for the user's convenience: when the change in the weight vector is small enough, the iteration should stop.
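As a concrete (non-Hadoop) illustration of the simplified variant and the automatic stop, here is a minimal Python sketch of binary LR trained by gradient ascent with a fixed step length *h* in place of the Hessian. The function names, the tolerance parameter `tol`, and the optional per-sample `costs` argument (anticipating the cost-sensitive feature below) are my own illustrative choices, not part of the planned Mahout code.

```python
import math

def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_lr(samples, labels, costs=None, h=0.1, tol=1e-6, max_iter=1000):
    """Binary LR fit by gradient ascent on the log-likelihood.

    A fixed step length h replaces the Hessian (the 'simplified'
    variant); iteration stops automatically once the largest
    component of the weight update falls below tol.  An optional
    per-sample cost scales each sample's pull on the boundary."""
    if costs is None:
        costs = [1.0] * len(samples)
    n = len(samples[0])
    w = [0.0] * n
    for _ in range(max_iter):
        grad = [0.0] * n
        for x, y, c in zip(samples, labels, costs):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j in range(n):
                grad[j] += c * (y - p) * x[j]
        w = [wi + h * g for wi, g in zip(w, grad)]
        if max(abs(h * g) for g in grad) < tol:
            break
    return w

def predict(w, x):
    # predicted class (0 or 1) at the usual 0.5 threshold
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x))) >= 0.5 else 0
```

For example, with a constant bias feature prepended, four one-dimensional points (values 0 and 1 labeled 0, values 2 and 3 labeled 1) are separated correctly after training.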
- *Cost-sensitive data.* In some special cases we do not treat every training sample equally: a more important sample is assigned a larger weight and therefore has a bigger influence on the predictor. This is easy to implement, but it plays an important role when LR is integrated into more advanced frameworks.

- *Multiclass case.* The principle of multiclass classification is similar to that of binary LR, but it has more applications in the real world. Its time complexity is about K^3 times that of the binary case, where K is the number of classes.

All the features listed above belong, in my opinion, to 'traditional' LR. As far as I know, LR also has applications in the following settings:

- *Discriminative Learning.* In [7], the authors propose a novel algorithm that uses both training and test data to build a better predictor. Their idea is to use unlabeled test data to correct for the difference between the training and test distributions. Training and test data drawn from different distributions is quite common, and is caused by sample selection bias[6]. Especially in cross-domain experiments, this discriminative approach can clearly improve LR's performance, so I would like to add it.

- *Positive Only Learning.* Here the training data consists of labeled data that is entirely positive, plus unlabeled data. This situation arises when collecting comprehensive negative training data is too expensive. For example, if users want to pick out one kind of web page, such as homepages, from all web pages, they may simply submit some homepages as positive training data, leaving us with only positively labeled data and a large amount of unlabeled data. POL is also very useful for filtering spam email: because the definition of 'spam' is subjective, each spam mail reported by a user can be treated as a positive example, while mail from all users serves as unlabeled data.
So we need not bother to find exactly the ham mail of each user, or worry that one person's ham may be another's spam. There are effective ways to solve this problem, such as [8]; I can use a similar algorithm to extract key negative features and then use them for classification.

These two extensions to LR are important, but not the only possibilities. I hope my implementation of LR can give users as many conveniences and options as possible. Besides, I think that during this project I may be able to help with some related work:

- *Matrix calculation.* LR needs a great deal of matrix computation. From Mahout's Jira I see that some matrix-related work has already begun, and I would be very glad to participate in this area. LR uses two major matrix operations: multiplication and inversion. For multiplication, a simple *divide and conquer* algorithm (Strassen's) reduces the time complexity from O(n^3) to about O(n^2.81) when both matrices are n*n[4]. Although it is not the fastest known method, it is much easier to implement than the asymptotically faster algorithms. As for inversion, precision is crucial. Basically I will use LU decomposition, which is widely used in math software; on details such as handling division by zero, I will consult others.

- *Generalized linear model.* Logistic regression is one of a class of models known as generalized linear models. When I described my idea of implementing LR to the Mahout members, they suggested that I design the architecture carefully so that various generalized linear models, such as Poisson regression, can all be implemented within one framework. So, besides LR, I am willing to implement some other generalized linear models.

*Functional Milestones*
- The LR classifier will be able to do basic binary classification as described in [2].
- The LR classifier will be able to do multiclass classification.
- The LR classifier will be able to stop automatically and estimate *h*.
- Implement Poisson and some other generalized linear models.
- Implement Discriminative Learning.
- Implement Positive Only Learning.
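To make the divide-and-conquer multiplication mentioned above concrete, here is a small Python sketch of Strassen's algorithm (the ~O(n^2.81) scheme referenced from [4]), under the simplifying assumption that both matrices are square with a power-of-two dimension; a real implementation would pad the inputs and fall back to the naive O(n^3) method below some cutoff size.

```python
def strassen(A, B):
    """Strassen's divide-and-conquer matrix multiply.

    Uses seven recursive half-size products instead of eight,
    giving ~O(n^2.81).  Assumes A and B are n-by-n with n a
    power of two (a sketch, not a production routine)."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    m = n // 2

    def split(M):
        # quarter M into its four m-by-m blocks
        return ([r[:m] for r in M[:m]], [r[m:] for r in M[:m]],
                [r[:m] for r in M[m:]], [r[m:] for r in M[m:]])

    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

    def sub(X, Y):
        return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)

    # the seven Strassen products
    P1 = strassen(add(A11, A22), add(B11, B22))
    P2 = strassen(add(A21, A22), B11)
    P3 = strassen(A11, sub(B12, B22))
    P4 = strassen(A22, sub(B21, B11))
    P5 = strassen(add(A11, A12), B22)
    P6 = strassen(sub(A21, A11), add(B11, B12))
    P7 = strassen(sub(A12, A22), add(B21, B22))

    # recombine the quarter blocks of the result
    C11 = add(sub(add(P1, P4), P5), P7)
    C12 = add(P3, P5)
    C21 = add(P2, P4)
    C22 = add(sub(add(P1, P3), P2), P6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

For instance, `strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns `[[19, 22], [43, 50]]`, matching the naive product.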
*Biography* I am an undergraduate student at Shanghai Jiao Tong University. My major is computer science, and I have taken most of the key courses, such as Probability, Algorithms, and UML & Design Patterns. My programming ability is quite good: in 2005 and 2006 I participated in the ACM/ICPC and won the championship at the Korea and Vietnam sites, respectively. I am now a research member of the Apex Lab at SJTU, and I am interested in machine learning, especially classification[9]. Though I am not familiar with open-source development, I will try my best to finish this job.

*Reference*
[1] http://www.stat.columbia.edu/~gelman/standardize/
[2] C.T. Chu, S.K. Kim, Y.A. Lin, Y.Y. Yu, G. Bradski, and A.Y. Ng. Map-Reduce for Machine Learning on Multicore. In Advances in Neural Information Processing Systems.
[3] http://svmlight.joachims.org/
[4] S. Dasgupta, C.H. Papadimitriou, and U.V. Vazirani. Algorithms, pages 66-67.
[5] J.J. Heckman. Sample selection bias as a specification error. Econometrica, 47:153-161, 1979.
[6] B. Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the Twenty-First International Conference on Machine Learning, pages 114-121, 2004.
[7] S. Bickel, M. Brückner, and T. Scheffer. Discriminative learning for differing training and test distributions. In Proceedings of the 24th International Conference on Machine Learning, 2007.
[8] H. Yu, J. Han, and K.C.C. Chang. PEBL: Positive example based learning for Web page classification using SVM. In Proceedings of the Eighth ACM SIGKDD, 2002.
[9] X. Lin, W.Y. Dai, Y. Jiang, and Y. Qiang. Classifying Chinese Web Pages based on English Training Data. In Proceedings of the 17th International World Wide Web Conference, 2008.

Yun Jiang