Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT) Page: Reference Reading (https://cwiki.apache.org/confluence/display/MAHOUT/Reference+Reading)
Change Comment:
---------------------------------------------------------------------
added a big collection of mailing list suggestions (re-using this page, rather than starting background-materials)

Edited by Dan Brickley:
---------------------------------------------------------------------
h1. General Clustering

h2. Discussions

* http://www.lucidimagination.com/search/document/1c3561d17fc1b81c/clustering_techniques_tips_and_tricks

h1. Text Clustering

h2. Clustering as part of Search

* See the chapters on hierarchical and flat clustering as part of search in http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html

h1. Suggestions from the mailing list

[http://mail-archives.apache.org/mod_mbox/mahout-user/201103.mbox/%[email protected]%3E Q:] Can someone recommend good books on statistics, and also on linear algebra and analytic geometry, that will provide enough background for understanding machine learning algorithms?

The answers below focus on general background knowledge, rather than specifics of Mahout and associated Apache tooling. Feel free to add useful resources (books, but also videos, online courseware, tools), particularly those that are available free online.

This page originated in an email thread, and its different contributors might not all agree on the best approach (and they might not know what's best for any given learner), but the resources here should give some idea of suitable background reading. Check the mailing list [http://mail-archives.apache.org/mod_mbox/mahout-user/ archives] if you care to figure out who said what, or to find other suggestions.

But don't be overwhelmed by all the maths; you can do a lot in Mahout with some basic knowledge. The resources given here will help you understand your data better, and ask better questions both of Mahout's APIs and of the Mahout community. And unlike learning some particular software package, these are skills that will be useful decades later.

BOOKS and supporting materials on statistics, machine learning, etc.:

Gilbert Strang's (http://www-math.mit.edu/~gs) "Introduction to Linear Algebra" http://math.mit.edu/linearalgebra/ (full text online, highly recommended by several on the Mahout list). http://openlibrary.org/works/OL3285486W/Introduction_to_linear_algebra
His lectures are also available online:
http://web.mit.edu/18.06/www/
http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/

"Mathematical Tools for Applied Multivariate Analysis" by J. Douglas Carroll. http://www.amazon.com/Mathematical-Tools-Applied-Multivariate-Analysis/dp/0121609553/ref=sr_1_1?ie=UTF8&qid=1299602805&sr=8-1

Stanford Machine Learning online courseware (cs229.stanford.edu): http://www.stanford.edu/class/cs229/
"It's a very nicely taught course with super helpful lecture notes - and you can get all the videos on YouTube or iTunes U." http://itunes.apple.com/itunes-u/machine-learning/id384233048
The section notes for this course - http://www.stanford.edu/class/cs229/materials.html - will give you enough review material on linear algebra and probability theory to get you going.

MIT Machine Learning online courseware (6.867): http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/
Lecture notes (PDFs): http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/lecture-notes/

As a prerequisite to probability and statistics, you'll need basic calculus. A maths-for-scientists text might be useful here, such as Mathematics for Engineers and Scientists, Alan Jeffrey, Chapman & Hall/CRC. http://openlibrary.org/books/OL3305993M/Mathematics_for_engineers_and_scientists

One of the best writers in the probability/statistics world is Sheldon Ross.
Try ''A First Course in Probability (8th Edition), Pearson'' and then move on to his ''Introduction to Probability Models (9th Edition), Academic Press.''
http://www.pearsonhighered.com/educator/product/First-Course-in-Probability-A/9780136033134.page
http://www.amazon.com/Introduction-Probability-Models-Sixth-Sheldon/dp/0125984707

Some good introductory alternatives here are:
Probability and Statistics (7th Edition), Jay L. Devore, Chapman. http://www.amazon.com/Probability-Statistics-Engineering-Sciences-InfoTrac/dp/0534399339
Probability and Statistical Inference (7th Edition), Hogg and Tanis, Pearson. http://www.amazon.com/Probability-Statistical-Inference-Robert-Hogg/dp/0132546086

Once you have a grasp of the basics, there is a slew of great texts that you might consult: for example, Statistical Inference, Casella and Berger, Duxbury/Thomson Learning. http://www.amazon.com/Statistical-Inference-George-Casella/dp/0534243126

Most statistics books will have some sort of introduction to Bayesian methods, but consider a specialist text, e.g.: Introduction to Bayesian Statistics (2nd Edition), William H. Bolstad, Wiley. http://www.amazon.com/Introduction-Bayesian-Statistics-William-Bolstad/dp/0471270202

Then for the computational side of Bayesian statistics (predominantly Markov chain Monte Carlo), e.g. Bolstad's Understanding Computational Bayesian Statistics, Wiley.
http://www.amazon.com/Understanding-Computational-Bayesian-Statistics-Wiley/dp/0470046090

Then you might try the MCMC galacticos: Bayesian Data Analysis, Gelman et al., Chapman & Hall/CRC. http://www.stat.columbia.edu/~gelman/book/

On top of the books, R - http://en.wikipedia.org/wiki/R_(programming_language) - is an indispensable software tool for visualizing distributions and doing calculations.

(another viewpoint)

For statistics related to machine learning, I would avoid normal statistical texts and go with these instead:
Pattern Recognition and Machine Learning by Chris Bishop [http://research.microsoft.com/en-us/um/people/cmbishop/PRML/index.htm]
Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/] (full text online)

What about matrix computations, decomposition, factorization, etc.? How's this one? [http://www.amazon.com/gp/product/0801854148/ref=s9_simh_gw_p14_d0_i1?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-3&pf_rd_r=0ESQ3KDY8MJ1AWWG8PFR&pf_rd_t=101&pf_rd_p=470938811&pf_rd_i=507846] Any idea? Any other suggestions?

I found Peter V. O'Neil's "Introduction to Linear Algebra" to be a great book for beginners (with some knowledge of calculus). It is not comprehensive, but I believe it is a good place to start. The author begins by explaining the concepts in terms of vector spaces, which I found to be a more natural way of explaining them. http://www.amazon.com/Introduction-Linear-Algebra-Theory-Applications/dp/053400606X

David S. Watkins, "Fundamentals of Matrix Computations (Pure and Applied Mathematics: A Wiley Series of Texts, Monographs and Tracts)" [http://www.amazon.com/Fundamentals-Matrix-Computations-Applied-Mathematics/dp/0470528338/]

The Golub / Van Loan text you mention is the classic text for numerical linear algebra. Can't go wrong with it.
However, I'd also suggest you look at Nick Trefethen's "Numerical Linear Algebra". It's a bit more approachable for practitioners; GVL is better suited for researchers.
[http://people.maths.ox.ac.uk/trefethen/books.html]
[http://people.maths.ox.ac.uk/trefethen/text.html] (with some online lecture notes)

I think this is the most relevant book for matrix math on distributed systems: http://www.amazon.com/Numerical-Linear-Algebra-Lloyd-Trefethen/dp/0898713617
It has many chapters on SVD, and there are even chapters on Lanczos.

BTW, what about R? There are literally tons of books in the R series devoted to rather isolated problems, but what would be a good crash-course book?

Ted Dunning: I have found that learning about R is a difficult thing. The best introduction I have seen is, paradoxically, not really a book about R and assumes a statistical mind-set that I disagree with. That introduction is in MASS [http://www.stats.ox.ac.uk/pub/MASS4/].

Other references also exist:
[http://www.r-tutor.com/r-introduction]
[http://cran.r-project.org/doc/manuals/R-intro.pdf]
[http://faculty.washington.edu/tlumley/Rcourse/]

In addition, you should see how to plot data well:
[http://www.statmethods.net/advgraphs/trellis.html]
[http://had.co.nz/ggplot2/]

Generally, I learn more about R by watching people and reading code than by reading books. There are many small tricks (how to format data optimally, how to restructure data.frames, common ways to plot data, which libraries do what, and so on) for which an introductory book cannot convey general principles that will see you through to success.

For JavaScript/Web plotting: [http://www.1stwebdesigner.com/css/top-jquery-chart-libraries-interactive-charts/]
