[ https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15113112#comment-15113112 ]
Thang Nguyen edited comment on FLINK-1733 at 1/22/16 9:26 PM:
--------------------------------------------------------------

Hi [~till.rohrmann]!

I am a software engineer professionally, but I am new to Scala. I did learn some functional programming in undergrad, so the trickiest thing for me to wrap my head around is Scala's type system.

For context: I have a naive PCA implementation and some trivial tests for it (using the method & test data from this paper: http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf).

Currently, the method accepts an {{Int}} (the number of principal components) and a {{DataSet[DenseVector]}}. For this implementation, I create a covariance matrix ({{BreezeMatrix}}) and call {{breeze.linalg.eigSym}} on that. Then I return the top N (user param) principal components as a {{DataSet[Vector]}}. I will be refactoring or throwing out a lot of my code (except the tests), so I hesitate to show anything I've written just yet.

Questions:

Does the method signature make sense?

What _exactly_ should I be returning? The concept of PCA is new to me, but it sounds like I should be returning the top N eigenvectors (ordered by their eigenvalues, most significant first). Should the output also be {{DataSet[DenseVector]}}?

Any pointers on how to implement sPCA?

I have taken a cursory look at the rest of the ML library, but I am still learning Scala. If you have any recommended resources on learning Scala (specifically the type system), I would also appreciate that.

Thanks!
Thang

> Add PCA to machine learning library
> -----------------------------------
>
>                 Key: FLINK-1733
>                 URL: https://issues.apache.org/jira/browse/FLINK-1733
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Thang Nguyen
>            Priority: Minor
>              Labels: ML
>
> Dimension reduction is a crucial prerequisite for many data analysis tasks.
> Therefore, Flink's machine learning library should contain a principal
> components analysis (PCA) implementation. Maria-Florina Balcan et al. [1]
> propose a distributed PCA. A more recent publication [2] describes another
> scalable PCA implementation.
> Resources:
> [1] [http://arxiv.org/pdf/1408.5823v5.pdf]
> [2] [http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf]
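
For readers following the thread, the naive approach described in the comment (build a covariance matrix, call {{breeze.linalg.eigSym}}, return the top N eigenvectors) might look roughly like the sketch below. This is not the code from the comment, which was not posted. The names {{NaivePca}} and {{topComponents}} are hypothetical, and the sketch assumes a local {{Seq}} of Breeze vectors rather than a Flink {{DataSet}}, with Breeze available on the classpath.

```scala
import breeze.linalg.{DenseMatrix, DenseVector, eigSym}

// Hypothetical helper, for illustration only: a local (non-distributed) PCA.
object NaivePca {

  /** Returns the k eigenvectors of the sample covariance matrix with the
    * largest eigenvalues, most significant first.
    */
  def topComponents(data: Seq[DenseVector[Double]], k: Int): Seq[DenseVector[Double]] = {
    require(data.length >= 2, "need at least two samples")
    val n = data.length
    val d = data.head.length
    require(k >= 1 && k <= d, "k must be between 1 and the dimension")

    // Mean of all samples, used to center the data.
    val mean = data.reduce(_ + _) / n.toDouble

    // Accumulate the outer products of the centered samples.
    val scatter = DenseMatrix.zeros[Double](d, d)
    data.foreach { x =>
      val centered = x - mean
      scatter += centered * centered.t
    }

    // Unbiased sample covariance.
    val cov = scatter / (n - 1).toDouble

    // eigSym returns eigenvalues in ascending order, with the matching
    // eigenvectors as columns, so read the columns from the right.
    val es = eigSym(cov)
    (0 until k).map(i => es.eigenvectors(::, d - 1 - i).copy)
  }
}
```

Note that an eigenvector is only determined up to sign, so tests should compare directions (for example via the absolute value of a dot product) rather than exact coordinates. A distributed version would have to compute the mean and covariance with {{DataSet}} operations instead of a local loop.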