[ 
https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15113112#comment-15113112
 ] 

Thang Nguyen edited comment on FLINK-1733 at 1/22/16 9:26 PM:
--------------------------------------------------------------

Hi [~till.rohrmann]! 

I am a software engineer by profession, but I am new to Scala. I learned some 
functional programming as an undergrad, so the trickiest thing for me to wrap 
my head around is Scala's type system. 

For context: 
I have a naive PCA implementation and some trivial tests for it (using the 
method & test data from this paper: 
http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf). 

Currently, the method accepts an {{Int}} (the number of principal components) 
and a {{DataSet[DenseVector]}}.

For this implementation, I create a covariance matrix ({{BreezeMatrix}}) and 
call {{breeze.linalg.eigSym}} on that. 
Then I return the top N principal components (N is the user-supplied 
parameter) as a {{DataSet[Vector]}}.
I will be refactoring or throwing out a lot of my code (except the tests), so 
I hesitate to show anything I've written just yet.
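
As a rough illustration of the steps described above (center the data, build 
the covariance matrix, call {{breeze.linalg.eigSym}}, keep the top N 
eigenvectors), a minimal local sketch might look like the following. It is not 
the actual implementation: it assumes an in-memory Breeze {{DenseMatrix}} with 
one sample per row instead of a Flink {{DataSet}}, and the names 
{{NaivePcaSketch}} and {{topComponents}} are made up for illustration.

```scala
import breeze.linalg.{DenseMatrix, DenseVector, eigSym}

// Hypothetical helper, for illustration only (not part of Flink or Breeze).
object NaivePcaSketch {

  /** Returns the top `k` principal components of `data`, i.e. the eigenvectors
    * of the sample covariance matrix with the `k` largest eigenvalues, ordered
    * by decreasing eigenvalue. Rows are samples, columns are features.
    */
  def topComponents(data: DenseMatrix[Double], k: Int): Seq[DenseVector[Double]] = {
    val n = data.rows
    val d = data.cols
    require(n > 1, "need at least two samples")
    require(k >= 1 && k <= d, "k must be between 1 and the number of features")

    // Per-feature mean: sum all rows, then divide by the sample count.
    val mean = DenseVector.zeros[Double](d)
    for (i <- 0 until n) mean += data(i, ::).t
    mean *= 1.0 / n

    // Subtract the mean from every row.
    val centered = DenseMatrix.zeros[Double](n, d)
    for (i <- 0 until n) centered(i, ::) := (data(i, ::).t - mean).t

    // Sample covariance matrix (d x d), using the n - 1 normalisation.
    val cov = (centered.t * centered) * (1.0 / (n - 1))

    // eigSym returns eigenvalues in ascending order, with the matching
    // eigenvectors as columns, so the top components are the last columns.
    val es = eigSym(cov)
    (0 until k).map(j => es.eigenvectors(::, d - 1 - j).copy)
  }
}
```

Collecting the whole data set into one local matrix like this only works for 
small inputs; a distributed version would have to build the covariance matrix 
(or avoid materialising it, as sPCA does) from the {{DataSet}} itself.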

Questions:
* Does the method signature make sense? 
* What _exactly_ should I be returning? The concept of PCA is new to me, but 
  it sounds like I should be returning the top N vectors, ranked by their 
  eigenvalues (most significant first). 
* Should the output also be a {{DataSet[DenseVector]}}?
* Do you have any pointers on how to implement sPCA? 

I have taken a cursory look at the rest of the ML library but I am still 
learning Scala. 
If you have any recommended resources on learning Scala (specifically the type 
system), I would also appreciate that. 

Thanks! 

Thang


> Add PCA to machine learning library
> -----------------------------------
>
>                 Key: FLINK-1733
>                 URL: https://issues.apache.org/jira/browse/FLINK-1733
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Thang Nguyen
>            Priority: Minor
>              Labels: ML
>
> Dimension reduction is a crucial prerequisite for many data analysis tasks. 
> Therefore, Flink's machine learning library should contain a principal 
> components analysis (PCA) implementation. Maria-Florina Balcan et al. [1] 
> propose a distributed PCA. A more recent publication [2] describes another 
> scalable PCA implementation.
> Resources:
> [1] [http://arxiv.org/pdf/1408.5823v5.pdf]
> [2] [http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
