[jira] [Issue Comment Deleted] (FLINK-2183) TaskManagerFailsWithSlotSharingITCase fails.
[ https://issues.apache.org/jira/browse/FLINK-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sachin Goel updated FLINK-2183: --- Comment: was deleted (was: I'm not familiar with that. Is this expected? And is someone already working on it?) > TaskManagerFailsWithSlotSharingITCase fails. > > > Key: FLINK-2183 > URL: https://issues.apache.org/jira/browse/FLINK-2183 > Project: Flink > Issue Type: Bug >Reporter: Sachin Goel > > Travis build is failing on this test. > Here is the log output: > https://s3.amazonaws.com/archive.travis-ci.org/jobs/65871133/log.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2183) TaskManagerFailsWithSlotSharingITCase fails.
[ https://issues.apache.org/jira/browse/FLINK-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578424#comment-14578424 ] Sachin Goel commented on FLINK-2183: I'm not familiar with that. Is this expected? And is someone already working on it? > TaskManagerFailsWithSlotSharingITCase fails. > > > Key: FLINK-2183 > URL: https://issues.apache.org/jira/browse/FLINK-2183 > Project: Flink > Issue Type: Bug >Reporter: Sachin Goel > > Travis build is failing on this test. > Here is the log output: > https://s3.amazonaws.com/archive.travis-ci.org/jobs/65871133/log.txt
[jira] [Commented] (FLINK-2185) Rework semantics for .setSeed function of SVM
[ https://issues.apache.org/jira/browse/FLINK-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577893#comment-14577893 ] Dmitriy Lyubimov commented on FLINK-2185: - is SVM work in Flink non-linear or linear only? > Rework semantics for .setSeed function of SVM > - > > Key: FLINK-2185 > URL: https://issues.apache.org/jira/browse/FLINK-2185 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library >Reporter: Theodore Vasiloudis >Priority: Minor > Labels: ML > Fix For: 0.10 > > > Currently setting the seed for the SVM does not have the expected result of > producing the same output for each run of the algorithm, potentially due to > the randomness in data partitioning. > We should then rework the way the algorithm works to either ensure we get the > exact same results when the seed is set, or if that is not possible the > setSeed function should be removed. > Also in the current implementation the default value of 0 for the seed means > that all runs for which we don't set the seed will get the same seed which is > not the intended behavior.
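The default-seed problem described in the issue can be sketched outside of FlinkML: with a fixed default of 0, every unseeded run draws the same random stream, whereas treating "no seed set" as "draw a fresh seed per run" restores the intended behavior. The class and method names below are hypothetical illustrations, not the actual FlinkML API.

```java
import java.util.Random;

// Hypothetical sketch of seed handling: instead of defaulting the seed
// to a fixed 0 (which makes all unseeded runs identical), treat an
// unset seed as "draw a fresh random seed for this run".
public class SeedParameter {
    private Long seed; // null until the user explicitly calls setSeed

    public void setSeed(long s) {
        this.seed = s;
    }

    public long resolveSeed() {
        // Explicitly set seed: reproducible runs.
        // Unset seed: a new random seed per run, as a user would expect.
        return (seed != null) ? seed : new Random().nextLong();
    }
}
```

With this convention, `setSeed` only promises reproducibility when actually called; whether reproducibility can then be guaranteed despite partitioning randomness is the separate question the issue raises.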
[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class
[ https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577863#comment-14577863 ] ASF GitHub Bot commented on FLINK-2093: --- Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/807#issuecomment-110143059 Hi @shghatge, you don't need to close a PR in order to update it. You can simply update (push or push --force into) the branch from which you created the PR and Github will automatically update the PR. This helps to have all comments about your implementation in one place. > Add a difference method to Gelly's Graph class > -- > > Key: FLINK-2093 > URL: https://issues.apache.org/jira/browse/FLINK-2093 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 0.9 >Reporter: Andra Lungu >Assignee: Shivani Ghatge >Priority: Minor > > This method will compute the difference between two graphs, returning a new > graph containing the vertices and edges that the current graph and the input > graph don't have in common.
[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method
Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/807#issuecomment-110143059 Hi @shghatge, you don't need to close a PR in order to update it. You can simply update (push or push --force into) the branch from which you created the PR and Github will automatically update the PR. This helps to have all comments about your implementation in one place. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-1601) Sometimes the YARNSessionFIFOITCase fails on Travis
[ https://issues.apache.org/jira/browse/FLINK-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577841#comment-14577841 ] Sachin Goel commented on FLINK-1601: This test is failing again on travis build. The issue seems different however. Here is the log: https://api.travis-ci.org/jobs/65946506/log.txt?deansi=true > Sometimes the YARNSessionFIFOITCase fails on Travis > --- > > Key: FLINK-1601 > URL: https://issues.apache.org/jira/browse/FLINK-1601 > Project: Flink > Issue Type: Bug >Reporter: Till Rohrmann >Assignee: Robert Metzger > > Sometimes the {{YARNSessionFIFOITCase}} fails on Travis with the following > exception. > {code} > Tests run: 8, Failures: 8, Errors: 0, Skipped: 0, Time elapsed: 71.375 sec > <<< FAILURE! - in org.apache.flink.yarn.YARNSessionFIFOITCase > perJobYarnCluster(org.apache.flink.yarn.YARNSessionFIFOITCase) Time elapsed: > 60.707 sec <<< FAILURE! > java.lang.AssertionError: During the timeout period of 60 seconds the > expected string did not show up > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.apache.flink.yarn.YarnTestBase.runWithArgs(YarnTestBase.java:315) > at > org.apache.flink.yarn.YARNSessionFIFOITCase.perJobYarnCluster(YARNSessionFIFOITCase.java:185) > testQueryCluster(org.apache.flink.yarn.YARNSessionFIFOITCase) Time elapsed: > 0.507 sec <<< FAILURE! 
> java.lang.AssertionError: There is at least one application on the cluster is > not finished > at org.junit.Assert.fail(Assert.java:88) > at > org.apache.flink.yarn.YarnTestBase.checkClusterEmpty(YarnTestBase.java:146) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.runners.ParentRunner.run(ParentRunner.java:309) > at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153) > at > 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124) > at > org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200) > at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153) > at > org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103) > {code} > The result is > {code} > Failed tests: > YARNSessionFIFOITCase.perJobYarnCluster:185->YarnTestBase.runWithArgs:315 > During the timeout period of 60 seconds the expected string did not show up > YARNSessionFIFOITCase>YarnTestBase.checkClusterEmpty:146 There is at least > one application on the cluster is not finished > YARNSessionFIFOITCase>YarnTestBase.checkClusterEmpty:146 There is at least > one application on the cluster is not finished > YARNSessionFIFOITCase>YarnTestBase.checkClusterEmpty:146 There is at least > one application on the cluster is not finished > YARNSessionFIFOITCase>YarnTestBase.checkClusterEmpty:146 There is at least > one application on the cluster is not finished > YARNSessionFIFOITCase>YarnTestBase.checkClusterEmpty:146 There is at least > one application on the cluster is not finished > YARNSessi
[jira] [Updated] (FLINK-2187) KMeans clustering is not present in release-0.9-rc1
[ https://issues.apache.org/jira/browse/FLINK-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sachin Goel updated FLINK-2187: --- Affects Version/s: 0.9 > KMeans clustering is not present in release-0.9-rc1 > --- > > Key: FLINK-2187 > URL: https://issues.apache.org/jira/browse/FLINK-2187 > Project: Flink > Issue Type: Bug >Affects Versions: 0.9 >Reporter: Sachin Goel > > The Flink ml readme > [https://github.com/apache/flink/tree/release-0.9-rc1/flink-staging/flink-ml] > contains kmeans clustering in its description. However, the kmeans is not > part of the ML library yet. It is still only present in examples.
[jira] [Created] (FLINK-2187) KMeans clustering is not present in release-0.9-rc1
Sachin Goel created FLINK-2187: -- Summary: KMeans clustering is not present in release-0.9-rc1 Key: FLINK-2187 URL: https://issues.apache.org/jira/browse/FLINK-2187 Project: Flink Issue Type: Bug Reporter: Sachin Goel The Flink ml readme [https://github.com/apache/flink/tree/release-0.9-rc1/flink-staging/flink-ml] contains kmeans clustering in its description. However, the kmeans is not part of the ML library yet. It is still only present in examples.
[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577588#comment-14577588 ] ASF GitHub Bot commented on FLINK-2174: --- Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110096269 No worries :) I am used to it. And I think it's good to ensure high code quality! > Allow comments in 'slaves' file > --- > > Key: FLINK-2174 > URL: https://issues.apache.org/jira/browse/FLINK-2174 > Project: Flink > Issue Type: Improvement > Components: Start-Stop Scripts >Reporter: Matthias J. Sax >Assignee: Matthias J. Sax >Priority: Trivial > Fix For: 0.10 > > > Currently, each line in slaves is interpreted as a host name. Scripts should > skip lines starting with '#'. Also allow for comments at the end of a line > and skip empty lines.
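The parsing rules from the FLINK-2174 description (skip lines starting with '#', allow trailing comments, skip empty lines) amount to a small line filter. The actual change lives in Flink's bash start/stop scripts; the sketch below only illustrates the same logic in Java, with hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not the real bash implementation) of the intended
// 'slaves' file parsing: drop trailing '#' comments, trim whitespace,
// and skip lines that are empty or comment-only.
public class SlavesParser {
    public static List<String> parseHosts(List<String> lines) {
        List<String> hosts = new ArrayList<>();
        for (String line : lines) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // strip trailing comment
            }
            line = line.trim();
            if (!line.isEmpty()) {
                hosts.add(line); // what remains is a host name
            }
        }
        return hosts;
    }
}
```

For example, a file containing a full-line comment, `host2 # rack A`, and a blank line would yield only the host names.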
[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file
Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110096269 No worries :) I am used to it. And I think it's good to ensure high code quality!
[jira] [Updated] (FLINK-1916) EOFException when running delta-iteration job
[ https://issues.apache.org/jira/browse/FLINK-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maximilian Michels updated FLINK-1916: -- Affects Version/s: 0.10 > EOFException when running delta-iteration job > - > > Key: FLINK-1916 > URL: https://issues.apache.org/jira/browse/FLINK-1916 > Project: Flink > Issue Type: Bug > Components: Local Runtime >Affects Versions: 0.9, 0.10 > Environment: 0.9-milestone-1 > Exception on the cluster, local execution works >Reporter: Stefan Bunk > > The delta-iteration program in [1] ends with an > java.io.EOFException > at > org.apache.flink.runtime.operators.hash.InMemoryPartition$WriteView.nextSegment(InMemoryPartition.java:333) > at > org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.advance(AbstractPagedOutputView.java:140) > at > org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.writeByte(AbstractPagedOutputView.java:223) > at > org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.writeLong(AbstractPagedOutputView.java:291) > at > org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.writeDouble(AbstractPagedOutputView.java:307) > at > org.apache.flink.api.common.typeutils.base.DoubleSerializer.serialize(DoubleSerializer.java:62) > at > org.apache.flink.api.common.typeutils.base.DoubleSerializer.serialize(DoubleSerializer.java:26) > at > org.apache.flink.api.scala.typeutils.CaseClassSerializer.serialize(CaseClassSerializer.scala:89) > at > org.apache.flink.api.scala.typeutils.CaseClassSerializer.serialize(CaseClassSerializer.scala:29) > at > org.apache.flink.runtime.operators.hash.InMemoryPartition.appendRecord(InMemoryPartition.java:219) > at > org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:536) > at > org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:347) > at > 
org.apache.flink.runtime.iterative.task.IterationHeadPactTask.readInitialSolutionSet(IterationHeadPactTask.java:209) > at > org.apache.flink.runtime.iterative.task.IterationHeadPactTask.run(IterationHeadPactTask.java:270) > at > org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362) > at > org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:217) > at java.lang.Thread.run(Thread.java:745) > For logs and the accompanying mailing list discussion see below. > When running with slightly different memory configuration, as hinted on the > mailing list, I sometimes also get this exception: > 19.Apr. 13:39:29 INFO Task - IterationHead(WorksetIteration > (Resolved-Redirects)) (10/10) switched to FAILED : > java.lang.IndexOutOfBoundsException: Index: 161, Size: 161 > at java.util.ArrayList.rangeCheck(ArrayList.java:635) > at java.util.ArrayList.get(ArrayList.java:411) > at > org.apache.flink.runtime.operators.hash.InMemoryPartition$WriteView.resetTo(InMemoryPartition.java:352) > at > org.apache.flink.runtime.operators.hash.InMemoryPartition$WriteView.access$100(InMemoryPartition.java:301) > at > org.apache.flink.runtime.operators.hash.InMemoryPartition.appendRecord(InMemoryPartition.java:226) > at > org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:536) > at > org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:347) > at > org.apache.flink.runtime.iterative.task.IterationHeadPactTask.readInitialSolutionSet(IterationHeadPactTask.java:209) > at > org.apache.flink.runtime.iterative.task.IterationHeadPactTask.run(IterationHeadPactTask.java:270) > at > org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362) > at > org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:217) > at java.lang.Thread.run(Thread.java:745) > [1] Flink job: 
https://gist.github.com/knub/0bfec859a563009c1d57 > [2] Job manager logs: https://gist.github.com/knub/01e3a4b0edb8cde66ff4 > [3] One task manager's logs: https://gist.github.com/knub/8f2f953da95c8d7adefc > [4] > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/EOFException-when-running-Flink-job-td1092.html
[jira] [Resolved] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maximilian Michels resolved FLINK-2174. --- Resolution: Implemented Fix Version/s: 0.10 > Allow comments in 'slaves' file > --- > > Key: FLINK-2174 > URL: https://issues.apache.org/jira/browse/FLINK-2174 > Project: Flink > Issue Type: Improvement > Components: Start-Stop Scripts >Reporter: Matthias J. Sax >Assignee: Matthias J. Sax >Priority: Trivial > Fix For: 0.10 > > > Currently, each line in slaves is interpreted as a host name. Scripts should > skip lines starting with '#'. Also allow for comments at the end of a line > and skip empty lines.
[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class
[ https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577487#comment-14577487 ] ASF GitHub Bot commented on FLINK-2093: --- Github user shghatge closed the pull request at: https://github.com/apache/flink/pull/807 > Add a difference method to Gelly's Graph class > -- > > Key: FLINK-2093 > URL: https://issues.apache.org/jira/browse/FLINK-2093 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 0.9 >Reporter: Andra Lungu >Assignee: Shivani Ghatge >Priority: Minor > > This method will compute the difference between two graphs, returning a new > graph containing the vertices and edges that the current graph and the input > graph don't have in common.
[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method
Github user shghatge closed the pull request at: https://github.com/apache/flink/pull/807
[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method
Github user shghatge commented on the pull request: https://github.com/apache/flink/pull/807#issuecomment-110078878 Then it was just removing vertices! Talk about swatting a Fly with a Sledgehammer! I will do all the changes you suggested.
[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class
[ https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577486#comment-14577486 ] ASF GitHub Bot commented on FLINK-2093: --- Github user shghatge commented on the pull request: https://github.com/apache/flink/pull/807#issuecomment-110078878 Then it was just removing vertices! Talk about swatting a Fly with a Sledgehammer! I will do all the changes you suggested. > Add a difference method to Gelly's Graph class > -- > > Key: FLINK-2093 > URL: https://issues.apache.org/jira/browse/FLINK-2093 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 0.9 >Reporter: Andra Lungu >Assignee: Shivani Ghatge >Priority: Minor > > This method will compute the difference between two graphs, returning a new > graph containing the vertices and edges that the current graph and the input > graph don't have in common.
[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577482#comment-14577482 ] ASF GitHub Bot commented on FLINK-2174: --- Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/796 > Allow comments in 'slaves' file > --- > > Key: FLINK-2174 > URL: https://issues.apache.org/jira/browse/FLINK-2174 > Project: Flink > Issue Type: Improvement > Components: Start-Stop Scripts >Reporter: Matthias J. Sax >Assignee: Matthias J. Sax >Priority: Trivial > > Currently, each line in slaves is interpreted as a host name. Scripts should > skip lines starting with '#'. Also allow for comments at the end of a line > and skip empty lines.
[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/796
[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577479#comment-14577479 ] ASF GitHub Bot commented on FLINK-2174: --- Github user mxm commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110076740 Thank you! Sorry for giving you a hard time. > Allow comments in 'slaves' file > --- > > Key: FLINK-2174 > URL: https://issues.apache.org/jira/browse/FLINK-2174 > Project: Flink > Issue Type: Improvement > Components: Start-Stop Scripts >Reporter: Matthias J. Sax >Assignee: Matthias J. Sax >Priority: Trivial > > Currently, each line in slaves is interpreted as a host name. Scripts should > skip lines starting with '#'. Also allow for comments at the end of a line > and skip empty lines.
[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file
Github user mxm commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110076740 Thank you! Sorry for giving you a hard time.
[GitHub] flink pull request: [docs] Fix some typos and grammar in the Strea...
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/806
[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class
[ https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577454#comment-14577454 ] ASF GitHub Bot commented on FLINK-2093: --- Github user andralungu commented on the pull request: https://github.com/apache/flink/pull/807#issuecomment-110070080 Hi @shghatge , This is very nice for a first PR and I am happy to see that you followed my guidelines :) I left a set of comments in-line. Apart from those: - the difference method can be simplified. You don't need to filterOnEdges. Have a closer look at removeVertices. Imagine what happens if you remove a vertex, the edge will also have to be removed. You cannot leave an edge with just the source or the target vertex trailing. - I think you forgot to add the corner case test for an input graph which does not have common vertices with the first one. I know you wrote it :) - Finally, if you have a look at the Travis build here, it failed because you are indenting with spaces instead of tabs. You should play a bit with your IntelliJ settings. No worries! This is a rookie mistake, we all did it at first. To check everything is okay, just do a cd flink-staging/flink-gelly and then mvn verify. After it says build success, we're good to go. Rebase and update the PR. If you have questions, I'll be more than happy to answer them! Nice job! > Add a difference method to Gelly's Graph class > -- > > Key: FLINK-2093 > URL: https://issues.apache.org/jira/browse/FLINK-2093 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 0.9 >Reporter: Andra Lungu >Assignee: Shivani Ghatge >Priority: Minor > > This method will compute the difference between two graphs, returning a new > graph containing the vertices and edges that the current graph and the input > graph don't have in common.
[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method
Github user andralungu commented on the pull request: https://github.com/apache/flink/pull/807#issuecomment-110070080 Hi @shghatge , This is very nice for a first PR and I am happy to see that you followed my guidelines :) I left a set of comments in-line. Apart from those: - the difference method can be simplified. You don't need to filterOnEdges. Have a closer look at removeVertices. Imagine what happens if you remove a vertex, the edge will also have to be removed. You cannot leave an edge with just the source or the target vertex trailing. - I think you forgot to add the corner case test for an input graph which does not have common vertices with the first one. I know you wrote it :) - Finally, if you have a look at the Travis build here, it failed because you are indenting with spaces instead of tabs. You should play a bit with your IntelliJ settings. No worries! This is a rookie mistake, we all did it at first. To check everything is okay, just do a cd flink-staging/flink-gelly and then mvn verify. After it says build success, we're good to go. Rebase and update the PR. If you have questions, I'll be more than happy to answer them! Nice job!
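The reviewer's point above, that difference reduces to removing the second graph's vertices because an edge cannot survive the removal of either of its endpoints, can be sketched with plain Java collections. This is not the actual Gelly API; the types and names below are illustrative only.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the difference semantics discussed above:
// remove the other graph's vertices from this graph, and drop every
// edge left dangling by a removed endpoint (Gelly's removeVertices
// does this in one step; no separate edge filtering is needed).
public class GraphDifference {
    // edges encoded as long[]{source, target}
    public static Map.Entry<Set<Long>, List<long[]>> difference(
            Set<Long> vertices, List<long[]> edges, Set<Long> otherVertices) {
        Set<Long> keptVertices = new HashSet<>(vertices);
        keptVertices.removeAll(otherVertices); // the "removeVertices" step
        List<long[]> keptEdges = new ArrayList<>();
        for (long[] e : edges) {
            // an edge survives only if both endpoints survive
            if (keptVertices.contains(e[0]) && keptVertices.contains(e[1])) {
                keptEdges.add(e);
            }
        }
        return new AbstractMap.SimpleEntry<>(keptVertices, keptEdges);
    }
}
```

For instance, removing vertex 3 from a graph with vertices {1, 2, 3} and edges (1,2), (2,3) leaves {1, 2} and only the edge (1,2).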
[jira] [Updated] (FLINK-1877) Stalled Flink on Tez build
[ https://issues.apache.org/jira/browse/FLINK-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Metzger updated FLINK-1877: -- Component/s: Flink on Tez > Stalled Flink on Tez build > -- > > Key: FLINK-1877 > URL: https://issues.apache.org/jira/browse/FLINK-1877 > Project: Flink > Issue Type: Bug > Components: Flink on Tez, Tests >Affects Versions: master >Reporter: Ufuk Celebi >Assignee: Kostas Tzoumas > > https://travis-ci.org/uce/incubator-flink/jobs/57951373 > https://s3.amazonaws.com/archive.travis-ci.org/jobs/57951373/log.txt
[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class
[ https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577432#comment-14577432 ] ASF GitHub Bot commented on FLINK-2093: --- Github user andralungu commented on a diff in the pull request: https://github.com/apache/flink/pull/807#discussion_r31932709 --- Diff: flink-staging/flink-gelly/src/test/java/org/apache/flink/graph/test/operations/GraphOperationsITCase.java --- @@ -266,6 +266,47 @@ public void testUnion() throws Exception { "6,1,61\n"; } +@Test +public void testDifference() throws Exception { + /* +* Test difference() +*/ +final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); + +Graph graph = Graph.fromDataSet(TestGraphUtils.getLongLongVertexData(env), +TestGraphUtils.getLongLongEdgeData(env), env); + +List> vertices = new ArrayList>(); +List> edges = new ArrayList>(); + +vertices.remove(1); +vertices.remove(3); +vertices.remove(4); + +vertices.add(new Vertex(6L,6L)); + +edges.remove(0); +edges.remove(2); +edges.remove(3); +edges.remove(4); +edges.remove(5); +edges.remove(6); + +edges.add(new Edge(6L,1L,61L)); +edges.add(new Edge(6L,3L,63L)); + +graph = graph.difference(Graph.fromCollection(vertices, edges, env)); + +graph.getEdges().writeAsCsv(resultPath); +graph.getVertices().writeAsCsv(resultPath); --- End diff -- The graph.getVertices() should actually be in a different test; that way you could change the expected result and see that the vertices you get are actually the ones you expected. 
> Add a difference method to Gelly's Graph class > -- > > Key: FLINK-2093 > URL: https://issues.apache.org/jira/browse/FLINK-2093 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 0.9 >Reporter: Andra Lungu >Assignee: Shivani Ghatge >Priority: Minor > > This method will compute the difference between two graphs, returning a new > graph containing the vertices and edges that the current graph and the input > graph don't have in common.
[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method
Github user andralungu commented on a diff in the pull request: https://github.com/apache/flink/pull/807#discussion_r31932709 --- Diff: flink-staging/flink-gelly/src/test/java/org/apache/flink/graph/test/operations/GraphOperationsITCase.java --- @@ -266,6 +266,47 @@ public void testUnion() throws Exception { "6,1,61\n"; } +@Test +public void testDifference() throws Exception { + /* +* Test difference() +*/ +final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); + +Graph graph = Graph.fromDataSet(TestGraphUtils.getLongLongVertexData(env), +TestGraphUtils.getLongLongEdgeData(env), env); + +List> vertices = new ArrayList>(); +List> edges = new ArrayList>(); + +vertices.remove(1); +vertices.remove(3); +vertices.remove(4); + +vertices.add(new Vertex(6L,6L)); + +edges.remove(0); +edges.remove(2); +edges.remove(3); +edges.remove(4); +edges.remove(5); +edges.remove(6); + +edges.add(new Edge(6L,1L,61L)); +edges.add(new Edge(6L,3L,63L)); + +graph = graph.difference(Graph.fromCollection(vertices, edges, env)); + +graph.getEdges().writeAsCsv(resultPath); +graph.getVertices().writeAsCsv(resultPath); --- End diff -- The graph.getVertices() should actually be in a different test; that way you could change the expected result and see that the vertices you get are actually the ones you expected.
[jira] [Commented] (FLINK-2161) Flink Scala Shell does not support external jars (e.g. Gelly, FlinkML)
[ https://issues.apache.org/jira/browse/FLINK-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577425#comment-14577425 ] ASF GitHub Bot commented on FLINK-2161: --- Github user nikste commented on the pull request: https://github.com/apache/flink/pull/805#issuecomment-110064581 You are right. I only tried it with the local execution environment. It does not work if you start a separate job manager on your machine. > Flink Scala Shell does not support external jars (e.g. Gelly, FlinkML) > -- > > Key: FLINK-2161 > URL: https://issues.apache.org/jira/browse/FLINK-2161 > Project: Flink > Issue Type: Improvement >Reporter: Till Rohrmann >Assignee: Nikolaas Steenbergen > > Currently, there is no easy way to load and ship external libraries/jars with > Flink's Scala shell. Assume that you want to run some Gelly graph algorithms > from within the Scala shell, then you have to put the Gelly jar manually in > the lib directory and make sure that this jar is also available on your > cluster, because it is not shipped with the user code. > It would be good to have a simple mechanism how to specify additional jars > upon startup of the Scala shell. These jars should then also be shipped to > the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2161] modified Scala shell start script...
Github user nikste commented on the pull request: https://github.com/apache/flink/pull/805#issuecomment-110064581 You are right. I only tried it with the local execution environment. It does not work if you start a separate job manager on your machine. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class
[ https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577419#comment-14577419 ] ASF GitHub Bot commented on FLINK-2093: --- Github user andralungu commented on a diff in the pull request: https://github.com/apache/flink/pull/807#discussion_r31932048 --- Diff: flink-staging/flink-gelly/src/test/java/org/apache/flink/graph/test/operations/GraphOperationsITCase.java --- @@ -266,6 +266,47 @@ public void testUnion() throws Exception { "6,1,61\n"; } +@Test +public void testDifference() throws Exception { + /* +* Test difference() +*/ +final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); + +Graph<Long, Long, Long> graph = Graph.fromDataSet(TestGraphUtils.getLongLongVertexData(env), +TestGraphUtils.getLongLongEdgeData(env), env); + +List<Vertex<Long, Long>> vertices = new ArrayList<Vertex<Long, Long>>(); +List<Edge<Long, Long, Long>> edges = new ArrayList<Edge<Long, Long, Long>>(); + +vertices.remove(1); +vertices.remove(3); +vertices.remove(4); + +vertices.add(new Vertex<Long, Long>(6L,6L)); + +edges.remove(0); +edges.remove(2); +edges.remove(3); +edges.remove(4); +edges.remove(5); +edges.remove(6); --- End diff -- same for the edges > Add a difference method to Gelly's Graph class > -- > > Key: FLINK-2093 > URL: https://issues.apache.org/jira/browse/FLINK-2093 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 0.9 >Reporter: Andra Lungu >Assignee: Shivani Ghatge >Priority: Minor > > This method will compute the difference between two graphs, returning a new > graph containing the vertices and edges that the current graph and the input > graph don't have in common. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method
Github user andralungu commented on a diff in the pull request: https://github.com/apache/flink/pull/807#discussion_r31932002 --- Diff: flink-staging/flink-gelly/src/test/java/org/apache/flink/graph/test/operations/GraphOperationsITCase.java --- @@ -266,6 +266,47 @@ public void testUnion() throws Exception { "6,1,61\n"; } +@Test +public void testDifference() throws Exception { + /* +* Test difference() +*/ +final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); + +Graph<Long, Long, Long> graph = Graph.fromDataSet(TestGraphUtils.getLongLongVertexData(env), +TestGraphUtils.getLongLongEdgeData(env), env); + +List<Vertex<Long, Long>> vertices = new ArrayList<Vertex<Long, Long>>(); +List<Edge<Long, Long, Long>> edges = new ArrayList<Edge<Long, Long, Long>>(); --- End diff -- I would put these in TestGraphUtils, one remove is fine, but three can make the code a bit difficult to read :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method
Github user andralungu commented on a diff in the pull request: https://github.com/apache/flink/pull/807#discussion_r31932048 --- Diff: flink-staging/flink-gelly/src/test/java/org/apache/flink/graph/test/operations/GraphOperationsITCase.java --- @@ -266,6 +266,47 @@ public void testUnion() throws Exception { "6,1,61\n"; } +@Test +public void testDifference() throws Exception { + /* +* Test difference() +*/ +final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); + +Graph<Long, Long, Long> graph = Graph.fromDataSet(TestGraphUtils.getLongLongVertexData(env), +TestGraphUtils.getLongLongEdgeData(env), env); + +List<Vertex<Long, Long>> vertices = new ArrayList<Vertex<Long, Long>>(); +List<Edge<Long, Long, Long>> edges = new ArrayList<Edge<Long, Long, Long>>(); + +vertices.remove(1); +vertices.remove(3); +vertices.remove(4); + +vertices.add(new Vertex<Long, Long>(6L,6L)); + +edges.remove(0); +edges.remove(2); +edges.remove(3); +edges.remove(4); +edges.remove(5); +edges.remove(6); --- End diff -- same for the edges --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class
[ https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577418#comment-14577418 ] ASF GitHub Bot commented on FLINK-2093: --- Github user andralungu commented on a diff in the pull request: https://github.com/apache/flink/pull/807#discussion_r31932002 --- Diff: flink-staging/flink-gelly/src/test/java/org/apache/flink/graph/test/operations/GraphOperationsITCase.java --- @@ -266,6 +266,47 @@ public void testUnion() throws Exception { "6,1,61\n"; } +@Test +public void testDifference() throws Exception { + /* +* Test difference() +*/ +final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); + +Graph<Long, Long, Long> graph = Graph.fromDataSet(TestGraphUtils.getLongLongVertexData(env), +TestGraphUtils.getLongLongEdgeData(env), env); + +List<Vertex<Long, Long>> vertices = new ArrayList<Vertex<Long, Long>>(); +List<Edge<Long, Long, Long>> edges = new ArrayList<Edge<Long, Long, Long>>(); --- End diff -- I would put these in TestGraphUtils, one remove is fine, but three can make the code a bit difficult to read :) > Add a difference method to Gelly's Graph class > -- > > Key: FLINK-2093 > URL: https://issues.apache.org/jira/browse/FLINK-2093 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 0.9 >Reporter: Andra Lungu >Assignee: Shivani Ghatge >Priority: Minor > > This method will compute the difference between two graphs, returning a new > graph containing the vertices and edges that the current graph and the input > graph don't have in common. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method
Github user andralungu commented on a diff in the pull request: https://github.com/apache/flink/pull/807#discussion_r31930744 --- Diff: flink-staging/flink-gelly/src/main/java/org/apache/flink/graph/Graph.java --- @@ -1233,6 +1233,34 @@ public void coGroup(Iterable<Edge<K, EV>> edge, Iterable<Edge<K, EV>> edgeToBeRemoved return new Graph<K, VV, EV>(unionedVertices, unionedEdges, this.context); } +/** + * Performs Difference on the vertices and edges sets of the inputgraphs --- End diff -- "on the vertex and edge sets of the input graphs" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method
Github user andralungu commented on a diff in the pull request: https://github.com/apache/flink/pull/807#discussion_r31930812 --- Diff: flink-staging/flink-gelly/src/main/java/org/apache/flink/graph/Graph.java --- @@ -1233,6 +1233,34 @@ public void coGroup(Iterable<Edge<K, EV>> edge, Iterable<Edge<K, EV>> edgeToBeRemoved return new Graph<K, VV, EV>(unionedVertices, unionedEdges, this.context); } +/** + * Performs Difference on the vertices and edges sets of the inputgraphs + * removes both vertices and edges with the vertex as a source/target + * @param graph the graph to perform differennce with + * @return a new graph --- End diff -- a new graph where the common vertices and edges have been removed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class
[ https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577404#comment-14577404 ] ASF GitHub Bot commented on FLINK-2093: --- Github user andralungu commented on a diff in the pull request: https://github.com/apache/flink/pull/807#discussion_r31930812 --- Diff: flink-staging/flink-gelly/src/main/java/org/apache/flink/graph/Graph.java --- @@ -1233,6 +1233,34 @@ public void coGroup(Iterable<Edge<K, EV>> edge, Iterable<Edge<K, EV>> edgeToBeRemoved return new Graph<K, VV, EV>(unionedVertices, unionedEdges, this.context); } +/** + * Performs Difference on the vertices and edges sets of the inputgraphs + * removes both vertices and edges with the vertex as a source/target + * @param graph the graph to perform differennce with + * @return a new graph --- End diff -- a new graph where the common vertices and edges have been removed > Add a difference method to Gelly's Graph class > -- > > Key: FLINK-2093 > URL: https://issues.apache.org/jira/browse/FLINK-2093 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 0.9 >Reporter: Andra Lungu >Assignee: Shivani Ghatge >Priority: Minor > > This method will compute the difference between two graphs, returning a new > graph containing the vertices and edges that the current graph and the input > graph don't have in common. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class
[ https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577403#comment-14577403 ] ASF GitHub Bot commented on FLINK-2093: --- Github user andralungu commented on a diff in the pull request: https://github.com/apache/flink/pull/807#discussion_r31930744 --- Diff: flink-staging/flink-gelly/src/main/java/org/apache/flink/graph/Graph.java --- @@ -1233,6 +1233,34 @@ public void coGroup(Iterable<Edge<K, EV>> edge, Iterable<Edge<K, EV>> edgeToBeRemoved return new Graph<K, VV, EV>(unionedVertices, unionedEdges, this.context); } +/** + * Performs Difference on the vertices and edges sets of the inputgraphs --- End diff -- "on the vertex and edge sets of the input graphs" > Add a difference method to Gelly's Graph class > -- > > Key: FLINK-2093 > URL: https://issues.apache.org/jira/browse/FLINK-2093 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 0.9 >Reporter: Andra Lungu >Assignee: Shivani Ghatge >Priority: Minor > > This method will compute the difference between two graphs, returning a new > graph containing the vertices and edges that the current graph and the input > graph don't have in common. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method
Github user andralungu commented on a diff in the pull request: https://github.com/apache/flink/pull/807#discussion_r31930244 --- Diff: docs/libs/gelly_guide.md --- @@ -236,6 +236,8 @@ Graph networkWithWeights = network.joinWithEdgesOnSource(v * Union: Gelly's `union()` method performs a union on the vertex and edges sets of the input graphs. Duplicate vertices are removed from the resulting `Graph`, while if duplicate edges exists, these will be maintained. +* Difference: Gelly's `difference()` method performs a difference on the vertex and edges sets of the input graphs. Common vertices are removed from the resulting `Graph`, along with the edges which which have these vertices as source/target. --- End diff -- you have written which twice, "along with the edges which which" :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class
[ https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577397#comment-14577397 ] ASF GitHub Bot commented on FLINK-2093: --- Github user andralungu commented on a diff in the pull request: https://github.com/apache/flink/pull/807#discussion_r31930244 --- Diff: docs/libs/gelly_guide.md --- @@ -236,6 +236,8 @@ Graph networkWithWeights = network.joinWithEdgesOnSource(v * Union: Gelly's `union()` method performs a union on the vertex and edges sets of the input graphs. Duplicate vertices are removed from the resulting `Graph`, while if duplicate edges exists, these will be maintained. +* Difference: Gelly's `difference()` method performs a difference on the vertex and edges sets of the input graphs. Common vertices are removed from the resulting `Graph`, along with the edges which which have these vertices as source/target. --- End diff -- you have written which twice, "along with the edges which which" :) > Add a difference method to Gelly's Graph class > -- > > Key: FLINK-2093 > URL: https://issues.apache.org/jira/browse/FLINK-2093 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 0.9 >Reporter: Andra Lungu >Assignee: Shivani Ghatge >Priority: Minor > > This method will compute the difference between two graphs, returning a new > graph containing the vertices and edges that the current graph and the input > graph don't have in common. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library
Github user fobeligi commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31927083 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. + * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. 
+ * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None --- End diff -- Yes ^^ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577360#comment-14577360 ] ASF GitHub Bot commented on FLINK-1844: --- Github user fobeligi commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31927083 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. 
+ * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None --- End diff -- Yes ^^ > Add Normaliser to ML library > > > Key: FLINK-1844 > URL: https://issues.apache.org/jira/browse/FLINK-1844 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library >Reporter: Faye Beligianni >Assignee: Faye Beligianni >Priority: Minor > Labels: ML, Starter > > In many algorithms in ML, the features' values would be better to lie between > a given range of values, usually in the range (0,1) [1]. Therefore, a > {{Transformer}} could be implemented to achieve that normalisation. > Resources: > [1][http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
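The normalisation FLINK-1844 asks for is plain min-max scaling: each feature value x maps to (x - xMin) / (xMax - xMin) * (max - min) + min, so all values land in the user-specified [min, max] range (default [0, 1], and setMin(-1.0) in the quoted example yields [-1, 1]). Below is a minimal array-based sketch of that rule, not the FlinkML transformer; the class name and the constant-feature fallback to the lower bound are assumptions for illustration:

```java
import java.util.Arrays;

// Hypothetical sketch of the min-max scaling rule the MinMaxScaler applies,
// written against a plain double[] instead of a Flink DataSet[Vector].
public class MinMaxScaleSketch {

    public static double[] scale(double[] feature, double min, double max) {
        // first pass: find the observed extrema of the feature
        double xMin = Double.POSITIVE_INFINITY;
        double xMax = Double.NEGATIVE_INFINITY;
        for (double x : feature) {
            xMin = Math.min(xMin, x);
            xMax = Math.max(xMax, x);
        }
        double range = xMax - xMin;
        // second pass: map each value into [min, max]
        double[] scaled = new double[feature.length];
        for (int i = 0; i < feature.length; i++) {
            // assumption: a constant feature (range == 0) maps to the lower bound
            scaled[i] = range == 0 ? min : (feature[i] - xMin) / range * (max - min) + min;
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[] scaled = scale(new double[]{2.0, 4.0, 6.0}, -1.0, 1.0);
        System.out.println(Arrays.toString(scaled)); // [-1.0, 0.0, 1.0]
    }
}
```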
[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31926077 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. + * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. 
+ * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None --- End diff -- Package private should be ok, since the test is in the same package, right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577346#comment-14577346 ] ASF GitHub Bot commented on FLINK-1844: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31926077 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. 
+ * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None --- End diff -- Package private should be ok, since the test is in the same package, right? > Add Normaliser to ML library > > > Key: FLINK-1844 > URL: https://issues.apache.org/jira/browse/FLINK-1844 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library >Reporter: Faye Beligianni >Assignee: Faye Beligianni >Priority: Minor > Labels: ML, Starter > > In many algorithms in ML, the features' values would be better to lie between > a given range of values, usually in the range (0,1) [1]. Therefore, a > {{Transformer}} could be implemented to achieve that normalisation. > Resources: > [1][http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31925630 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). 
+ +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). + +We can load the data as a `DataSet[String]` first: + +{% highlight scala %} + +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data") + +{% endhighlight %} + +The dataset has some missing values indicated by `?`. We can filter those rows out and +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. + +{% highlight scala %} + +val cancerLV = cancer + .map(_.productIterator.toList) + .filter(!_.contains("?")) + .map{list => +val numList = list.map(_.asInstanceOf[String].toDouble) +LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) +} + +{% endhighlight %} + +We can then use this data to train a learner. + +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). 
FlinkML provides utilities for loading +datasets in the LibSVM format through the `readLibSVM` function available in the `MLUtils` object. +You can also save datasets in the LibSVM format using the `writeLibSVM` function. +Let's import the Adult (a9a) dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). + +We can then simply import the dataset using: + +{% highlight scala %} + +val adultTrain = MLUtils.readLibSVM("path/to/a8a") +val adultTest = MLUtils.readLibSVM("path/to/a8a.t") --- End diff -- Hmm for FlinkML it's probably ok to
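For readers who want to check the row-parsing logic from the diff above without a Flink cluster, here is a plain-Scala sketch of the same filter-and-map step. The `DenseVector` and `LabeledVector` case classes below are simplified stand-ins for the FlinkML types, not the real API:

```scala
// Simplified stand-ins for FlinkML's DenseVector / LabeledVector (illustration only).
case class DenseVector(data: Array[Double])
case class LabeledVector(label: Double, vector: DenseVector)

// One raw row of breast-cancer-wisconsin.data has 11 fields:
// sample id, nine feature columns, and the class label (2 = benign, 4 = malignant).
def parseRow(fields: List[String]): Option[LabeledVector] =
  if (fields.contains("?")) None // drop rows with missing values, as in the doc
  else {
    val nums = fields.map(_.toDouble)
    // nums(0) is the sample id; features are columns 1-9, the label is column 10
    Some(LabeledVector(nums(10), DenseVector(nums.slice(1, 10).toArray)))
  }

val ok      = parseRow(List("1000025", "5", "1", "1", "1", "2", "1", "3", "1", "1", "2"))
val missing = parseRow(List("1057013", "8", "4", "5", "1", "?", "7", "3", "1", "1", "4"))
```

Note that with 11 fields the valid indices run from 0 to 10, which is why the label is read from index 10 rather than 11.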
[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577343#comment-14577343 ] ASF GitHub Bot commented on FLINK-2072: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31925630 --- Diff: docs/libs/ml/quickstart.md ---
[jira] [Commented] (FLINK-1877) Stalled Flink on Tez build
[ https://issues.apache.org/jira/browse/FLINK-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577342#comment-14577342 ] Till Rohrmann commented on FLINK-1877: -- Another instance of the issue I guess [https://travis-ci.org/tillrohrmann/flink/jobs/65882922] [https://flink.a.o.uce.east.s3.amazonaws.com/travis-artifacts/tillrohrmann/flink/507/507.5.tar.gz] > Stalled Flink on Tez build > -- > > Key: FLINK-1877 > URL: https://issues.apache.org/jira/browse/FLINK-1877 > Project: Flink > Issue Type: Bug > Components: Tests >Affects Versions: master >Reporter: Ufuk Celebi >Assignee: Kostas Tzoumas > > https://travis-ci.org/uce/incubator-flink/jobs/57951373 > https://s3.amazonaws.com/archive.travis-ci.org/jobs/57951373/log.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577329#comment-14577329 ] ASF GitHub Bot commented on FLINK-1844: --- Github user fobeligi commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31924947 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. 
+ * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None --- End diff -- Hey, if the {{metricsOption}} field is package private then my tests will fail, because I am also testing in the {{MinMaxScalerITSuite}} whether the min and max of each feature have been calculated correctly. > Add Normaliser to ML library > > > Key: FLINK-1844 > URL: https://issues.apache.org/jira/browse/FLINK-1844 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library >Reporter: Faye Beligianni >Assignee: Faye Beligianni >Priority: Minor > Labels: ML, Starter > > In many algorithms in ML, the features' values would be better to lie between > a given range of values, usually in the range (0,1) [1]. Therefore, a > {{Transformer}} could be implemented to achieve that normalisation. > Resources: > [1][http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
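For intuition about what the transformer in this diff computes, the per-feature rescaling can be sketched in plain Scala (no Flink, no Breeze; the real implementation fits the per-feature minima and maxima over a `DataSet`). The function name and signature here are hypothetical:

```scala
// Per-feature min-max rescaling into [newMin, newMax], mirroring what the
// transformer does after fitting: xMin/xMax are the fitted per-feature
// minima and maxima computed from the data itself.
def minMaxScale(rows: Seq[Array[Double]],
                newMin: Double = 0.0,
                newMax: Double = 1.0): Seq[Array[Double]] = {
  val dim  = rows.head.length
  val xMin = Array.tabulate(dim)(j => rows.map(_(j)).min)
  val xMax = Array.tabulate(dim)(j => rows.map(_(j)).max)
  rows.map { row =>
    Array.tabulate(dim) { j =>
      val range = xMax(j) - xMin(j)
      if (range == 0.0) newMin // a constant feature maps to the lower bound
      else (row(j) - xMin(j)) / range * (newMax - newMin) + newMin
    }
  }
}

// Two rows, two features, rescaled into [-1, 1] as with setMin(-1.0):
val scaled = minMaxScale(Seq(Array(1.0, 10.0), Array(3.0, 20.0)), -1.0, 1.0)
```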
[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library
Github user fobeligi commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31924947 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- Hey, if the {{metricsOption}} field is package private then my tests will fail, cause I am also testing in the {{MinMaxScalerITSuite}} if the min, max of each feature has been calculated correct. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class
[ https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577311#comment-14577311 ] ASF GitHub Bot commented on FLINK-2093: --- GitHub user shghatge opened a pull request: https://github.com/apache/flink/pull/807 [FLINK-2093][gelly] Added difference method Tasks given on 5th June: Add a difference function to the Graph.java Modify the docs 'gelly-guide.md' Add the test case for difference() method to GraphMutationsITCase.java You can merge this pull request into a Git repository by running: $ git pull https://github.com/shghatge/flink difference Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/807.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #807 commit 61afe247fb75fcfd22e0bdbed53a7dbbefdf65cb Author: Shivani Date: 2015-06-08T14:58:22Z [FLINK-2093][gelly] Added difference method > Add a difference method to Gelly's Graph class > -- > > Key: FLINK-2093 > URL: https://issues.apache.org/jira/browse/FLINK-2093 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 0.9 >Reporter: Andra Lungu >Assignee: Shivani Ghatge >Priority: Minor > > This method will compute the difference between two graphs, returning a new > graph containing the vertices and edges that the current graph and the input > graph don't have in common. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
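The semantics described in the issue (a new graph containing the vertices and edges the two graphs do not have in common) can be sketched with plain Scala sets. `SimpleGraph` is a hypothetical stand-in, since Gelly's `Graph` is backed by `DataSet`s of vertices and edges:

```scala
// Plain-Scala sketch of the described difference semantics: keep only the
// vertices and edges of this graph that the other graph does not also contain.
case class SimpleGraph(vertices: Set[Long], edges: Set[(Long, Long)]) {
  def difference(other: SimpleGraph): SimpleGraph = {
    val keptVertices = vertices -- other.vertices
    // an edge survives only if it is not shared and both endpoints survive
    val keptEdges = (edges -- other.edges).filter {
      case (src, dst) => keptVertices(src) && keptVertices(dst)
    }
    SimpleGraph(keptVertices, keptEdges)
  }
}

val g1   = SimpleGraph(Set(1L, 2L, 3L), Set((1L, 2L), (2L, 3L)))
val g2   = SimpleGraph(Set(3L), Set((2L, 3L)))
val diff = g1.difference(g2)
```

Filtering out edges with removed endpoints keeps the result a valid graph; whether shared vertices' dangling edges should survive is exactly the kind of detail the PR review would settle.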
[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method
GitHub user shghatge opened a pull request: https://github.com/apache/flink/pull/807 [FLINK-2093][gelly] Added difference method ---
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577298#comment-14577298 ] ASF GitHub Bot commented on FLINK-1844: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31922838 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. 
+ * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None --- End diff -- Will make the field package private. > Add Normaliser to ML library > > > Key: FLINK-1844 > URL: https://issues.apache.org/jira/browse/FLINK-1844 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library >Reporter: Faye Beligianni >Assignee: Faye Beligianni >Priority: Minor > Labels: ML, Starter > > In many algorithms in ML, the features' values would be better to lie between > a given range of values, usually in the range (0,1) [1]. Therefore, a > {{Transformer}} could be implemented to achieve that normalisation. > Resources: > [1][http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31922838 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- Will make the field package private. ---
[jira] [Updated] (FLINK-2185) Rework semantics for .setSeed function of SVM
[ https://issues.apache.org/jira/browse/FLINK-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Theodore Vasiloudis updated FLINK-2185: --- Fix Version/s: 0.10 > Rework semantics for .setSeed function of SVM > - > > Key: FLINK-2185 > URL: https://issues.apache.org/jira/browse/FLINK-2185 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library >Reporter: Theodore Vasiloudis >Priority: Minor > Labels: ML > Fix For: 0.10 > > > Currently setting the seed for the SVM does not have the expected result of > producing the same output for each run of the algorithm, potentially due to > the randomness in data partitioning. > We should then rework the way the algorithm works to either ensure we get the > exact same results when the seed is set, or if that is not possible the > setSeed function should be removed. > Also in the current implementation the default value of 0 for the seed means > that all runs for which we don't set the seed will get the same seed which is > not the intended behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
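A minimal sketch of the fix the issue suggests for the default-seed problem, in plain Scala: when no seed is set, draw a fresh one, so that unseeded runs differ while an explicit seed stays reproducible. The names here are hypothetical, not the SVM's actual API:

```scala
import scala.util.Random

// Hypothetical seed semantics: an unset seed yields a fresh random seed per
// run (instead of a constant 0), while an explicit seed is used as given.
def effectiveSeed(userSeed: Option[Long]): Long =
  userSeed.getOrElse(Random.nextLong())

// Example of seed-dependent behaviour: a reproducible shuffle of indices,
// standing in for the data partitioning the issue mentions.
def shuffledIndices(n: Int, userSeed: Option[Long]): List[Int] =
  new Random(effectiveSeed(userSeed)).shuffle((0 until n).toList)

val a = shuffledIndices(100, Some(42L))
val b = shuffledIndices(100, Some(42L))
```

With this scheme the same explicit seed always yields the same permutation, which is the reproducibility the `setSeed` function is expected to provide.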
[jira] [Updated] (FLINK-2186) Rework SVM import to support very wide files
[ https://issues.apache.org/jira/browse/FLINK-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Theodore Vasiloudis updated FLINK-2186: --- Fix Version/s: 0.10 > Rework SVM import to support very wide files > > > Key: FLINK-2186 > URL: https://issues.apache.org/jira/browse/FLINK-2186 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library, Scala API >Reporter: Theodore Vasiloudis > Fix For: 0.10 > > > In the current readCsvFile implementation, importing CSV files with many > columns can range from cumbersome to impossible. > For example to import an 11 column file we need to write: > {code} > val cancer = env.readCsvFile[(String, String, String, String, String, String, > String, String, String, String, > String)]("/path/to/breast-cancer-wisconsin.data") > {code} > For many use cases in Machine Learning we might have CSV files with thousands > or millions of columns that we want to import as vectors. > In that case using the current readCsvFile method becomes impossible. > We therefore need to rework the current function, or create a new one that > will allow us to import CSV files with an arbitrary number of columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
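Until such a method exists, one workaround consistent with the issue is to read the file as text and split each line, which avoids the fixed-arity tuple type entirely. A plain-Scala sketch of the per-line parsing step (assuming the label is the last column, as in the cancer data; `parseWideRow` is a hypothetical helper):

```scala
// Parse an arbitrarily wide CSV row into (label, features) without
// enumerating a tuple type -- e.g. the map step after env.readTextFile.
// Assumes every field is numeric and the label is the last column.
def parseWideRow(line: String): (Double, Array[Double]) = {
  val fields = line.split(',').map(_.trim.toDouble)
  (fields.last, fields.dropRight(1))
}

// A 10-column row: nine features followed by the class label.
val (label, features) = parseWideRow("5,1,1,1,2,1,3,1,1,2")
```

The same split-based parsing works unchanged for thousands or millions of columns, since the width is only discovered at runtime.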
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577292#comment-14577292 ] ASF GitHub Bot commented on FLINK-1844: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31922171 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. 
+ * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None --- End diff -- As private state, the developer should be able to choose any type. Thus, a `BreezeVector` should be fine here. I was just wondering, whether a `DenseVector` does not make more sense here. Is it safe to assume that every feature has at least 2 non-zero values? > Add Normaliser to ML library > > > Key: FLINK-1844 > URL: https://issues.apache.org/jira/browse/FLINK-1844 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library >Reporter: Faye Beligianni >Assignee: Faye Beligianni >Priority: Minor > Labels: ML, Starter > > In many algorithms in ML, the features' values would be better to lie between > a given range of values, usually in the range (0,1) [1]. Therefore, a > {{Transformer}} could be implemented to achieve that normalisation. > Resources: > [1][http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31922171 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- As private state, the developer should be able to choose any type. Thus, a `BreezeVector` should be fine here. I was just wondering, whether a `DenseVector` does not make more sense here. Is it safe to assume that every feature has at least 2 non-zero values? ---
[jira] [Updated] (FLINK-2186) Rework SVM import to support very wide files
[ https://issues.apache.org/jira/browse/FLINK-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Theodore Vasiloudis updated FLINK-2186: --- Summary: Rework SVM import to support very wide files (was: Reworj SVM import to support very wide files) > Rework SVM import to support very wide files > > > Key: FLINK-2186 > URL: https://issues.apache.org/jira/browse/FLINK-2186 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library, Scala API >Reporter: Theodore Vasiloudis > > In the current readCsvFile implementation, importing CSV files with many > columns can become from cumbersome to impossible. > For example to import an 11 column file we need to write: > {code} > val cancer = env.readCsvFile[(String, String, String, String, String, String, > String, String, String, String, > String)]("/path/to/breast-cancer-wisconsin.data") > {code} > For many use cases in Machine Learning we might have CSV files with thousands > or millions of columns that we want to import as vectors. > In that case using the current readCsvFile method becomes impossible. > We therefore need to rework the current function, or create a new one that > will allow us to import CSV files with an arbitrary number of columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
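In outline, the arbitrary-width import the issue asks for could parse each row into a collection of values instead of a fixed-arity tuple. A minimal standalone sketch in plain Scala (not the Flink API; `parseCsvRow` is a hypothetical helper, not part of any proposed interface):

```scala
// Hypothetical sketch: parse an arbitrarily wide CSV row into an
// Array[Double], so the column count never has to be spelled out in the
// type the way it must be with a tuple-typed readCsvFile call.
def parseCsvRow(line: String, delimiter: String = ","): Array[Double] =
  line.split(delimiter).map(_.trim.toDouble)
```

Applied per line of the input file, this yields one numeric vector per row regardless of whether the file has 11 columns or a million.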
[jira] [Created] (FLINK-2186) Reworj SVM import to support very wide files
Theodore Vasiloudis created FLINK-2186: -- Summary: Reworj SVM import to support very wide files Key: FLINK-2186 URL: https://issues.apache.org/jira/browse/FLINK-2186 Project: Flink Issue Type: Improvement Components: Machine Learning Library, Scala API Reporter: Theodore Vasiloudis In the current readVcsFile implementation, importing CSV files with many columns can become from cumbersome to impossible. For example to import an 11 column file wee need to write: {code} val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data") {code} For many use cases in Machine Learning we might have CSV files with thousands or millions of columns that we want to import as vectors. In that case using the current readCsvFile method becomes impossible. We therefor need to rework the current function, or create a new one that will allow us to import CSV files with an arbitrary number of columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-2186) Rework SVM import to support very wide files
[ https://issues.apache.org/jira/browse/FLINK-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Theodore Vasiloudis updated FLINK-2186: --- Description: In the current readVcsFile implementation, importing CSV files with many columns can become from cumbersome to impossible. For example to import an 11 column file we need to write: {code} val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data") {code} For many use cases in Machine Learning we might have CSV files with thousands or millions of columns that we want to import as vectors. In that case using the current readCsvFile method becomes impossible. We therefore need to rework the current function, or create a new one that will allow us to import CSV files with an arbitrary number of columns. was: In the current readVcsFile implementation, importing CSV files with many columns can become from cumbersome to impossible. For example to import an 11 column file wee need to write: {code} val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data") {code} For many use cases in Machine Learning we might have CSV files with thousands or millions of columns that we want to import as vectors. In that case using the current readCsvFile method becomes impossible. We therefore need to rework the current function, or create a new one that will allow us to import CSV files with an arbitrary number of columns. > Rework SVM import to support very wide files > > > Key: FLINK-2186 > URL: https://issues.apache.org/jira/browse/FLINK-2186 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library, Scala API >Reporter: Theodore Vasiloudis > > In the current readVcsFile implementation, importing CSV files with many > columns can become from cumbersome to impossible. 
> For example to import an 11 column file we need to write: > {code} > val cancer = env.readCsvFile[(String, String, String, String, String, String, > String, String, String, String, > String)]("/path/to/breast-cancer-wisconsin.data") > {code} > For many use cases in Machine Learning we might have CSV files with thousands > or millions of columns that we want to import as vectors. > In that case using the current readCsvFile method becomes impossible. > We therefore need to rework the current function, or create a new one that > will allow us to import CSV files with an arbitrary number of columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-2186) Rework SVM import to support very wide files
[ https://issues.apache.org/jira/browse/FLINK-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Theodore Vasiloudis updated FLINK-2186: --- Description: In the current readVcsFile implementation, importing CSV files with many columns can become from cumbersome to impossible. For example to import an 11 column file wee need to write: {code} val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data") {code} For many use cases in Machine Learning we might have CSV files with thousands or millions of columns that we want to import as vectors. In that case using the current readCsvFile method becomes impossible. We therefore need to rework the current function, or create a new one that will allow us to import CSV files with an arbitrary number of columns. was: In the current readVcsFile implementation, importing CSV files with many columns can become from cumbersome to impossible. For example to import an 11 column file wee need to write: {code} val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data") {code} For many use cases in Machine Learning we might have CSV files with thousands or millions of columns that we want to import as vectors. In that case using the current readCsvFile method becomes impossible. We therefor need to rework the current function, or create a new one that will allow us to import CSV files with an arbitrary number of columns. > Rework SVM import to support very wide files > > > Key: FLINK-2186 > URL: https://issues.apache.org/jira/browse/FLINK-2186 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library, Scala API >Reporter: Theodore Vasiloudis > > In the current readVcsFile implementation, importing CSV files with many > columns can become from cumbersome to impossible. 
> For example to import an 11 column file we need to write: > {code} > val cancer = env.readCsvFile[(String, String, String, String, String, String, > String, String, String, String, > String)]("/path/to/breast-cancer-wisconsin.data") > {code} > For many use cases in Machine Learning we might have CSV files with thousands > or millions of columns that we want to import as vectors. > In that case using the current readCsvFile method becomes impossible. > We therefore need to rework the current function, or create a new one that > will allow us to import CSV files with an arbitrary number of columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2185) Rework semantics for .setSeed function of SVM
Theodore Vasiloudis created FLINK-2185: -- Summary: Rework semantics for .setSeed function of SVM Key: FLINK-2185 URL: https://issues.apache.org/jira/browse/FLINK-2185 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Theodore Vasiloudis Priority: Minor Currently setting the seed for the SVM does not have the expected result of producing the same output for each run of the algorithm, potentially due to the randomness in data partitioning. We should then rework the way the algorithm works to either ensure we get the exact same results when the seed is set, or if that is not possible the setSeed function should be removed. Also in the current implementation the default value of 0 for the seed means that all runs for which we don't set the seed will get the same seed which is not the intended behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
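One way to get the intended behavior the issue describes can be sketched in plain Scala (illustrative names only, not the SVM implementation): an explicitly set seed makes a run reproducible, while an unset seed falls back to a freshly drawn random seed instead of a fixed default of 0.

```scala
import scala.util.Random

// Hypothetical sketch of the proposed seed semantics: Some(seed) means
// "reproduce this run exactly"; None means "use a fresh random seed",
// so independent unseeded runs do not all share seed 0.
def resolveSeed(userSeed: Option[Long]): Long =
  userSeed.getOrElse(Random.nextLong())

// A seeded shuffle standing in for the algorithm's internal randomness.
def shuffledOnce[A](xs: Seq[A], userSeed: Option[Long]): Seq[A] =
  new Random(resolveSeed(userSeed)).shuffle(xs)
```

Note that even with these semantics, identical results per seed also require the randomness to be independent of data partitioning, which is the harder part the issue points at.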
[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31921747 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. + * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. 
+ * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None + + /** Sets the minimum for the range of the transformed data +* +* @param min the user-specified minimum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMin(min: Double): MinMaxScaler = { +parameters.add(Min, min) +this + } + + /** Sets the maximum for the range of the transformed data +* +* @param max the user-specified maximum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMax(max: Double): MinMaxScaler = { +parameters.add(Max, max) +this + } +} + +object MinMaxScaler { + + // == Parameters = + + case object Min extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(0.0) + } + + case object Max extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(1.0) + } + + // Factory methods == + + def apply(): MinMaxScaler = { +new MinMaxScaler() + } + + // == Operations = + + /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and +* maximum of each feature of the training data. These values are used in the transform step +* to transform the given input data. 
+* +* @tparam T Input data type which is a subtype of [[Vector]] +* @return +*/ + implicit def fitVectorMinMaxScaler[T <: Vector] = ne
[jira] [Commented] (FLINK-2182) Add stateful Streaming Sequence Source
[ https://issues.apache.org/jira/browse/FLINK-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577285#comment-14577285 ] ASF GitHub Bot commented on FLINK-2182: --- Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/804#issuecomment-110022371 I modified the ComplexIntegrationTest to make it work with the different ordering of the new checkpointable parallel sequence source. Maybe someone should have a look at it. > Add stateful Streaming Sequence Source > -- > > Key: FLINK-2182 > URL: https://issues.apache.org/jira/browse/FLINK-2182 > Project: Flink > Issue Type: Improvement > Components: Streaming >Reporter: Aljoscha Krettek >Assignee: Aljoscha Krettek >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577286#comment-14577286 ] ASF GitHub Bot commented on FLINK-1844: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31921747 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. 
+ * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None + + /** Sets the minimum for the range of the transformed data +* +* @param min the user-specified minimum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMin(min: Double): MinMaxScaler = { +parameters.add(Min, min) +this + } + + /** Sets the maximum for the range of the transformed data +* +* @param max the user-specified maximum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. 
+*/ + def setMax(max: Double): MinMaxScaler = { +parameters.add(Max, max) +this + } +} + +object MinMaxScaler { + + // == Parameters = + + case object Min extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(0.0) + } + + case object Max extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(1.0) + } + + // Factory methods == + + def apply(): MinMaxScaler = { +new MinMaxScaler() + } + + // == Operations = + + /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and +* maximum of each feature of the training data. These
[GitHub] flink pull request: [FLINK-2182] Add stateful Streaming Sequence S...
Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/804#issuecomment-110022371 I modified the ComplexIntegrationTest to make it work with the different ordering of the new checkpointable parallel sequence source. Maybe someone should have a look at it.
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577284#comment-14577284 ] ASF GitHub Bot commented on FLINK-1844: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31921716 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. 
+ * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None + + /** Sets the minimum for the range of the transformed data +* +* @param min the user-specified minimum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMin(min: Double): MinMaxScaler = { +parameters.add(Min, min) +this + } + + /** Sets the maximum for the range of the transformed data +* +* @param max the user-specified maximum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. 
+*/ + def setMax(max: Double): MinMaxScaler = { +parameters.add(Max, max) +this + } +} + +object MinMaxScaler { + + // == Parameters = + + case object Min extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(0.0) + } + + case object Max extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(1.0) + } + + // Factory methods == + + def apply(): MinMaxScaler = { +new MinMaxScaler() + } + + // == Operations = + + /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and +* maximum of each feature of the training data. These
[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31921716 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. + * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. 
+ * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None + + /** Sets the minimum for the range of the transformed data +* +* @param min the user-specified minimum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMin(min: Double): MinMaxScaler = { +parameters.add(Min, min) +this + } + + /** Sets the maximum for the range of the transformed data +* +* @param max the user-specified maximum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMax(max: Double): MinMaxScaler = { +parameters.add(Max, max) +this + } +} + +object MinMaxScaler { + + // == Parameters = + + case object Min extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(0.0) + } + + case object Max extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(1.0) + } + + // Factory methods == + + def apply(): MinMaxScaler = { +new MinMaxScaler() + } + + // == Operations = + + /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and +* maximum of each feature of the training data. These values are used in the transform step +* to transform the given input data. 
+* +* @tparam T Input data type which is a subtype of [[Vector]] +* @return --- End diff -- Well the function's return type is a `FitO
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577280#comment-14577280 ] ASF GitHub Bot commented on FLINK-1844: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31921435 --- Diff: docs/libs/ml/minMax_scaler.md --- @@ -0,0 +1,112 @@ +--- +mathjax: include +htmlTitle: FlinkML - MinMax Scaler +title: FlinkML - MinMax Scaler +--- + + +* This will be replaced by the TOC +{:toc} + +## Description + + The MinMax scaler scales the given data set, so that all values will lie between a user specified range [min,max]. + In case the user does not provide a specific minimum and maximum value for the scaling range, the MinMax scaler transforms the features of the input data set to lie in the [0,1] interval. + Given a set of input data $x_1, x_2,... x_n$, with minimum value: + + $$x_{min} = min({x_1, x_2,..., x_n})$$ + + and maximum value: + + $$x_{max} = max({x_1, x_2,..., x_n})$$ + +The scaled data set $z_1, z_2,...,z_n$ will be: + + $$z_{i}= \frac{x_{i} - x_{min}}{x_{max} - x_{min}} \left ( max - min \right ) + min$$ + +where $\textit{min}$ and $\textit{max}$ are the user specified minimum and maximum values of the range to scale. + +## Operations + +`MinMaxScaler` is a `Transformer`. +As such, it supports the `fit` and `transform` operation. 
+ +### Fit + +MinMaxScaler is trained on all subtypes of `Vector` or `LabeledVector`: + +* `fit[T <: Vector]: DataSet[T] => Unit` +* `fit: DataSet[LabeledVector] => Unit` + +### Transform + +MinMaxScaler transforms all subtypes of `Vector` or `LabeledVector` into the respective type: + +* `transform[T <: Vector]: DataSet[T] => DataSet[T]` +* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]` + +## Parameters + +The MinMax scaler implementation can be controlled by the following two parameters: + + + + + Parameters + Description + + + + + + Min + + + The minimum value of the range for the scaled data set. (Default value: 0.0) + + + + + Max + + + The maximum value of the range for the scaled data set. (Default value: 1.0) + + + + + + +## Examples + +{% highlight scala %} +// Create MinMax scaler transformer +val minMaxscaler = MinMaxScaler() +.setMin(-1.0) --- End diff -- Will address this when merging. > Add Normaliser to ML library > > > Key: FLINK-1844 > URL: https://issues.apache.org/jira/browse/FLINK-1844 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library >Reporter: Faye Beligianni >Assignee: Faye Beligianni >Priority: Minor > Labels: ML, Starter > > In many algorithms in ML, the features' values would be better to lie between > a given range of values, usually in the range (0,1) [1]. Therefore, a > {{Transformer}} could be implemented to achieve that normalisation. > Resources: > [1][http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
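The scaling formula documented above can be checked with a small standalone sketch in plain Scala, independent of the Flink transformer (`minMaxScale` is a hypothetical helper, and the constant-feature fallback is an assumption, not behavior stated by the docs):

```scala
// Hypothetical sketch of min-max scaling:
//   z_i = (x_i - xMin) / (xMax - xMin) * (targetMax - targetMin) + targetMin
// where [targetMin, targetMax] is the user-specified output range.
def minMaxScale(xs: Seq[Double],
                targetMin: Double = 0.0,
                targetMax: Double = 1.0): Seq[Double] = {
  val xMin = xs.min
  val xMax = xs.max
  val range = xMax - xMin
  xs.map { x =>
    // A constant feature (xMax == xMin) is mapped to the lower bound here
    // to avoid division by zero; the real transformer may handle this case
    // differently.
    if (range == 0.0) targetMin
    else (x - xMin) / range * (targetMax - targetMin) + targetMin
  }
}
```

With the defaults this reproduces the [0, 1] behavior; `minMaxScale(data, -1.0, 1.0)` mirrors the `setMin(-1.0)` example from the documentation.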
[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31921435 --- Diff: docs/libs/ml/minMax_scaler.md --- @@ -0,0 +1,112 @@ +--- +mathjax: include +htmlTitle: FlinkML - MinMax Scaler +title: FlinkML - MinMax Scaler +--- + + +* This will be replaced by the TOC +{:toc} + +## Description + + The MinMax scaler scales the given data set, so that all values will lie between a user specified range [min,max]. + In case the user does not provide a specific minimum and maximum value for the scaling range, the MinMax scaler transforms the features of the input data set to lie in the [0,1] interval. + Given a set of input data $x_1, x_2,... x_n$, with minimum value: + + $$x_{min} = min({x_1, x_2,..., x_n})$$ + + and maximum value: + + $$x_{max} = max({x_1, x_2,..., x_n})$$ + +The scaled data set $z_1, z_2,...,z_n$ will be: + + $$z_{i}= \frac{x_{i} - x_{min}}{x_{max} - x_{min}} \left ( max - min \right ) + min$$ + +where $\textit{min}$ and $\textit{max}$ are the user specified minimum and maximum values of the range to scale. + +## Operations + +`MinMaxScaler` is a `Transformer`. +As such, it supports the `fit` and `transform` operation. + +### Fit + +MinMaxScaler is trained on all subtypes of `Vector` or `LabeledVector`: + +* `fit[T <: Vector]: DataSet[T] => Unit` +* `fit: DataSet[LabeledVector] => Unit` + +### Transform + +MinMaxScaler transforms all subtypes of `Vector` or `LabeledVector` into the respective type: + +* `transform[T <: Vector]: DataSet[T] => DataSet[T]` +* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]` + +## Parameters + +The MinMax scaler implementation can be controlled by the following two parameters: + + + + + Parameters + Description + + + + + + Min + + + The minimum value of the range for the scaled data set. (Default value: 0.0) + + + + + Max + + + The maximum value of the range for the scaled data set. 
(Default value: 1.0) + + + + + + +## Examples + +{% highlight scala %} +// Create MinMax scaler transformer +val minMaxscaler = MinMaxScaler() +.setMin(-1.0) --- End diff -- Will address this when merging. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
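The scaling formula in the documentation above can be illustrated outside Flink. Below is a minimal Python sketch (the helper name `min_max_scale` is hypothetical, not part of the FlinkML API) of the transformation $z_i = \frac{x_i - x_{min}}{x_{max} - x_{min}}(max - min) + min$:

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Scale values into the range [lo, hi] using
    z_i = (x_i - x_min) / (x_max - x_min) * (hi - lo) + lo."""
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    return [(x - x_min) / span * (hi - lo) + lo for x in values]
```

With the default range [0, 1] the minimum of the data set maps to 0 and the maximum to 1; with `setMin(-1.0)` as in the Scala example, the target range becomes [-1, 1].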
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577279#comment-14577279 ] ASF GitHub Bot commented on FLINK-1844: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31921419 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. 
+ * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. --- End diff -- You're right. Will add it when I merge it. > Add Normaliser to ML library > > > Key: FLINK-1844 > URL: https://issues.apache.org/jira/browse/FLINK-1844 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library >Reporter: Faye Beligianni >Assignee: Faye Beligianni >Priority: Minor > Labels: ML, Starter > > In many algorithms in ML, the features' values would be better to lie between > a given range of values, usually in the range (0,1) [1]. Therefore, a > {{Transformer}} could be implemented to achieve that normalisation. > Resources: > [1][http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31921419 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. + * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. 
+ * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. --- End diff -- You're right. Will add it when I merge it.
[jira] [Commented] (FLINK-1981) Add GZip support
[ https://issues.apache.org/jira/browse/FLINK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577248#comment-14577248 ] ASF GitHub Bot commented on FLINK-1981: --- Github user sekruse commented on the pull request: https://github.com/apache/flink/pull/762#issuecomment-110013765 Okay, will do that. > Add GZip support > > > Key: FLINK-1981 > URL: https://issues.apache.org/jira/browse/FLINK-1981 > Project: Flink > Issue Type: New Feature > Components: Core >Reporter: Sebastian Kruse >Assignee: Sebastian Kruse >Priority: Minor > Fix For: 0.9 > > > GZip, as a commonly used compression format, should be supported in the same > way as the already supported deflate files. This makes it possible to use GZip files > with any subclass of FileInputFormat.
[GitHub] flink pull request: [FLINK-1981] add support for GZIP files
Github user sekruse commented on the pull request: https://github.com/apache/flink/pull/762#issuecomment-110013765 Okay, will do that.
[jira] [Created] (FLINK-2184) Cannot get last element with maxBy/minBy
Gábor Hermann created FLINK-2184: Summary: Cannot get last element with maxBy/minBy Key: FLINK-2184 URL: https://issues.apache.org/jira/browse/FLINK-2184 Project: Flink Issue Type: Improvement Components: Scala API, Streaming Reporter: Gábor Hermann Priority: Minor In the streaming Scala API there is neither a {{maxBy(int positionToMaxBy, boolean first)}} nor a {{minBy(int positionToMinBy, boolean first)}} method like in the Java API, where the _first_ flag controls which of several tied elements will be returned. These methods should be added to the Scala API too, in order to be consistent.
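For illustration only, a minimal Python sketch of the tie-breaking behind the `first` flag (the function name is hypothetical, and the semantics shown — `first = true` keeps the earliest element with the maximal value, `first = false` the latest — follow the Java API's documented behavior, which is an assumption here, not quoted from the ticket):

```python
def max_by(elements, position, first=True):
    """Return the element whose field at `position` is maximal.
    With first=True the earliest tied element is kept; with
    first=False a later tied element replaces the earlier one."""
    best = None
    for element in elements:
        if best is None or element[position] > best[position]:
            best = element
        elif element[position] == best[position] and not first:
            best = element  # later tied element wins when first=False
    return best
```

The ticket asks for exactly this second parameter in the streaming Scala API, so that users can choose the last tied element instead of the first.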
[jira] [Closed] (FLINK-2054) StreamOperator rework removed copy calls when passing output to a chained operator
[ https://issues.apache.org/jira/browse/FLINK-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aljoscha Krettek closed FLINK-2054. --- Resolution: Fixed Fix Version/s: 0.9 Resolved in https://github.com/apache/flink/commit/26304c20ab6aab33daba775736061102bd7a2409 > StreamOperator rework removed copy calls when passing output to a chained > operator > -- > > Key: FLINK-2054 > URL: https://issues.apache.org/jira/browse/FLINK-2054 > Project: Flink > Issue Type: Bug > Components: Streaming >Reporter: Gyula Fora >Assignee: Aljoscha Krettek >Priority: Blocker > Fix For: 0.9 > > > Before the recent rework of stream operators to be push based, operators held > the semantics that any input (and also output to be specific) will not be > mutated afterwards. This was achieved by simply copying records that were > passed to other chained operators. > This feature has been removed thus introducing a major break in the operator > mutability guarantees. > To make chaining viable in all cases (and to prevent hidden bugs) we need to > reintroduce the copying logic.
[jira] [Commented] (FLINK-2054) StreamOperator rework removed copy calls when passing output to a chained operator
[ https://issues.apache.org/jira/browse/FLINK-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577229#comment-14577229 ] ASF GitHub Bot commented on FLINK-2054: --- Github user aljoscha closed the pull request at: https://github.com/apache/flink/pull/803 > StreamOperator rework removed copy calls when passing output to a chained > operator > -- > > Key: FLINK-2054 > URL: https://issues.apache.org/jira/browse/FLINK-2054 > Project: Flink > Issue Type: Bug > Components: Streaming >Reporter: Gyula Fora >Assignee: Aljoscha Krettek >Priority: Blocker > > > Before the recent rework of stream operators to be push based, operators held > the semantics that any input (and also output to be specific) will not be > mutated afterwards. This was achieved by simply copying records that were > passed to other chained operators. > This feature has been removed thus introducing a major break in the operator > mutability guarantees. > To make chaining viable in all cases (and to prevent hidden bugs) we need to > reintroduce the copying logic.
[GitHub] flink pull request: [FLINK-2054] Add object-reuse switch for strea...
Github user aljoscha closed the pull request at: https://github.com/apache/flink/pull/803
[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577224#comment-14577224 ] ASF GitHub Bot commented on FLINK-2174: --- Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110005377 Ok. So I'll remove the first if-then and update this PR. > Allow comments in 'slaves' file > --- > > Key: FLINK-2174 > URL: https://issues.apache.org/jira/browse/FLINK-2174 > Project: Flink > Issue Type: Improvement > Components: Start-Stop Scripts >Reporter: Matthias J. Sax >Assignee: Matthias J. Sax >Priority: Trivial > > Currently, each line in 'slaves' is interpreted as a host name. Scripts should > skip lines starting with '#'. Also allow for comments at the end of a line > and skip empty lines.
[jira] [Closed] (FLINK-2000) Add SQL-style aggregations for Table API
[ https://issues.apache.org/jira/browse/FLINK-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aljoscha Krettek closed FLINK-2000. --- Resolution: Fixed Fix Version/s: 0.9 Implemented in https://github.com/apache/flink/commit/7805db813dd744f13776320d556e1cefa0351464 > Add SQL-style aggregations for Table API > > > Key: FLINK-2000 > URL: https://issues.apache.org/jira/browse/FLINK-2000 > Project: Flink > Issue Type: Improvement > Components: Table API >Reporter: Aljoscha Krettek >Assignee: Cheng Hao >Priority: Minor > Fix For: 0.9 > > > Right now, the syntax for aggregations is "a.count, a.min" and so on. We > could in addition offer "COUNT(a), MIN(a)" and so on.
[jira] [Commented] (FLINK-2000) Add SQL-style aggregations for Table API
[ https://issues.apache.org/jira/browse/FLINK-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577223#comment-14577223 ] ASF GitHub Bot commented on FLINK-2000: --- Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/782#issuecomment-110005322 I merged it manually but forgot to add the tag that closes this Pull Request. Could you please close it? And also thanks for your work. :+1: Are you interested in doing a more involved extension of the Table API? I have several ideas that could be pursued. > Add SQL-style aggregations for Table API > > > Key: FLINK-2000 > URL: https://issues.apache.org/jira/browse/FLINK-2000 > Project: Flink > Issue Type: Improvement > Components: Table API >Reporter: Aljoscha Krettek >Assignee: Cheng Hao >Priority: Minor > Fix For: 0.9 > > > Right now, the syntax for aggregations is "a.count, a.min" and so on. We > could in addition offer "COUNT(a), MIN(a)" and so on.
[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file
Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110005377 Ok. So I'll remove the first if-then and update this PR.
[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577221#comment-14577221 ] ASF GitHub Bot commented on FLINK-2174: --- Github user mxm commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110005201 No need for a vote. I agree with you that it is better to have only one comment character. The cut command is also ok. > Allow comments in 'slaves' file > --- > > Key: FLINK-2174 > URL: https://issues.apache.org/jira/browse/FLINK-2174 > Project: Flink > Issue Type: Improvement > Components: Start-Stop Scripts >Reporter: Matthias J. Sax >Assignee: Matthias J. Sax >Priority: Trivial > > Currently, each line in 'slaves' is interpreted as a host name. Scripts should > skip lines starting with '#'. Also allow for comments at the end of a line > and skip empty lines.
[GitHub] flink pull request: [FLINK-2000] [table] Add sql style aggregation...
Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/782#issuecomment-110005322 I merged it manually but forgot to add the tag that closes this Pull Request. Could you please close it? And also thanks for your work. :+1: Are you interested in doing a more involved extension of the Table API? I have several ideas that could be pursued.
[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file
Github user mxm commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110005201 No need for a vote. I agree with you that it is better to have only one comment character. The cut command is also ok.
[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577216#comment-14577216 ] ASF GitHub Bot commented on FLINK-2174: --- Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110004370 I see your point, but we need to come to a conclusion. - I like my single-line command better (of course ;)) -- I think it's easier to read than any regex -- and I don't think that the cut process is an issue - the last command you suggested takes the first matching token, thus there are still multiple cut-off characters. Should we start a vote? > Allow comments in 'slaves' file > --- > > Key: FLINK-2174 > URL: https://issues.apache.org/jira/browse/FLINK-2174 > Project: Flink > Issue Type: Improvement > Components: Start-Stop Scripts >Reporter: Matthias J. Sax >Assignee: Matthias J. Sax >Priority: Trivial > > Currently, each line in 'slaves' is interpreted as a host name. Scripts should > skip lines starting with '#'. Also allow for comments at the end of a line > and skip empty lines.
[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file
Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110004370 I see your point, but we need to come to a conclusion. - I like my single-line command better (of course ;)) -- I think it's easier to read than any regex -- and I don't think that the cut process is an issue - the last command you suggested takes the first matching token, thus there are still multiple cut-off characters. Should we start a vote?
[GitHub] flink pull request: [docs] Fix some typos and grammar in the Strea...
Github user mbalassi commented on the pull request: https://github.com/apache/flink/pull/806#issuecomment-110001882 Thanks, @ggevay. I am adding my changes on top of yours, documenting the state amongst other things.
[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577214#comment-14577214 ] ASF GitHub Bot commented on FLINK-2174: --- Github user mxm commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110002485 It can be a backwards-compatibility problem if users used some custom comment character. Probably negligible. Concerning the second if statement: Apparently, there was some feature along the lines of "domain/hostname". This is not documented at all and can be removed. We can create a separate JIRA for that. > Allow comments in 'slaves' file > --- > > Key: FLINK-2174 > URL: https://issues.apache.org/jira/browse/FLINK-2174 > Project: Flink > Issue Type: Improvement > Components: Start-Stop Scripts >Reporter: Matthias J. Sax >Assignee: Matthias J. Sax >Priority: Trivial > > Currently, each line in 'slaves' is interpreted as a host name. Scripts should > skip lines starting with '#'. Also allow for comments at the end of a line > and skip empty lines.
[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file
Github user mxm commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110002485 It can be a backwards-compatibility problem if users used some custom comment character. Probably negligible. Concerning the second if statement: Apparently, there was some feature along the lines of "domain/hostname". This is not documented at all and can be removed. We can create a separate JIRA for that.
[jira] [Commented] (FLINK-2161) Flink Scala Shell does not support external jars (e.g. Gelly, FlinkML)
[ https://issues.apache.org/jira/browse/FLINK-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577207#comment-14577207 ] ASF GitHub Bot commented on FLINK-2161: --- Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/805#issuecomment-110001500 I think this is only half of what is needed. The other half would be sending the jar files along when submitting the job with the modified RemoteEnvironment. > Flink Scala Shell does not support external jars (e.g. Gelly, FlinkML) > -- > > Key: FLINK-2161 > URL: https://issues.apache.org/jira/browse/FLINK-2161 > Project: Flink > Issue Type: Improvement >Reporter: Till Rohrmann >Assignee: Nikolaas Steenbergen > > Currently, there is no easy way to load and ship external libraries/jars with > Flink's Scala shell. Assume that you want to run some Gelly graph algorithms > from within the Scala shell, then you have to put the Gelly jar manually in > the lib directory and make sure that this jar is also available on your > cluster, because it is not shipped with the user code. > It would be good to have a simple mechanism to specify additional jars > upon startup of the Scala shell. These jars should then also be shipped to > the cluster.
[GitHub] flink pull request: [FLINK-2161] modified Scala shell start script...
Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/805#issuecomment-110001500 I think this is only half of what is needed. The other half would be sending the jar files along when submitting the job with the modified RemoteEnvironment.
[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577203#comment-14577203 ] ASF GitHub Bot commented on FLINK-2174: --- Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110001000 And I don't think it's a compatibility problem. '#' is not a valid character for host names, so nobody can have it in any slaves file anyway. > Allow comments in 'slaves' file > --- > > Key: FLINK-2174 > URL: https://issues.apache.org/jira/browse/FLINK-2174 > Project: Flink > Issue Type: Improvement > Components: Start-Stop Scripts >Reporter: Matthias J. Sax >Assignee: Matthias J. Sax >Priority: Trivial > > Currently, each line in 'slaves' is interpreted as a host name. Scripts should > skip lines starting with '#'. Also allow for comments at the end of a line > and skip empty lines.
[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file
Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110001000 And I don't think it's a compatibility problem. '#' is not a valid character for host names, so nobody can have it in any slaves file anyway.
[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file
Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-10868 I did not understand the purpose of the second if-then at all... But the first does not remove comment lines. My introduced code works perfectly fine. IMHO, it's cleaner code and I would remove the first if-then completely and replace it with my single-line cut command.
[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577201#comment-14577201 ] ASF GitHub Bot commented on FLINK-2174: --- Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-10868 I did not understand the purpose of the second if-then at all... But the first does not remove comment lines. My introduced code works perfectly fine. IMHO, it's cleaner code and I would remove the first if-then completely and replace it with my single-line cut command. > Allow comments in 'slaves' file > --- > > Key: FLINK-2174 > URL: https://issues.apache.org/jira/browse/FLINK-2174 > Project: Flink > Issue Type: Improvement > Components: Start-Stop Scripts >Reporter: Matthias J. Sax >Assignee: Matthias J. Sax >Priority: Trivial > > Currently, each line in 'slaves' is interpreted as a host name. Scripts should > skip lines starting with '#'. Also allow for comments at the end of a line > and skip empty lines.
[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file
Github user mxm commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-109998768 My bad. Sorry for bothering. I had slightly modified the example. We could, however, just remove the if statement and it would work without creating an external process: ```bash [[ "hello bli blub 123 353" =~ ^([0-9a-zA-Z/.-]+).*$ ]] SLAVE=${BASH_REMATCH[1]} ``` I agree that it is cleaner to have only one comment character. The only disadvantage I see is that it breaks backwards compatibility. The whole thing needs a bit of cleanup. For example, the "network topology" seems a bit weird and is probably a legacy feature.
[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577196#comment-14577196 ] ASF GitHub Bot commented on FLINK-2174: --- Github user mxm commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-109998768 My bad. Sorry for bothering. I had slightly modified the example. We could, however, just remove the if statement and it would work without creating an external process: ```bash [[ "hello bli blub 123 353" =~ ^([0-9a-zA-Z/.-]+).*$ ]] SLAVE=${BASH_REMATCH[1]} ``` I agree that it is cleaner to have only one comment character. The only disadvantage I see is that it breaks backwards compatibility. The whole thing needs a bit of cleanup. For example, the "network topology" seems a bit weird and is probably a legacy feature. > Allow comments in 'slaves' file > --- > > Key: FLINK-2174 > URL: https://issues.apache.org/jira/browse/FLINK-2174 > Project: Flink > Issue Type: Improvement > Components: Start-Stop Scripts >Reporter: Matthias J. Sax >Assignee: Matthias J. Sax >Priority: Trivial > > Currently, each line in 'slaves' is interpreted as a host name. Scripts should > skip lines starting with '#'. Also allow for comments at the end of a line > and skip empty lines.
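The behavior requested in FLINK-2174 — cut each line of the 'slaves' file at the first '#', trim whitespace, and skip lines that end up empty — can be sketched in Python for illustration (the actual scripts are bash, and the helper name `read_slaves` is hypothetical):

```python
def read_slaves(lines):
    """Parse 'slaves' file lines: '#' starts a comment (either a full
    comment line or a trailing comment), and blank lines are skipped."""
    hosts = []
    for line in lines:
        host = line.split("#", 1)[0].strip()  # cut at first '#', trim
        if host:
            hosts.append(host)
    return hosts
```

Using a single comment character keeps the rule simple: everything from the first '#' to the end of the line is ignored, which is the semantics the discussion above converged on.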
[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library
Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31914466

--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
@@ -0,0 +1,254 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.flink.ml.preprocessing
+
+import breeze.linalg
+import breeze.linalg.{max, min}
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala._
+import org.apache.flink.ml._
+import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
+import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
+import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
+
+import scala.reflect.ClassTag
+
+/** Scales observations, so that all features are in a user-specified range.
+ * By default for [[MinMaxScaler]] transformer range = [0,1].
+ *
+ * This transformer takes a subtype of [[Vector]] of values and maps it to a
+ * scaled subtype of [[Vector]] such that each feature lies between a user-specified range.
+ *
+ * This transformer can be prepended to all [[Transformer]] and
+ * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
+ * of [[Vector]].
+ *
+ * @example
+ * {{{
+ *   val trainingDS: DataSet[Vector] = env.fromCollection(data)
+ *   val transformer = MinMaxScaler().setMin(-1.0)
+ *
+ *   transformer.fit(trainingDS)
+ *   val transformedDS = transformer.transform(trainingDS)
+ * }}}
+ *
+ * =Parameters=
+ *
+ * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0
+ * - [[Max]]: The maximum value of the range of the transformed data set; by default equal to 1
+ */
+class MinMaxScaler extends Transformer[MinMaxScaler] {
+
+  var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None
--- End diff --

Not right now, so these can remain. I was mostly concerned that this parameter was user-facing, meaning the user had to provide Breeze vectors as parameters, but that is not the case.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577193#comment-14577193 ]

ASF GitHub Bot commented on FLINK-1844:

Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31914466 (the review comment on MinMaxScaler.scala quoted in full above):

Not right now, so these can remain. I was mostly concerned that this parameter was user-facing, meaning the user had to provide Breeze vectors as parameters, but that is not the case.

> Add Normaliser to ML library
> ----------------------------
>
>          Key: FLINK-1844
>          URL: https://issues.apache.org/jira/browse/FLINK-1844
>      Project: Flink
>   Issue Type: Improvement
>   Components: Machine Learning Library
>     Reporter: Faye Beligianni
>     Assignee: Faye Beligianni
>     Priority: Minor
>       Labels: ML, Starter
>
> In many algorithms in ML, the features' values would be better to lie between a given range of values, usually in the range (0,1) [1]. Therefore, a {{Transformer}} could be implemented to achieve that normalisation.
> Resources:
> [1] http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
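For readers skimming the archive: the transformation under discussion is plain min-max scaling. The sketch below is a minimal Python illustration of the idea, not the Flink ML API; the function name and the choice to map constant features to the lower bound are my own assumptions.

```python
def min_max_scale(data, new_min=0.0, new_max=1.0):
    """Scale each feature column of `data` (a list of rows) into [new_min, new_max].

    Illustrative sketch only: the real MinMaxScaler computes the per-feature
    min/max vectors in a distributed fit phase before transforming.
    """
    n_features = len(data[0])
    mins = [min(row[j] for row in data) for j in range(n_features)]
    maxs = [max(row[j] for row in data) for j in range(n_features)]
    scaled = []
    for row in data:
        new_row = []
        for j, x in enumerate(row):
            spread = maxs[j] - mins[j]
            # Constant features map to new_min here (avoids division by zero);
            # this is a simplification of this sketch, not Flink's documented behavior.
            ratio = (x - mins[j]) / spread if spread != 0 else 0.0
            new_row.append(ratio * (new_max - new_min) + new_min)
        scaled.append(new_row)
    return scaled
```

The `setMin(-1.0)` call in the scaladoc example corresponds to `min_max_scale(data, new_min=-1.0, new_max=1.0)`, which maps each feature's smallest value to -1.0 and its largest to 1.0.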
[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577192#comment-14577192 ]

ASF GitHub Bot commented on FLINK-2174:

Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-109997612

The second example was a positive example: trailing comments are removed correctly. However, many different characters can act as comment starters (which is not good, from my point of view):

mjsax@T420s-dbis-mjsax:~$ SLAVE="host|comment"; if [[ "$SLAVE" =~ ^([0-9a-zA-Z/.-]+).*$ ]] ; then SLAVE="${BASH_REMATCH[1]}"; fi; echo $SLAVE
host
mjsax@T420s-dbis-mjsax:~$ SLAVE="host@comment"; if [[ "$SLAVE" =~ ^([0-9a-zA-Z/.-]+).*$ ]] ; then SLAVE="${BASH_REMATCH[1]}"; fi; echo $SLAVE
host
mjsax@T420s-dbis-mjsax:~$ SLAVE="host comment"; if [[ "$SLAVE" =~ ^([0-9a-zA-Z/.-]+).*$ ]] ; then SLAVE="${BASH_REMATCH[1]}"; fi; echo $SLAVE
host

> Allow comments in 'slaves' file
> -------------------------------
>
>          Key: FLINK-2174
>          URL: https://issues.apache.org/jira/browse/FLINK-2174
>      Project: Flink
>   Issue Type: Improvement
>   Components: Start-Stop Scripts
>     Reporter: Matthias J. Sax
>     Assignee: Matthias J. Sax
>     Priority: Trivial
>
> Currently, each line in 'slaves' is interpreted as a host name. Scripts should skip lines starting with '#'. Also allow for comments at the end of a line and skip empty lines.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
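The concern in the shell transcript above is that the pattern `^([0-9a-zA-Z/.-]+).*$` keeps only the leading run of hostname characters, so any character outside that class (|, @, a space) effectively starts a comment, while the ticket only asks for '#' comments. A hedged Python sketch contrasting the two behaviors (function names are my own, not from the Flink scripts):

```python
import re

# The permissive pattern from the shell snippet: keep the leading run of
# hostname characters, silently drop everything after it.
PERMISSIVE = re.compile(r"^([0-9a-zA-Z/.-]+).*$")

def parse_permissive(line):
    """Mimic the bash regex: any non-hostname character truncates the line."""
    m = PERMISSIVE.match(line.strip())
    return m.group(1) if m else None

def parse_hash_only(line):
    """Stricter variant matching the ticket's intent: only '#' starts a
    comment (full-line or trailing); blank lines yield None."""
    host = line.split("#", 1)[0].strip()
    return host or None
```

With the permissive variant, `host|comment`, `host@comment`, and `host comment` all parse to `host`, reproducing the transcript; the hash-only variant would pass `host|comment` through unchanged and reject only real '#' comments and blank lines.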