[jira] [Issue Comment Deleted] (FLINK-2183) TaskManagerFailsWithSlotSharingITCase fails.
[ https://issues.apache.org/jira/browse/FLINK-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sachin Goel updated FLINK-2183: --- Comment: was deleted (was: I'm not familiar with that. Is this expected? And is someone already working on it?) > TaskManagerFailsWithSlotSharingITCase fails. > > > Key: FLINK-2183 > URL: https://issues.apache.org/jira/browse/FLINK-2183 > Project: Flink > Issue Type: Bug >Reporter: Sachin Goel > > Travis build is failing on this test. > Here is the log output: > https://s3.amazonaws.com/archive.travis-ci.org/jobs/65871133/log.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2183) TaskManagerFailsWithSlotSharingITCase fails.
[ https://issues.apache.org/jira/browse/FLINK-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578424#comment-14578424 ] Sachin Goel commented on FLINK-2183: I'm not familiar with that. Is this expected? And is someone already working on it? > TaskManagerFailsWithSlotSharingITCase fails. > > > Key: FLINK-2183 > URL: https://issues.apache.org/jira/browse/FLINK-2183 > Project: Flink > Issue Type: Bug >Reporter: Sachin Goel > > Travis build is failing on this test. > Here is the log output: > https://s3.amazonaws.com/archive.travis-ci.org/jobs/65871133/log.txt
[jira] [Commented] (FLINK-2185) Rework semantics for .setSeed function of SVM
[ https://issues.apache.org/jira/browse/FLINK-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577893#comment-14577893 ] Dmitriy Lyubimov commented on FLINK-2185: - is SVM work in Flink non-linear or linear only? > Rework semantics for .setSeed function of SVM > - > > Key: FLINK-2185 > URL: https://issues.apache.org/jira/browse/FLINK-2185 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library >Reporter: Theodore Vasiloudis >Priority: Minor > Labels: ML > Fix For: 0.10 > > > Currently setting the seed for the SVM does not have the expected result of > producing the same output for each run of the algorithm, potentially due to > the randomness in data partitioning. > We should then rework the way the algorithm works to either ensure we get the > exact same results when the seed is set, or if that is not possible the > setSeed function should be removed. > Also in the current implementation the default value of 0 for the seed means > that all runs for which we don't set the seed will get the same seed which is > not the intended behavior.
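The default-seed problem described in the issue can be sketched outside of FlinkML: with a fixed default of 0, every unseeded run draws the same random stream, whereas treating "no seed set" as "draw a fresh seed per run" restores the intended behavior. The class and method names below are hypothetical illustrations, not the actual FlinkML API.

```java
import java.util.Random;

// Hypothetical sketch of seed handling: instead of defaulting the seed
// to a fixed 0 (which makes all unseeded runs identical), treat an
// unset seed as "draw a fresh random seed for this run".
public class SeedParameter {
    private Long seed; // null until the user explicitly calls setSeed

    public void setSeed(long s) {
        this.seed = s;
    }

    public long resolveSeed() {
        // Explicitly set seed: reproducible runs.
        // Unset seed: a new random seed per run, as a user would expect.
        return (seed != null) ? seed : new Random().nextLong();
    }
}
```

With this convention, `setSeed` only promises reproducibility when actually called; whether reproducibility can then be guaranteed despite partitioning randomness is the separate question the issue raises.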
[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class
[ https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577863#comment-14577863 ] ASF GitHub Bot commented on FLINK-2093: --- Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/807#issuecomment-110143059 Hi @shghatge, you don't need to close a PR in order to update it. You can simply update (push or push --force into) the branch from which you created the PR and Github will automatically update the PR. This helps to have all comments about your implementation in one place. > Add a difference method to Gelly's Graph class > -- > > Key: FLINK-2093 > URL: https://issues.apache.org/jira/browse/FLINK-2093 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 0.9 >Reporter: Andra Lungu >Assignee: Shivani Ghatge >Priority: Minor > > This method will compute the difference between two graphs, returning a new > graph containing the vertices and edges that the current graph and the input > graph don't have in common.
[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method
Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/807#issuecomment-110143059 Hi @shghatge, you don't need to close a PR in order to update it. You can simply update (push or push --force into) the branch from which you created the PR and Github will automatically update the PR. This helps to have all comments about your implementation in one place. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-1601) Sometimes the YARNSessionFIFOITCase fails on Travis
[ https://issues.apache.org/jira/browse/FLINK-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577841#comment-14577841 ] Sachin Goel commented on FLINK-1601: This test is failing again on travis build. The issue seems different however. Here is the log: https://api.travis-ci.org/jobs/65946506/log.txt?deansi=true > Sometimes the YARNSessionFIFOITCase fails on Travis > --- > > Key: FLINK-1601 > URL: https://issues.apache.org/jira/browse/FLINK-1601 > Project: Flink > Issue Type: Bug >Reporter: Till Rohrmann >Assignee: Robert Metzger > > Sometimes the {{YARNSessionFIFOITCase}} fails on Travis with the following > exception. > {code} > Tests run: 8, Failures: 8, Errors: 0, Skipped: 0, Time elapsed: 71.375 sec > <<< FAILURE! - in org.apache.flink.yarn.YARNSessionFIFOITCase > perJobYarnCluster(org.apache.flink.yarn.YARNSessionFIFOITCase) Time elapsed: > 60.707 sec <<< FAILURE! > java.lang.AssertionError: During the timeout period of 60 seconds the > expected string did not show up > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.apache.flink.yarn.YarnTestBase.runWithArgs(YarnTestBase.java:315) > at > org.apache.flink.yarn.YARNSessionFIFOITCase.perJobYarnCluster(YARNSessionFIFOITCase.java:185) > testQueryCluster(org.apache.flink.yarn.YARNSessionFIFOITCase) Time elapsed: > 0.507 sec <<< FAILURE! 
> java.lang.AssertionError: There is at least one application on the cluster is > not finished > at org.junit.Assert.fail(Assert.java:88) > at > org.apache.flink.yarn.YarnTestBase.checkClusterEmpty(YarnTestBase.java:146) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.runners.ParentRunner.run(ParentRunner.java:309) > at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153) > at > 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124) > at > org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200) > at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153) > at > org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103) > {code} > The result is > {code} > Failed tests: > YARNSessionFIFOITCase.perJobYarnCluster:185->YarnTestBase.runWithArgs:315 > During the timeout period of 60 seconds the expected string did not show up > YARNSessionFIFOITCase>YarnTestBase.checkClusterEmpty:146 There is at least > one application on the cluster is not finished > YARNSessionFIFOITCase>YarnTestBase.checkClusterEmpty:146 There is at least > one application on the cluster is not finished > YARNSessionFIFOITCase>YarnTestBase.checkClusterEmpty:146 There is at least > one application on the cluster is not finished > YARNSessionFIFOITCase>YarnTestBase.checkClusterEmpty:146 There is at least > one application on the cluster is not finished > YARNSessionFIFOITCase>YarnTestBase.checkClusterEmpty:146 There is at least > one application on the cluster is not finished > YARNSessi
[jira] [Updated] (FLINK-2187) KMeans clustering is not present in release-0.9-rc1
[ https://issues.apache.org/jira/browse/FLINK-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sachin Goel updated FLINK-2187: --- Affects Version/s: 0.9 > KMeans clustering is not present in release-0.9-rc1 > --- > > Key: FLINK-2187 > URL: https://issues.apache.org/jira/browse/FLINK-2187 > Project: Flink > Issue Type: Bug >Affects Versions: 0.9 >Reporter: Sachin Goel > > The Flink ml readme > [https://github.com/apache/flink/tree/release-0.9-rc1/flink-staging/flink-ml] > contains kmeans clustering in its description. However, the kmeans is not > part of the ML library yet. It is still only present in examples.
[jira] [Created] (FLINK-2187) KMeans clustering is not present in release-0.9-rc1
Sachin Goel created FLINK-2187: -- Summary: KMeans clustering is not present in release-0.9-rc1 Key: FLINK-2187 URL: https://issues.apache.org/jira/browse/FLINK-2187 Project: Flink Issue Type: Bug Reporter: Sachin Goel The Flink ml readme [https://github.com/apache/flink/tree/release-0.9-rc1/flink-staging/flink-ml] contains kmeans clustering in its description. However, the kmeans is not part of the ML library yet. It is still only present in examples.
[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577588#comment-14577588 ] ASF GitHub Bot commented on FLINK-2174: --- Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110096269 No worries :) I am used to it. And I think it's good to ensure high code quality! > Allow comments in 'slaves' file > --- > > Key: FLINK-2174 > URL: https://issues.apache.org/jira/browse/FLINK-2174 > Project: Flink > Issue Type: Improvement > Components: Start-Stop Scripts >Reporter: Matthias J. Sax >Assignee: Matthias J. Sax >Priority: Trivial > Fix For: 0.10 > > > Currently, each line in slaves is interpreted as a host name. Scripts should > skip lines starting with '#'. Also allow for comments at the end of a line > and skip empty lines.
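The parsing rules from the FLINK-2174 description (skip lines starting with '#', allow trailing comments, skip empty lines) amount to a small line filter. The actual change lives in Flink's bash start/stop scripts; the sketch below only illustrates the same logic in Java, with hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not the real bash implementation) of the intended
// 'slaves' file parsing: drop trailing '#' comments, trim whitespace,
// and skip lines that are empty or comment-only.
public class SlavesParser {
    public static List<String> parseHosts(List<String> lines) {
        List<String> hosts = new ArrayList<>();
        for (String line : lines) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // strip trailing comment
            }
            line = line.trim();
            if (!line.isEmpty()) {
                hosts.add(line); // what remains is a host name
            }
        }
        return hosts;
    }
}
```

For example, a file containing a full-line comment, `host2 # rack A`, and a blank line would yield only the host names.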
[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file
Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110096269 No worries :) I am used to it. And I think it's good to ensure high code quality!
[jira] [Updated] (FLINK-1916) EOFException when running delta-iteration job
[ https://issues.apache.org/jira/browse/FLINK-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maximilian Michels updated FLINK-1916: -- Affects Version/s: 0.10 > EOFException when running delta-iteration job > - > > Key: FLINK-1916 > URL: https://issues.apache.org/jira/browse/FLINK-1916 > Project: Flink > Issue Type: Bug > Components: Local Runtime >Affects Versions: 0.9, 0.10 > Environment: 0.9-milestone-1 > Exception on the cluster, local execution works >Reporter: Stefan Bunk > > The delta-iteration program in [1] ends with an > java.io.EOFException > at > org.apache.flink.runtime.operators.hash.InMemoryPartition$WriteView.nextSegment(InMemoryPartition.java:333) > at > org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.advance(AbstractPagedOutputView.java:140) > at > org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.writeByte(AbstractPagedOutputView.java:223) > at > org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.writeLong(AbstractPagedOutputView.java:291) > at > org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.writeDouble(AbstractPagedOutputView.java:307) > at > org.apache.flink.api.common.typeutils.base.DoubleSerializer.serialize(DoubleSerializer.java:62) > at > org.apache.flink.api.common.typeutils.base.DoubleSerializer.serialize(DoubleSerializer.java:26) > at > org.apache.flink.api.scala.typeutils.CaseClassSerializer.serialize(CaseClassSerializer.scala:89) > at > org.apache.flink.api.scala.typeutils.CaseClassSerializer.serialize(CaseClassSerializer.scala:29) > at > org.apache.flink.runtime.operators.hash.InMemoryPartition.appendRecord(InMemoryPartition.java:219) > at > org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:536) > at > org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:347) > at > 
org.apache.flink.runtime.iterative.task.IterationHeadPactTask.readInitialSolutionSet(IterationHeadPactTask.java:209) > at > org.apache.flink.runtime.iterative.task.IterationHeadPactTask.run(IterationHeadPactTask.java:270) > at > org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362) > at > org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:217) > at java.lang.Thread.run(Thread.java:745) > For logs and the accompanying mailing list discussion see below. > When running with slightly different memory configuration, as hinted on the > mailing list, I sometimes also get this exception: > 19.Apr. 13:39:29 INFO Task - IterationHead(WorksetIteration > (Resolved-Redirects)) (10/10) switched to FAILED : > java.lang.IndexOutOfBoundsException: Index: 161, Size: 161 > at java.util.ArrayList.rangeCheck(ArrayList.java:635) > at java.util.ArrayList.get(ArrayList.java:411) > at > org.apache.flink.runtime.operators.hash.InMemoryPartition$WriteView.resetTo(InMemoryPartition.java:352) > at > org.apache.flink.runtime.operators.hash.InMemoryPartition$WriteView.access$100(InMemoryPartition.java:301) > at > org.apache.flink.runtime.operators.hash.InMemoryPartition.appendRecord(InMemoryPartition.java:226) > at > org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:536) > at > org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:347) > at > org.apache.flink.runtime.iterative.task.IterationHeadPactTask.readInitialSolutionSet(IterationHeadPactTask.java:209) > at > org.apache.flink.runtime.iterative.task.IterationHeadPactTask.run(IterationHeadPactTask.java:270) > at > org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362) > at > org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:217) > at java.lang.Thread.run(Thread.java:745) > [1] Flink job: 
https://gist.github.com/knub/0bfec859a563009c1d57 > [2] Job manager logs: https://gist.github.com/knub/01e3a4b0edb8cde66ff4 > [3] One task manager's logs: https://gist.github.com/knub/8f2f953da95c8d7adefc > [4] > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/EOFException-when-running-Flink-job-td1092.html
[jira] [Resolved] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maximilian Michels resolved FLINK-2174. --- Resolution: Implemented Fix Version/s: 0.10 > Allow comments in 'slaves' file > --- > > Key: FLINK-2174 > URL: https://issues.apache.org/jira/browse/FLINK-2174 > Project: Flink > Issue Type: Improvement > Components: Start-Stop Scripts >Reporter: Matthias J. Sax >Assignee: Matthias J. Sax >Priority: Trivial > Fix For: 0.10 > > > Currently, each line in slaves is interpreted as a host name. Scripts should > skip lines starting with '#'. Also allow for comments at the end of a line > and skip empty lines.
[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class
[ https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577487#comment-14577487 ] ASF GitHub Bot commented on FLINK-2093: --- Github user shghatge closed the pull request at: https://github.com/apache/flink/pull/807 > Add a difference method to Gelly's Graph class > -- > > Key: FLINK-2093 > URL: https://issues.apache.org/jira/browse/FLINK-2093 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 0.9 >Reporter: Andra Lungu >Assignee: Shivani Ghatge >Priority: Minor > > This method will compute the difference between two graphs, returning a new > graph containing the vertices and edges that the current graph and the input > graph don't have in common.
[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method
Github user shghatge closed the pull request at: https://github.com/apache/flink/pull/807
[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method
Github user shghatge commented on the pull request: https://github.com/apache/flink/pull/807#issuecomment-110078878 Then it was just removing vertices! Talk about swatting a Fly with a Sledgehammer! I will do all the changes you suggested.
[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class
[ https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577486#comment-14577486 ] ASF GitHub Bot commented on FLINK-2093: --- Github user shghatge commented on the pull request: https://github.com/apache/flink/pull/807#issuecomment-110078878 Then it was just removing vertices! Talk about swatting a Fly with a Sledgehammer! I will do all the changes you suggested. > Add a difference method to Gelly's Graph class > -- > > Key: FLINK-2093 > URL: https://issues.apache.org/jira/browse/FLINK-2093 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 0.9 >Reporter: Andra Lungu >Assignee: Shivani Ghatge >Priority: Minor > > This method will compute the difference between two graphs, returning a new > graph containing the vertices and edges that the current graph and the input > graph don't have in common.
[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577482#comment-14577482 ] ASF GitHub Bot commented on FLINK-2174: --- Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/796 > Allow comments in 'slaves' file > --- > > Key: FLINK-2174 > URL: https://issues.apache.org/jira/browse/FLINK-2174 > Project: Flink > Issue Type: Improvement > Components: Start-Stop Scripts >Reporter: Matthias J. Sax >Assignee: Matthias J. Sax >Priority: Trivial > > Currently, each line in slaves is interpreted as a host name. Scripts should > skip lines starting with '#'. Also allow for comments at the end of a line > and skip empty lines.
[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/796
[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577479#comment-14577479 ] ASF GitHub Bot commented on FLINK-2174: --- Github user mxm commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110076740 Thank you! Sorry for giving you a hard time. > Allow comments in 'slaves' file > --- > > Key: FLINK-2174 > URL: https://issues.apache.org/jira/browse/FLINK-2174 > Project: Flink > Issue Type: Improvement > Components: Start-Stop Scripts >Reporter: Matthias J. Sax >Assignee: Matthias J. Sax >Priority: Trivial > > Currently, each line in slaves is interpreted as a host name. Scripts should > skip lines starting with '#'. Also allow for comments at the end of a line > and skip empty lines.
[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file
Github user mxm commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110076740 Thank you! Sorry for giving you a hard time.
[GitHub] flink pull request: [docs] Fix some typos and grammar in the Strea...
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/806
[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class
[ https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577454#comment-14577454 ] ASF GitHub Bot commented on FLINK-2093: --- Github user andralungu commented on the pull request: https://github.com/apache/flink/pull/807#issuecomment-110070080 Hi @shghatge , This is very nice for a first PR and I am happy to see that you followed my guidelines :) I left a set of comments in-line. Apart from those: - the difference method can be simplified. You don't need to filterOnEdges. Have a closer look at removeVertices. Imagine what happens if you remove a vertex, the edge will also have to be removed. You cannot leave an edge with just the source or the target vertex trailing. - I think you forgot to add the corner case test for an input graph which does not have common vertices with the first one. I know you wrote it :) - Finally, if you have a look at the Travis build here, it failed because you are indenting with spaces instead of tabs. You should play a bit with your IntelliJ settings. No worries! This is a rookie mistake, we all did it at first. To check everything is okay, just do a cd flink-staging/flink-gelly and then mvn verify. After it says build success, we're good to go. Rebase and update the PR. If you have questions, I'll be more than happy to answer them! Nice job! > Add a difference method to Gelly's Graph class > -- > > Key: FLINK-2093 > URL: https://issues.apache.org/jira/browse/FLINK-2093 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 0.9 >Reporter: Andra Lungu >Assignee: Shivani Ghatge >Priority: Minor > > This method will compute the difference between two graphs, returning a new > graph containing the vertices and edges that the current graph and the input > graph don't have in common.
[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method
Github user andralungu commented on the pull request: https://github.com/apache/flink/pull/807#issuecomment-110070080 Hi @shghatge , This is very nice for a first PR and I am happy to see that you followed my guidelines :) I left a set of comments in-line. Apart from those: - the difference method can be simplified. You don't need to filterOnEdges. Have a closer look at removeVertices. Imagine what happens if you remove a vertex, the edge will also have to be removed. You cannot leave an edge with just the source or the target vertex trailing. - I think you forgot to add the corner case test for an input graph which does not have common vertices with the first one. I know you wrote it :) - Finally, if you have a look at the Travis build here, it failed because you are indenting with spaces instead of tabs. You should play a bit with your IntelliJ settings. No worries! This is a rookie mistake, we all did it at first. To check everything is okay, just do a cd flink-staging/flink-gelly and then mvn verify. After it says build success, we're good to go. Rebase and update the PR. If you have questions, I'll be more than happy to answer them! Nice job!
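The reviewer's point above, that difference reduces to removing the second graph's vertices because an edge cannot survive the removal of either of its endpoints, can be sketched with plain Java collections. This is not the actual Gelly API; the types and names below are illustrative only.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the difference semantics discussed above:
// remove the other graph's vertices from this graph, and drop every
// edge left dangling by a removed endpoint (Gelly's removeVertices
// does this in one step; no separate edge filtering is needed).
public class GraphDifference {
    // edges encoded as long[]{source, target}
    public static Map.Entry<Set<Long>, List<long[]>> difference(
            Set<Long> vertices, List<long[]> edges, Set<Long> otherVertices) {
        Set<Long> keptVertices = new HashSet<>(vertices);
        keptVertices.removeAll(otherVertices); // the "removeVertices" step
        List<long[]> keptEdges = new ArrayList<>();
        for (long[] e : edges) {
            // an edge survives only if both endpoints survive
            if (keptVertices.contains(e[0]) && keptVertices.contains(e[1])) {
                keptEdges.add(e);
            }
        }
        return new AbstractMap.SimpleEntry<>(keptVertices, keptEdges);
    }
}
```

For instance, removing vertex 3 from a graph with vertices {1, 2, 3} and edges (1,2), (2,3) leaves {1, 2} and only the edge (1,2).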
[jira] [Updated] (FLINK-1877) Stalled Flink on Tez build
[ https://issues.apache.org/jira/browse/FLINK-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Metzger updated FLINK-1877: -- Component/s: Flink on Tez > Stalled Flink on Tez build > -- > > Key: FLINK-1877 > URL: https://issues.apache.org/jira/browse/FLINK-1877 > Project: Flink > Issue Type: Bug > Components: Flink on Tez, Tests >Affects Versions: master >Reporter: Ufuk Celebi >Assignee: Kostas Tzoumas > > https://travis-ci.org/uce/incubator-flink/jobs/57951373 > https://s3.amazonaws.com/archive.travis-ci.org/jobs/57951373/log.txt
[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class
[ https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577432#comment-14577432 ] ASF GitHub Bot commented on FLINK-2093: --- Github user andralungu commented on a diff in the pull request: https://github.com/apache/flink/pull/807#discussion_r31932709 --- Diff: flink-staging/flink-gelly/src/test/java/org/apache/flink/graph/test/operations/GraphOperationsITCase.java --- @@ -266,6 +266,47 @@ public void testUnion() throws Exception { "6,1,61\n"; } +@Test +public void testDifference() throws Exception { + /* +* Test difference() +*/ +final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); + +Graph graph = Graph.fromDataSet(TestGraphUtils.getLongLongVertexData(env), +TestGraphUtils.getLongLongEdgeData(env), env); + +List> vertices = new ArrayList>(); +List> edges = new ArrayList>(); + +vertices.remove(1); +vertices.remove(3); +vertices.remove(4); + +vertices.add(new Vertex(6L,6L)); + +edges.remove(0); +edges.remove(2); +edges.remove(3); +edges.remove(4); +edges.remove(5); +edges.remove(6); + +edges.add(new Edge(6L,1L,61L)); +edges.add(new Edge(6L,3L,63L)); + +graph = graph.difference(Graph.fromCollection(vertices, edges, env)); + +graph.getEdges().writeAsCsv(resultPath); +graph.getVertices().writeAsCsv(resultPath); --- End diff -- The graph.getVertices() should actually be in a different test; that way you could change the expected result and see that the vertices you get are actually the ones you expected. 
> Add a difference method to Gelly's Graph class > -- > > Key: FLINK-2093 > URL: https://issues.apache.org/jira/browse/FLINK-2093 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 0.9 >Reporter: Andra Lungu >Assignee: Shivani Ghatge >Priority: Minor > > This method will compute the difference between two graphs, returning a new > graph containing the vertices and edges that the current graph and the input > graph don't have in common.
[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method
Github user andralungu commented on a diff in the pull request: https://github.com/apache/flink/pull/807#discussion_r31932709 --- Diff: flink-staging/flink-gelly/src/test/java/org/apache/flink/graph/test/operations/GraphOperationsITCase.java --- @@ -266,6 +266,47 @@ public void testUnion() throws Exception { "6,1,61\n"; } +@Test +public void testDifference() throws Exception { + /* +* Test difference() +*/ +final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); + +Graph graph = Graph.fromDataSet(TestGraphUtils.getLongLongVertexData(env), +TestGraphUtils.getLongLongEdgeData(env), env); + +List> vertices = new ArrayList>(); +List> edges = new ArrayList>(); + +vertices.remove(1); +vertices.remove(3); +vertices.remove(4); + +vertices.add(new Vertex(6L,6L)); + +edges.remove(0); +edges.remove(2); +edges.remove(3); +edges.remove(4); +edges.remove(5); +edges.remove(6); + +edges.add(new Edge(6L,1L,61L)); +edges.add(new Edge(6L,3L,63L)); + +graph = graph.difference(Graph.fromCollection(vertices, edges, env)); + +graph.getEdges().writeAsCsv(resultPath); +graph.getVertices().writeAsCsv(resultPath); --- End diff -- The graph.getVertices() should actually be in a different test; that way you could change the expected result and see that the vertices you get are actually the ones you expected.
[jira] [Commented] (FLINK-2161) Flink Scala Shell does not support external jars (e.g. Gelly, FlinkML)
[ https://issues.apache.org/jira/browse/FLINK-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577425#comment-14577425 ] ASF GitHub Bot commented on FLINK-2161: --- Github user nikste commented on the pull request: https://github.com/apache/flink/pull/805#issuecomment-110064581 You are right. I only tried it with the local execution environment. It does not work if you start a separate job manager on your machine. > Flink Scala Shell does not support external jars (e.g. Gelly, FlinkML) > -- > > Key: FLINK-2161 > URL: https://issues.apache.org/jira/browse/FLINK-2161 > Project: Flink > Issue Type: Improvement >Reporter: Till Rohrmann >Assignee: Nikolaas Steenbergen > > Currently, there is no easy way to load and ship external libraries/jars with > Flink's Scala shell. Assume that you want to run some Gelly graph algorithms > from within the Scala shell, then you have to put the Gelly jar manually in > the lib directory and make sure that this jar is also available on your > cluster, because it is not shipped with the user code. > It would be good to have a simple mechanism how to specify additional jars > upon startup of the Scala shell. These jars should then also be shipped to > the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2161] modified Scala shell start script...
Github user nikste commented on the pull request: https://github.com/apache/flink/pull/805#issuecomment-110064581 You are right. I only tried it with the local execution environment. It does not work if you start a separate job manager on your machine. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class
[ https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577419#comment-14577419 ] ASF GitHub Bot commented on FLINK-2093: --- Github user andralungu commented on a diff in the pull request: https://github.com/apache/flink/pull/807#discussion_r31932048 --- Diff: flink-staging/flink-gelly/src/test/java/org/apache/flink/graph/test/operations/GraphOperationsITCase.java --- @@ -266,6 +266,47 @@ public void testUnion() throws Exception { "6,1,61\n"; } +@Test +public void testDifference() throws Exception { + /* +* Test difference() +*/ +final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); + +Graph<Long, Long, Long> graph = Graph.fromDataSet(TestGraphUtils.getLongLongVertexData(env), +TestGraphUtils.getLongLongEdgeData(env), env); + +List<Vertex<Long, Long>> vertices = new ArrayList<Vertex<Long, Long>>(); +List<Edge<Long, Long, Long>> edges = new ArrayList<Edge<Long, Long, Long>>(); + +vertices.remove(1); +vertices.remove(3); +vertices.remove(4); + +vertices.add(new Vertex<Long, Long>(6L,6L)); + +edges.remove(0); +edges.remove(2); +edges.remove(3); +edges.remove(4); +edges.remove(5); +edges.remove(6); --- End diff -- same for the edges > Add a difference method to Gelly's Graph class > -- > > Key: FLINK-2093 > URL: https://issues.apache.org/jira/browse/FLINK-2093 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 0.9 >Reporter: Andra Lungu >Assignee: Shivani Ghatge >Priority: Minor > > This method will compute the difference between two graphs, returning a new > graph containing the vertices and edges that the current graph and the input > graph don't have in common. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method
Github user andralungu commented on a diff in the pull request: https://github.com/apache/flink/pull/807#discussion_r31932002 --- Diff: flink-staging/flink-gelly/src/test/java/org/apache/flink/graph/test/operations/GraphOperationsITCase.java --- @@ -266,6 +266,47 @@ public void testUnion() throws Exception { "6,1,61\n"; } +@Test +public void testDifference() throws Exception { + /* +* Test difference() +*/ +final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); + +Graph<Long, Long, Long> graph = Graph.fromDataSet(TestGraphUtils.getLongLongVertexData(env), +TestGraphUtils.getLongLongEdgeData(env), env); + +List<Vertex<Long, Long>> vertices = new ArrayList<Vertex<Long, Long>>(); +List<Edge<Long, Long, Long>> edges = new ArrayList<Edge<Long, Long, Long>>(); --- End diff -- I would put these in TestGraphUtils, one remove is fine, but three can make the code a bit difficult to read :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method
Github user andralungu commented on a diff in the pull request: https://github.com/apache/flink/pull/807#discussion_r31932048 --- Diff: flink-staging/flink-gelly/src/test/java/org/apache/flink/graph/test/operations/GraphOperationsITCase.java --- @@ -266,6 +266,47 @@ public void testUnion() throws Exception { "6,1,61\n"; } +@Test +public void testDifference() throws Exception { + /* +* Test difference() +*/ +final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); + +Graph<Long, Long, Long> graph = Graph.fromDataSet(TestGraphUtils.getLongLongVertexData(env), +TestGraphUtils.getLongLongEdgeData(env), env); + +List<Vertex<Long, Long>> vertices = new ArrayList<Vertex<Long, Long>>(); +List<Edge<Long, Long, Long>> edges = new ArrayList<Edge<Long, Long, Long>>(); + +vertices.remove(1); +vertices.remove(3); +vertices.remove(4); + +vertices.add(new Vertex<Long, Long>(6L,6L)); + +edges.remove(0); +edges.remove(2); +edges.remove(3); +edges.remove(4); +edges.remove(5); +edges.remove(6); --- End diff -- same for the edges --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class
[ https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577418#comment-14577418 ] ASF GitHub Bot commented on FLINK-2093: --- Github user andralungu commented on a diff in the pull request: https://github.com/apache/flink/pull/807#discussion_r31932002 --- Diff: flink-staging/flink-gelly/src/test/java/org/apache/flink/graph/test/operations/GraphOperationsITCase.java --- @@ -266,6 +266,47 @@ public void testUnion() throws Exception { "6,1,61\n"; } +@Test +public void testDifference() throws Exception { + /* +* Test difference() +*/ +final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); + +Graph<Long, Long, Long> graph = Graph.fromDataSet(TestGraphUtils.getLongLongVertexData(env), +TestGraphUtils.getLongLongEdgeData(env), env); + +List<Vertex<Long, Long>> vertices = new ArrayList<Vertex<Long, Long>>(); +List<Edge<Long, Long, Long>> edges = new ArrayList<Edge<Long, Long, Long>>(); --- End diff -- I would put these in TestGraphUtils, one remove is fine, but three can make the code a bit difficult to read :) > Add a difference method to Gelly's Graph class > -- > > Key: FLINK-2093 > URL: https://issues.apache.org/jira/browse/FLINK-2093 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 0.9 >Reporter: Andra Lungu >Assignee: Shivani Ghatge >Priority: Minor > > This method will compute the difference between two graphs, returning a new > graph containing the vertices and edges that the current graph and the input > graph don't have in common. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method
Github user andralungu commented on a diff in the pull request: https://github.com/apache/flink/pull/807#discussion_r31930744 --- Diff: flink-staging/flink-gelly/src/main/java/org/apache/flink/graph/Graph.java --- @@ -1233,6 +1233,34 @@ public void coGroup(Iterable<Edge<K, EV>> edge, Iterable<Edge<K, EV>> edgeToBeRemoved return new Graph<K, VV, EV>(unionedVertices, unionedEdges, this.context); } +/** + * Performs Difference on the vertices and edges sets of the inputgraphs --- End diff -- "on the vertex and edge sets of the input graphs" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method
Github user andralungu commented on a diff in the pull request: https://github.com/apache/flink/pull/807#discussion_r31930812 --- Diff: flink-staging/flink-gelly/src/main/java/org/apache/flink/graph/Graph.java --- @@ -1233,6 +1233,34 @@ public void coGroup(Iterable<Edge<K, EV>> edge, Iterable<Edge<K, EV>> edgeToBeRemoved return new Graph<K, VV, EV>(unionedVertices, unionedEdges, this.context); } +/** + * Performs Difference on the vertices and edges sets of the inputgraphs + * removes both vertices and edges with the vertex as a source/target + * @param graph the graph to perform differennce with + * @return a new graph --- End diff -- a new graph where the common vertices and edges have been removed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class
[ https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577404#comment-14577404 ] ASF GitHub Bot commented on FLINK-2093: --- Github user andralungu commented on a diff in the pull request: https://github.com/apache/flink/pull/807#discussion_r31930812 --- Diff: flink-staging/flink-gelly/src/main/java/org/apache/flink/graph/Graph.java --- @@ -1233,6 +1233,34 @@ public void coGroup(Iterable<Edge<K, EV>> edge, Iterable<Edge<K, EV>> edgeToBeRemoved return new Graph<K, VV, EV>(unionedVertices, unionedEdges, this.context); } +/** + * Performs Difference on the vertices and edges sets of the inputgraphs + * removes both vertices and edges with the vertex as a source/target + * @param graph the graph to perform differennce with + * @return a new graph --- End diff -- a new graph where the common vertices and edges have been removed > Add a difference method to Gelly's Graph class > -- > > Key: FLINK-2093 > URL: https://issues.apache.org/jira/browse/FLINK-2093 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 0.9 >Reporter: Andra Lungu >Assignee: Shivani Ghatge >Priority: Minor > > This method will compute the difference between two graphs, returning a new > graph containing the vertices and edges that the current graph and the input > graph don't have in common. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class
[ https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577403#comment-14577403 ] ASF GitHub Bot commented on FLINK-2093: --- Github user andralungu commented on a diff in the pull request: https://github.com/apache/flink/pull/807#discussion_r31930744 --- Diff: flink-staging/flink-gelly/src/main/java/org/apache/flink/graph/Graph.java --- @@ -1233,6 +1233,34 @@ public void coGroup(Iterable<Edge<K, EV>> edge, Iterable<Edge<K, EV>> edgeToBeRemoved return new Graph<K, VV, EV>(unionedVertices, unionedEdges, this.context); } +/** + * Performs Difference on the vertices and edges sets of the inputgraphs --- End diff -- "on the vertex and edge sets of the input graphs" > Add a difference method to Gelly's Graph class > -- > > Key: FLINK-2093 > URL: https://issues.apache.org/jira/browse/FLINK-2093 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 0.9 >Reporter: Andra Lungu >Assignee: Shivani Ghatge >Priority: Minor > > This method will compute the difference between two graphs, returning a new > graph containing the vertices and edges that the current graph and the input > graph don't have in common. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method
Github user andralungu commented on a diff in the pull request: https://github.com/apache/flink/pull/807#discussion_r31930244 --- Diff: docs/libs/gelly_guide.md --- @@ -236,6 +236,8 @@ Graph networkWithWeights = network.joinWithEdgesOnSource(v * Union: Gelly's `union()` method performs a union on the vertex and edges sets of the input graphs. Duplicate vertices are removed from the resulting `Graph`, while if duplicate edges exists, these will be maintained. +* Difference: Gelly's `difference()` method performs a difference on the vertex and edges sets of the input graphs. Common vertices are removed from the resulting `Graph`, along with the edges which which have these vertices as source/target. --- End diff -- you have written which twice, "along with the edges which which" :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class
[ https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577397#comment-14577397 ] ASF GitHub Bot commented on FLINK-2093: --- Github user andralungu commented on a diff in the pull request: https://github.com/apache/flink/pull/807#discussion_r31930244 --- Diff: docs/libs/gelly_guide.md --- @@ -236,6 +236,8 @@ Graph networkWithWeights = network.joinWithEdgesOnSource(v * Union: Gelly's `union()` method performs a union on the vertex and edges sets of the input graphs. Duplicate vertices are removed from the resulting `Graph`, while if duplicate edges exists, these will be maintained. +* Difference: Gelly's `difference()` method performs a difference on the vertex and edges sets of the input graphs. Common vertices are removed from the resulting `Graph`, along with the edges which which have these vertices as source/target. --- End diff -- you have written which twice, "along with the edges which which" :) > Add a difference method to Gelly's Graph class > -- > > Key: FLINK-2093 > URL: https://issues.apache.org/jira/browse/FLINK-2093 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 0.9 >Reporter: Andra Lungu >Assignee: Shivani Ghatge >Priority: Minor > > This method will compute the difference between two graphs, returning a new > graph containing the vertices and edges that the current graph and the input > graph don't have in common. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library
Github user fobeligi commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31927083 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. + * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. 
+ * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None --- End diff -- Yes ^^ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577360#comment-14577360 ] ASF GitHub Bot commented on FLINK-1844: --- Github user fobeligi commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31927083 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. 
+ * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None --- End diff -- Yes ^^ > Add Normaliser to ML library > > > Key: FLINK-1844 > URL: https://issues.apache.org/jira/browse/FLINK-1844 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library >Reporter: Faye Beligianni >Assignee: Faye Beligianni >Priority: Minor > Labels: ML, Starter > > In many algorithms in ML, the features' values would be better to lie between > a given range of values, usually in the range (0,1) [1]. Therefore, a > {{Transformer}} could be implemented to achieve that normalisation. > Resources: > [1][http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
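The normalisation FLINK-1844 asks for is plain min-max scaling: each feature value x maps to (x - xMin) / (xMax - xMin) * (max - min) + min, so all values land in the user-specified [min, max] range (default [0, 1], and setMin(-1.0) in the quoted example yields [-1, 1]). Below is a minimal array-based sketch of that rule, not the FlinkML transformer; the class name and the constant-feature fallback to the lower bound are assumptions for illustration:

```java
import java.util.Arrays;

// Hypothetical sketch of the min-max scaling rule the MinMaxScaler applies,
// written against a plain double[] instead of a Flink DataSet[Vector].
public class MinMaxScaleSketch {

    public static double[] scale(double[] feature, double min, double max) {
        // first pass: find the observed extrema of the feature
        double xMin = Double.POSITIVE_INFINITY;
        double xMax = Double.NEGATIVE_INFINITY;
        for (double x : feature) {
            xMin = Math.min(xMin, x);
            xMax = Math.max(xMax, x);
        }
        double range = xMax - xMin;
        // second pass: map each value into [min, max]
        double[] scaled = new double[feature.length];
        for (int i = 0; i < feature.length; i++) {
            // assumption: a constant feature (range == 0) maps to the lower bound
            scaled[i] = range == 0 ? min : (feature[i] - xMin) / range * (max - min) + min;
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[] scaled = scale(new double[]{2.0, 4.0, 6.0}, -1.0, 1.0);
        System.out.println(Arrays.toString(scaled)); // [-1.0, 0.0, 1.0]
    }
}
```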
[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31926077 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. + * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. 
+ * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None --- End diff -- Package private should be ok, since the test is in the same package, right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577346#comment-14577346 ] ASF GitHub Bot commented on FLINK-1844: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31926077 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. 
+ * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None --- End diff -- Package private should be ok, since the test is in the same package, right? > Add Normaliser to ML library > > > Key: FLINK-1844 > URL: https://issues.apache.org/jira/browse/FLINK-1844 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library >Reporter: Faye Beligianni >Assignee: Faye Beligianni >Priority: Minor > Labels: ML, Starter > > In many algorithms in ML, the features' values would be better to lie between > a given range of values, usually in the range (0,1) [1]. Therefore, a > {{Transformer}} could be implemented to achieve that normalisation. > Resources: > [1][http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31925630 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). 
+ +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). + +We can load the data as a `DataSet[String]` first: + +{% highlight scala %} + +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data") + +{% endhighlight %} + +The dataset has some missing values indicated by `?`. We can filter those rows out and +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. + +{% highlight scala %} + +val cancerLV = cancer + .map(_.productIterator.toList) + .filter(!_.contains("?")) + .map{list => +val numList = list.map(_.asInstanceOf[String].toDouble) +LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) +} + +{% endhighlight %} + +We can then use this data to train a learner. + +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). 
FlinkML provides utilities for loading +datasets in the LibSVM format through the `readLibSVM` function available in the `MLUtils` object. +You can also save datasets in the LibSVM format using the `writeLibSVM` function. +Let's import the Adult (a9a) dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). + +We can then simply import the dataset using: + +{% highlight scala %} + +val adultTrain = MLUtils.readLibSVM("path/to/a8a") +val adultTest = MLUtils.readLibSVM("path/to/a8a.t") --- End diff -- Hmm for FlinkML it's probably ok to
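For readers who want to check the row-parsing logic from the diff above without a Flink cluster, here is a plain-Scala sketch of the same filter-and-map step. The `DenseVector` and `LabeledVector` case classes below are simplified stand-ins for the FlinkML types, not the real API:

```scala
// Simplified stand-ins for FlinkML's DenseVector / LabeledVector (illustration only).
case class DenseVector(data: Array[Double])
case class LabeledVector(label: Double, vector: DenseVector)

// One raw row of breast-cancer-wisconsin.data has 11 fields:
// sample id, nine feature columns, and the class label (2 = benign, 4 = malignant).
def parseRow(fields: List[String]): Option[LabeledVector] =
  if (fields.contains("?")) None // drop rows with missing values, as in the doc
  else {
    val nums = fields.map(_.toDouble)
    // nums(0) is the sample id; features are columns 1-9, the label is column 10
    Some(LabeledVector(nums(10), DenseVector(nums.slice(1, 10).toArray)))
  }

val ok      = parseRow(List("1000025", "5", "1", "1", "1", "2", "1", "3", "1", "1", "2"))
val missing = parseRow(List("1057013", "8", "4", "5", "1", "?", "7", "3", "1", "1", "4"))
```

Note that with 11 fields the valid indices run from 0 to 10, which is why the label is read from index 10 rather than 11.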
[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577343#comment-14577343 ] ASF GitHub Bot commented on FLINK-2072: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31925630 --- Diff: docs/libs/ml/quickstart.md ---
[jira] [Commented] (FLINK-1877) Stalled Flink on Tez build
[ https://issues.apache.org/jira/browse/FLINK-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577342#comment-14577342 ] Till Rohrmann commented on FLINK-1877: -- Another instance of the issue I guess [https://travis-ci.org/tillrohrmann/flink/jobs/65882922] [https://flink.a.o.uce.east.s3.amazonaws.com/travis-artifacts/tillrohrmann/flink/507/507.5.tar.gz] > Stalled Flink on Tez build > -- > > Key: FLINK-1877 > URL: https://issues.apache.org/jira/browse/FLINK-1877 > Project: Flink > Issue Type: Bug > Components: Tests >Affects Versions: master >Reporter: Ufuk Celebi >Assignee: Kostas Tzoumas > > https://travis-ci.org/uce/incubator-flink/jobs/57951373 > https://s3.amazonaws.com/archive.travis-ci.org/jobs/57951373/log.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577329#comment-14577329 ] ASF GitHub Bot commented on FLINK-1844: --- Github user fobeligi commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31924947 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. 
+ * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None --- End diff -- Hey, if the {{metricsOption}} field is package private then my tests will fail, because I am also testing in the {{MinMaxScalerITSuite}} whether the min and max of each feature have been calculated correctly. > Add Normaliser to ML library > > > Key: FLINK-1844 > URL: https://issues.apache.org/jira/browse/FLINK-1844 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library >Reporter: Faye Beligianni >Assignee: Faye Beligianni >Priority: Minor > Labels: ML, Starter > > In many algorithms in ML, the features' values would be better to lie between > a given range of values, usually in the range (0,1) [1]. Therefore, a > {{Transformer}} could be implemented to achieve that normalisation. > Resources: > [1][http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
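For intuition about what the transformer in this diff computes, the per-feature rescaling can be sketched in plain Scala (no Flink, no Breeze; the real implementation fits the per-feature minima and maxima over a `DataSet`). The function name and signature here are hypothetical:

```scala
// Per-feature min-max rescaling into [newMin, newMax], mirroring what the
// transformer does after fitting: xMin/xMax are the fitted per-feature
// minima and maxima computed from the data itself.
def minMaxScale(rows: Seq[Array[Double]],
                newMin: Double = 0.0,
                newMax: Double = 1.0): Seq[Array[Double]] = {
  val dim  = rows.head.length
  val xMin = Array.tabulate(dim)(j => rows.map(_(j)).min)
  val xMax = Array.tabulate(dim)(j => rows.map(_(j)).max)
  rows.map { row =>
    Array.tabulate(dim) { j =>
      val range = xMax(j) - xMin(j)
      if (range == 0.0) newMin // a constant feature maps to the lower bound
      else (row(j) - xMin(j)) / range * (newMax - newMin) + newMin
    }
  }
}

// Two rows, two features, rescaled into [-1, 1] as with setMin(-1.0):
val scaled = minMaxScale(Seq(Array(1.0, 10.0), Array(3.0, 20.0)), -1.0, 1.0)
```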
[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library
Github user fobeligi commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31924947 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- Hey, if the {{metricsOption}} field is package private then my tests will fail, cause I am also testing in the {{MinMaxScalerITSuite}} if the min, max of each feature has been calculated correct. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class
[ https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577311#comment-14577311 ] ASF GitHub Bot commented on FLINK-2093: --- GitHub user shghatge opened a pull request: https://github.com/apache/flink/pull/807 [FLINK-2093][gelly] Added difference method Tasks given on 5th June: Add a difference function to the Graph.java Modify the docs 'gelly-guide.md' Add the test case for difference() method to GraphMutationsITCase.java You can merge this pull request into a Git repository by running: $ git pull https://github.com/shghatge/flink difference Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/807.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #807 commit 61afe247fb75fcfd22e0bdbed53a7dbbefdf65cb Author: Shivani Date: 2015-06-08T14:58:22Z [FLINK-2093][gelly] Added difference method > Add a difference method to Gelly's Graph class > -- > > Key: FLINK-2093 > URL: https://issues.apache.org/jira/browse/FLINK-2093 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 0.9 >Reporter: Andra Lungu >Assignee: Shivani Ghatge >Priority: Minor > > This method will compute the difference between two graphs, returning a new > graph containing the vertices and edges that the current graph and the input > graph don't have in common. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
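The semantics described in the issue (a new graph containing the vertices and edges the two graphs do not have in common) can be sketched with plain Scala sets. `SimpleGraph` is a hypothetical stand-in, since Gelly's `Graph` is backed by `DataSet`s of vertices and edges:

```scala
// Plain-Scala sketch of the described difference semantics: keep only the
// vertices and edges of this graph that the other graph does not also contain.
case class SimpleGraph(vertices: Set[Long], edges: Set[(Long, Long)]) {
  def difference(other: SimpleGraph): SimpleGraph = {
    val keptVertices = vertices -- other.vertices
    // an edge survives only if it is not shared and both endpoints survive
    val keptEdges = (edges -- other.edges).filter {
      case (src, dst) => keptVertices(src) && keptVertices(dst)
    }
    SimpleGraph(keptVertices, keptEdges)
  }
}

val g1   = SimpleGraph(Set(1L, 2L, 3L), Set((1L, 2L), (2L, 3L)))
val g2   = SimpleGraph(Set(3L), Set((2L, 3L)))
val diff = g1.difference(g2)
```

Filtering out edges with removed endpoints keeps the result a valid graph; whether shared vertices' dangling edges should survive is exactly the kind of detail the PR review would settle.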
[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method
GitHub user shghatge opened a pull request: https://github.com/apache/flink/pull/807 [FLINK-2093][gelly] Added difference method ---
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577298#comment-14577298 ] ASF GitHub Bot commented on FLINK-1844: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31922838 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. 
+ * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None --- End diff -- Will make the field package private. > Add Normaliser to ML library > > > Key: FLINK-1844 > URL: https://issues.apache.org/jira/browse/FLINK-1844 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library >Reporter: Faye Beligianni >Assignee: Faye Beligianni >Priority: Minor > Labels: ML, Starter > > In many algorithms in ML, the features' values would be better to lie between > a given range of values, usually in the range (0,1) [1]. Therefore, a > {{Transformer}} could be implemented to achieve that normalisation. > Resources: > [1][http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31922838 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- Will make the field package private. ---
[jira] [Updated] (FLINK-2185) Rework semantics for .setSeed function of SVM
[ https://issues.apache.org/jira/browse/FLINK-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Theodore Vasiloudis updated FLINK-2185: --- Fix Version/s: 0.10 > Rework semantics for .setSeed function of SVM > - > > Key: FLINK-2185 > URL: https://issues.apache.org/jira/browse/FLINK-2185 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library >Reporter: Theodore Vasiloudis >Priority: Minor > Labels: ML > Fix For: 0.10 > > > Currently setting the seed for the SVM does not have the expected result of > producing the same output for each run of the algorithm, potentially due to > the randomness in data partitioning. > We should then rework the way the algorithm works to either ensure we get the > exact same results when the seed is set, or if that is not possible the > setSeed function should be removed. > Also in the current implementation the default value of 0 for the seed means > that all runs for which we don't set the seed will get the same seed which is > not the intended behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
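A minimal sketch of the fix the issue suggests for the default-seed problem, in plain Scala: when no seed is set, draw a fresh one, so that unseeded runs differ while an explicit seed stays reproducible. The names here are hypothetical, not the SVM's actual API:

```scala
import scala.util.Random

// Hypothetical seed semantics: an unset seed yields a fresh random seed per
// run (instead of a constant 0), while an explicit seed is used as given.
def effectiveSeed(userSeed: Option[Long]): Long =
  userSeed.getOrElse(Random.nextLong())

// Example of seed-dependent behaviour: a reproducible shuffle of indices,
// standing in for the data partitioning the issue mentions.
def shuffledIndices(n: Int, userSeed: Option[Long]): List[Int] =
  new Random(effectiveSeed(userSeed)).shuffle((0 until n).toList)

val a = shuffledIndices(100, Some(42L))
val b = shuffledIndices(100, Some(42L))
```

With this scheme the same explicit seed always yields the same permutation, which is the reproducibility the `setSeed` function is expected to provide.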
[jira] [Updated] (FLINK-2186) Rework SVM import to support very wide files
[ https://issues.apache.org/jira/browse/FLINK-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Theodore Vasiloudis updated FLINK-2186: --- Fix Version/s: 0.10 > Rework SVM import to support very wide files > > > Key: FLINK-2186 > URL: https://issues.apache.org/jira/browse/FLINK-2186 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library, Scala API >Reporter: Theodore Vasiloudis > Fix For: 0.10 > > > In the current readCsvFile implementation, importing CSV files with many > columns can range from cumbersome to impossible. > For example to import an 11 column file we need to write: > {code} > val cancer = env.readCsvFile[(String, String, String, String, String, String, > String, String, String, String, > String)]("/path/to/breast-cancer-wisconsin.data") > {code} > For many use cases in Machine Learning we might have CSV files with thousands > or millions of columns that we want to import as vectors. > In that case using the current readCsvFile method becomes impossible. > We therefore need to rework the current function, or create a new one that > will allow us to import CSV files with an arbitrary number of columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
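Until such a method exists, one workaround consistent with the issue is to read the file as text and split each line, which avoids the fixed-arity tuple type entirely. A plain-Scala sketch of the per-line parsing step (assuming the label is the last column, as in the cancer data; `parseWideRow` is a hypothetical helper):

```scala
// Parse an arbitrarily wide CSV row into (label, features) without
// enumerating a tuple type -- e.g. the map step after env.readTextFile.
// Assumes every field is numeric and the label is the last column.
def parseWideRow(line: String): (Double, Array[Double]) = {
  val fields = line.split(',').map(_.trim.toDouble)
  (fields.last, fields.dropRight(1))
}

// A 10-column row: nine features followed by the class label.
val (label, features) = parseWideRow("5,1,1,1,2,1,3,1,1,2")
```

The same split-based parsing works unchanged for thousands or millions of columns, since the width is only discovered at runtime.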
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577292#comment-14577292 ] ASF GitHub Bot commented on FLINK-1844: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31922171 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. 
+ * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None --- End diff -- As private state, the developer should be able to choose any type. Thus, a `BreezeVector` should be fine here. I was just wondering, whether a `DenseVector` does not make more sense here. Is it safe to assume that every feature has at least 2 non-zero values? > Add Normaliser to ML library > > > Key: FLINK-1844 > URL: https://issues.apache.org/jira/browse/FLINK-1844 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library >Reporter: Faye Beligianni >Assignee: Faye Beligianni >Priority: Minor > Labels: ML, Starter > > In many algorithms in ML, the features' values would be better to lie between > a given range of values, usually in the range (0,1) [1]. Therefore, a > {{Transformer}} could be implemented to achieve that normalisation. > Resources: > [1][http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31922171 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- As private state, the developer should be able to choose any type. Thus, a `BreezeVector` should be fine here. I was just wondering, whether a `DenseVector` does not make more sense here. Is it safe to assume that every feature has at least 2 non-zero values? ---
[jira] [Updated] (FLINK-2186) Rework SVM import to support very wide files
[ https://issues.apache.org/jira/browse/FLINK-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Theodore Vasiloudis updated FLINK-2186: --- Summary: Rework SVM import to support very wide files (was: Reworj SVM import to support very wide files) > Rework SVM import to support very wide files > > > Key: FLINK-2186 > URL: https://issues.apache.org/jira/browse/FLINK-2186 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library, Scala API >Reporter: Theodore Vasiloudis > > In the current readCsvFile implementation, importing CSV files with many > columns can become from cumbersome to impossible. > For example to import an 11 column file we need to write: > {code} > val cancer = env.readCsvFile[(String, String, String, String, String, String, > String, String, String, String, > String)]("/path/to/breast-cancer-wisconsin.data") > {code} > For many use cases in Machine Learning we might have CSV files with thousands > or millions of columns that we want to import as vectors. > In that case using the current readCsvFile method becomes impossible. > We therefore need to rework the current function, or create a new one that > will allow us to import CSV files with an arbitrary number of columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
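In outline, the arbitrary-width import the issue asks for could parse each row into a collection of values instead of a fixed-arity tuple. A minimal standalone sketch in plain Scala (not the Flink API; `parseCsvRow` is a hypothetical helper, not part of any proposed interface):

```scala
// Hypothetical sketch: parse an arbitrarily wide CSV row into an
// Array[Double], so the column count never has to be spelled out in the
// type the way it must be with a tuple-typed readCsvFile call.
def parseCsvRow(line: String, delimiter: String = ","): Array[Double] =
  line.split(delimiter).map(_.trim.toDouble)
```

Applied per line of the input file, this yields one numeric vector per row regardless of whether the file has 11 columns or a million.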
[jira] [Created] (FLINK-2186) Reworj SVM import to support very wide files
Theodore Vasiloudis created FLINK-2186: -- Summary: Reworj SVM import to support very wide files Key: FLINK-2186 URL: https://issues.apache.org/jira/browse/FLINK-2186 Project: Flink Issue Type: Improvement Components: Machine Learning Library, Scala API Reporter: Theodore Vasiloudis In the current readVcsFile implementation, importing CSV files with many columns can become from cumbersome to impossible. For example to import an 11 column file wee need to write: {code} val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data") {code} For many use cases in Machine Learning we might have CSV files with thousands or millions of columns that we want to import as vectors. In that case using the current readCsvFile method becomes impossible. We therefor need to rework the current function, or create a new one that will allow us to import CSV files with an arbitrary number of columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-2186) Rework SVM import to support very wide files
[ https://issues.apache.org/jira/browse/FLINK-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Theodore Vasiloudis updated FLINK-2186: --- Description: In the current readVcsFile implementation, importing CSV files with many columns can become from cumbersome to impossible. For example to import an 11 column file we need to write: {code} val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data") {code} For many use cases in Machine Learning we might have CSV files with thousands or millions of columns that we want to import as vectors. In that case using the current readCsvFile method becomes impossible. We therefore need to rework the current function, or create a new one that will allow us to import CSV files with an arbitrary number of columns. was: In the current readVcsFile implementation, importing CSV files with many columns can become from cumbersome to impossible. For example to import an 11 column file wee need to write: {code} val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data") {code} For many use cases in Machine Learning we might have CSV files with thousands or millions of columns that we want to import as vectors. In that case using the current readCsvFile method becomes impossible. We therefore need to rework the current function, or create a new one that will allow us to import CSV files with an arbitrary number of columns. > Rework SVM import to support very wide files > > > Key: FLINK-2186 > URL: https://issues.apache.org/jira/browse/FLINK-2186 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library, Scala API >Reporter: Theodore Vasiloudis > > In the current readVcsFile implementation, importing CSV files with many > columns can become from cumbersome to impossible. 
> For example to import an 11 column file we need to write: > {code} > val cancer = env.readCsvFile[(String, String, String, String, String, String, > String, String, String, String, > String)]("/path/to/breast-cancer-wisconsin.data") > {code} > For many use cases in Machine Learning we might have CSV files with thousands > or millions of columns that we want to import as vectors. > In that case using the current readCsvFile method becomes impossible. > We therefore need to rework the current function, or create a new one that > will allow us to import CSV files with an arbitrary number of columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-2186) Rework SVM import to support very wide files
[ https://issues.apache.org/jira/browse/FLINK-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Theodore Vasiloudis updated FLINK-2186: --- Description: In the current readVcsFile implementation, importing CSV files with many columns can become from cumbersome to impossible. For example to import an 11 column file wee need to write: {code} val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data") {code} For many use cases in Machine Learning we might have CSV files with thousands or millions of columns that we want to import as vectors. In that case using the current readCsvFile method becomes impossible. We therefore need to rework the current function, or create a new one that will allow us to import CSV files with an arbitrary number of columns. was: In the current readVcsFile implementation, importing CSV files with many columns can become from cumbersome to impossible. For example to import an 11 column file wee need to write: {code} val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data") {code} For many use cases in Machine Learning we might have CSV files with thousands or millions of columns that we want to import as vectors. In that case using the current readCsvFile method becomes impossible. We therefor need to rework the current function, or create a new one that will allow us to import CSV files with an arbitrary number of columns. > Rework SVM import to support very wide files > > > Key: FLINK-2186 > URL: https://issues.apache.org/jira/browse/FLINK-2186 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library, Scala API >Reporter: Theodore Vasiloudis > > In the current readVcsFile implementation, importing CSV files with many > columns can become from cumbersome to impossible. 
> For example to import an 11 column file we need to write: > {code} > val cancer = env.readCsvFile[(String, String, String, String, String, String, > String, String, String, String, > String)]("/path/to/breast-cancer-wisconsin.data") > {code} > For many use cases in Machine Learning we might have CSV files with thousands > or millions of columns that we want to import as vectors. > In that case using the current readCsvFile method becomes impossible. > We therefore need to rework the current function, or create a new one that > will allow us to import CSV files with an arbitrary number of columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2185) Rework semantics for .setSeed function of SVM
Theodore Vasiloudis created FLINK-2185: -- Summary: Rework semantics for .setSeed function of SVM Key: FLINK-2185 URL: https://issues.apache.org/jira/browse/FLINK-2185 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Theodore Vasiloudis Priority: Minor Currently setting the seed for the SVM does not have the expected result of producing the same output for each run of the algorithm, potentially due to the randomness in data partitioning. We should then rework the way the algorithm works to either ensure we get the exact same results when the seed is set, or if that is not possible the setSeed function should be removed. Also in the current implementation the default value of 0 for the seed means that all runs for which we don't set the seed will get the same seed which is not the intended behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
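One way to get the intended behavior the issue describes can be sketched in plain Scala (illustrative names only, not the SVM implementation): an explicitly set seed makes a run reproducible, while an unset seed falls back to a freshly drawn random seed instead of a fixed default of 0.

```scala
import scala.util.Random

// Hypothetical sketch of the proposed seed semantics: Some(seed) means
// "reproduce this run exactly"; None means "use a fresh random seed",
// so independent unseeded runs do not all share seed 0.
def resolveSeed(userSeed: Option[Long]): Long =
  userSeed.getOrElse(Random.nextLong())

// A seeded shuffle standing in for the algorithm's internal randomness.
def shuffledOnce[A](xs: Seq[A], userSeed: Option[Long]): Seq[A] =
  new Random(resolveSeed(userSeed)).shuffle(xs)
```

Note that even with these semantics, identical results per seed also require the randomness to be independent of data partitioning, which is the harder part the issue points at.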
[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31921747 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. + * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. 
+ * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None + + /** Sets the minimum for the range of the transformed data +* +* @param min the user-specified minimum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMin(min: Double): MinMaxScaler = { +parameters.add(Min, min) +this + } + + /** Sets the maximum for the range of the transformed data +* +* @param max the user-specified maximum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMax(max: Double): MinMaxScaler = { +parameters.add(Max, max) +this + } +} + +object MinMaxScaler { + + // == Parameters = + + case object Min extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(0.0) + } + + case object Max extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(1.0) + } + + // Factory methods == + + def apply(): MinMaxScaler = { +new MinMaxScaler() + } + + // == Operations = + + /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and +* maximum of each feature of the training data. These values are used in the transform step +* to transform the given input data. 
+* +* @tparam T Input data type which is a subtype of [[Vector]] +* @return +*/ + implicit def fitVectorMinMaxScaler[T <: Vector] = ne
[jira] [Commented] (FLINK-2182) Add stateful Streaming Sequence Source
[ https://issues.apache.org/jira/browse/FLINK-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577285#comment-14577285 ] ASF GitHub Bot commented on FLINK-2182: --- Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/804#issuecomment-110022371 I modified the ComplexIntegrationTest to make it work with the different ordering of the new checkpointable parallel sequence source. Maybe someone should have a look at it. > Add stateful Streaming Sequence Source > -- > > Key: FLINK-2182 > URL: https://issues.apache.org/jira/browse/FLINK-2182 > Project: Flink > Issue Type: Improvement > Components: Streaming >Reporter: Aljoscha Krettek >Assignee: Aljoscha Krettek >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577286#comment-14577286 ] ASF GitHub Bot commented on FLINK-1844: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31921747 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. 
+ * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None + + /** Sets the minimum for the range of the transformed data +* +* @param min the user-specified minimum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMin(min: Double): MinMaxScaler = { +parameters.add(Min, min) +this + } + + /** Sets the maximum for the range of the transformed data +* +* @param max the user-specified maximum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. 
+*/ + def setMax(max: Double): MinMaxScaler = { +parameters.add(Max, max) +this + } +} + +object MinMaxScaler { + + // == Parameters = + + case object Min extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(0.0) + } + + case object Max extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(1.0) + } + + // Factory methods == + + def apply(): MinMaxScaler = { +new MinMaxScaler() + } + + // == Operations = + + /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and +* maximum of each feature of the training data. These
[GitHub] flink pull request: [FLINK-2182] Add stateful Streaming Sequence S...
Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/804#issuecomment-110022371 I modified the ComplexIntegrationTest to make it work with the different ordering of the new checkpointable parallel sequence source. Maybe someone should have a look at it.
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577284#comment-14577284 ] ASF GitHub Bot commented on FLINK-1844: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31921716 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. 
+ * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None + + /** Sets the minimum for the range of the transformed data +* +* @param min the user-specified minimum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMin(min: Double): MinMaxScaler = { +parameters.add(Min, min) +this + } + + /** Sets the maximum for the range of the transformed data +* +* @param max the user-specified maximum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. 
+*/ + def setMax(max: Double): MinMaxScaler = { +parameters.add(Max, max) +this + } +} + +object MinMaxScaler { + + // == Parameters = + + case object Min extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(0.0) + } + + case object Max extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(1.0) + } + + // Factory methods == + + def apply(): MinMaxScaler = { +new MinMaxScaler() + } + + // == Operations = + + /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and +* maximum of each feature of the training data. These
[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31921716 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. + * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. 
+ * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None + + /** Sets the minimum for the range of the transformed data +* +* @param min the user-specified minimum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMin(min: Double): MinMaxScaler = { +parameters.add(Min, min) +this + } + + /** Sets the maximum for the range of the transformed data +* +* @param max the user-specified maximum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMax(max: Double): MinMaxScaler = { +parameters.add(Max, max) +this + } +} + +object MinMaxScaler { + + // == Parameters = + + case object Min extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(0.0) + } + + case object Max extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(1.0) + } + + // Factory methods == + + def apply(): MinMaxScaler = { +new MinMaxScaler() + } + + // == Operations = + + /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and +* maximum of each feature of the training data. These values are used in the transform step +* to transform the given input data. 
+* +* @tparam T Input data type which is a subtype of [[Vector]] +* @return --- End diff -- Well the function's return type is a `FitO
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577280#comment-14577280 ] ASF GitHub Bot commented on FLINK-1844: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31921435 --- Diff: docs/libs/ml/minMax_scaler.md --- @@ -0,0 +1,112 @@ +--- +mathjax: include +htmlTitle: FlinkML - MinMax Scaler +title: FlinkML - MinMax Scaler +--- + + +* This will be replaced by the TOC +{:toc} + +## Description + + The MinMax scaler scales the given data set, so that all values will lie between a user specified range [min,max]. + In case the user does not provide a specific minimum and maximum value for the scaling range, the MinMax scaler transforms the features of the input data set to lie in the [0,1] interval. + Given a set of input data $x_1, x_2,... x_n$, with minimum value: + + $$x_{min} = min({x_1, x_2,..., x_n})$$ + + and maximum value: + + $$x_{max} = max({x_1, x_2,..., x_n})$$ + +The scaled data set $z_1, z_2,...,z_n$ will be: + + $$z_{i}= \frac{x_{i} - x_{min}}{x_{max} - x_{min}} \left ( max - min \right ) + min$$ + +where $\textit{min}$ and $\textit{max}$ are the user specified minimum and maximum values of the range to scale. + +## Operations + +`MinMaxScaler` is a `Transformer`. +As such, it supports the `fit` and `transform` operation. 
+ +### Fit + +MinMaxScaler is trained on all subtypes of `Vector` or `LabeledVector`: + +* `fit[T <: Vector]: DataSet[T] => Unit` +* `fit: DataSet[LabeledVector] => Unit` + +### Transform + +MinMaxScaler transforms all subtypes of `Vector` or `LabeledVector` into the respective type: + +* `transform[T <: Vector]: DataSet[T] => DataSet[T]` +* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]` + +## Parameters + +The MinMax scaler implementation can be controlled by the following two parameters: + + + + + Parameters + Description + + + + + + Min + + + The minimum value of the range for the scaled data set. (Default value: 0.0) + + + + + Max + + + The maximum value of the range for the scaled data set. (Default value: 1.0) + + + + + + +## Examples + +{% highlight scala %} +// Create MinMax scaler transformer +val minMaxscaler = MinMaxScaler() +.setMin(-1.0) --- End diff -- Will address this when merging. > Add Normaliser to ML library > > > Key: FLINK-1844 > URL: https://issues.apache.org/jira/browse/FLINK-1844 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library >Reporter: Faye Beligianni >Assignee: Faye Beligianni >Priority: Minor > Labels: ML, Starter > > In many algorithms in ML, the features' values would be better to lie between > a given range of values, usually in the range (0,1) [1]. Therefore, a > {{Transformer}} could be implemented to achieve that normalisation. > Resources: > [1][http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
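The scaling formula documented above can be checked with a small standalone sketch in plain Scala, independent of the Flink transformer (`minMaxScale` is a hypothetical helper, and the constant-feature fallback is an assumption, not behavior stated by the docs):

```scala
// Hypothetical sketch of min-max scaling:
//   z_i = (x_i - xMin) / (xMax - xMin) * (targetMax - targetMin) + targetMin
// where [targetMin, targetMax] is the user-specified output range.
def minMaxScale(xs: Seq[Double],
                targetMin: Double = 0.0,
                targetMax: Double = 1.0): Seq[Double] = {
  val xMin = xs.min
  val xMax = xs.max
  val range = xMax - xMin
  xs.map { x =>
    // A constant feature (xMax == xMin) is mapped to the lower bound here
    // to avoid division by zero; the real transformer may handle this case
    // differently.
    if (range == 0.0) targetMin
    else (x - xMin) / range * (targetMax - targetMin) + targetMin
  }
}
```

With the defaults this reproduces the [0, 1] behavior; `minMaxScale(data, -1.0, 1.0)` mirrors the `setMin(-1.0)` example from the documentation.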
[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31921435 --- Diff: docs/libs/ml/minMax_scaler.md --- @@ -0,0 +1,112 @@ +--- +mathjax: include +htmlTitle: FlinkML - MinMax Scaler +title: FlinkML - MinMax Scaler +--- + + +* This will be replaced by the TOC +{:toc} + +## Description + + The MinMax scaler scales the given data set, so that all values will lie between a user specified range [min,max]. + In case the user does not provide a specific minimum and maximum value for the scaling range, the MinMax scaler transforms the features of the input data set to lie in the [0,1] interval. + Given a set of input data $x_1, x_2,... x_n$, with minimum value: + + $$x_{min} = min({x_1, x_2,..., x_n})$$ + + and maximum value: + + $$x_{max} = max({x_1, x_2,..., x_n})$$ + +The scaled data set $z_1, z_2,...,z_n$ will be: + + $$z_{i}= \frac{x_{i} - x_{min}}{x_{max} - x_{min}} \left ( max - min \right ) + min$$ + +where $\textit{min}$ and $\textit{max}$ are the user specified minimum and maximum values of the range to scale. + +## Operations + +`MinMaxScaler` is a `Transformer`. +As such, it supports the `fit` and `transform` operation. + +### Fit + +MinMaxScaler is trained on all subtypes of `Vector` or `LabeledVector`: + +* `fit[T <: Vector]: DataSet[T] => Unit` +* `fit: DataSet[LabeledVector] => Unit` + +### Transform + +MinMaxScaler transforms all subtypes of `Vector` or `LabeledVector` into the respective type: + +* `transform[T <: Vector]: DataSet[T] => DataSet[T]` +* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]` + +## Parameters + +The MinMax scaler implementation can be controlled by the following two parameters: + + + + + Parameters + Description + + + + + + Min + + + The minimum value of the range for the scaled data set. (Default value: 0.0) + + + + + Max + + + The maximum value of the range for the scaled data set. 
(Default value: 1.0) + + + + + + +## Examples + +{% highlight scala %} +// Create MinMax scaler transformer +val minMaxscaler = MinMaxScaler() +.setMin(-1.0) --- End diff -- Will address this when merging. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
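The scaling formula in the documentation above can be illustrated outside Flink. Below is a minimal Python sketch (the helper name `min_max_scale` is hypothetical, not part of the FlinkML API) of the transformation $z_i = \frac{x_i - x_{min}}{x_{max} - x_{min}}(max - min) + min$:

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Scale values into the range [lo, hi] using
    z_i = (x_i - x_min) / (x_max - x_min) * (hi - lo) + lo."""
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    return [(x - x_min) / span * (hi - lo) + lo for x in values]
```

With the default range [0, 1] the minimum of the data set maps to 0 and the maximum to 1; with `setMin(-1.0)` as in the Scala example, the target range becomes [-1, 1].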
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577279#comment-14577279 ] ASF GitHub Bot commented on FLINK-1844: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31921419 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. 
+ * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. --- End diff -- You're right. Will add it when I merge it. > Add Normaliser to ML library > > > Key: FLINK-1844 > URL: https://issues.apache.org/jira/browse/FLINK-1844 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library >Reporter: Faye Beligianni >Assignee: Faye Beligianni >Priority: Minor > Labels: ML, Starter > > In many algorithms in ML, the features' values would be better to lie between > a given range of values, usually in the range (0,1) [1]. Therefore, a > {{Transformer}} could be implemented to achieve that normalisation. > Resources: > [1][http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31921419 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. + * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. 
+ * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. --- End diff -- You're right. Will add it when I merge it.
[jira] [Commented] (FLINK-1981) Add GZip support
[ https://issues.apache.org/jira/browse/FLINK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577248#comment-14577248 ] ASF GitHub Bot commented on FLINK-1981: --- Github user sekruse commented on the pull request: https://github.com/apache/flink/pull/762#issuecomment-110013765 Okay, will do that. > Add GZip support > > > Key: FLINK-1981 > URL: https://issues.apache.org/jira/browse/FLINK-1981 > Project: Flink > Issue Type: New Feature > Components: Core >Reporter: Sebastian Kruse >Assignee: Sebastian Kruse >Priority: Minor > Fix For: 0.9 > > > GZip, as a commonly used compression format, should be supported in the same > way as the already supported deflate files. This makes it possible to use GZip files > with any subclass of FileInputFormat.
[GitHub] flink pull request: [FLINK-1981] add support for GZIP files
Github user sekruse commented on the pull request: https://github.com/apache/flink/pull/762#issuecomment-110013765 Okay, will do that.
[jira] [Created] (FLINK-2184) Cannot get last element with maxBy/minBy
Gábor Hermann created FLINK-2184: Summary: Cannot get last element with maxBy/minBy Key: FLINK-2184 URL: https://issues.apache.org/jira/browse/FLINK-2184 Project: Flink Issue Type: Improvement Components: Scala API, Streaming Reporter: Gábor Hermann Priority: Minor In the streaming Scala API there is neither a {{maxBy(int positionToMaxBy, boolean first)}} nor a {{minBy(int positionToMinBy, boolean first)}} method like in the Java API, where the _first_ flag controls which of several tied elements will be returned. These methods should be added to the Scala API too, in order to be consistent.
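For illustration only, a minimal Python sketch of the tie-breaking behind the `first` flag (the function name is hypothetical, and the semantics shown — `first = true` keeps the earliest element with the maximal value, `first = false` the latest — follow the Java API's documented behavior, which is an assumption here, not quoted from the ticket):

```python
def max_by(elements, position, first=True):
    """Return the element whose field at `position` is maximal.
    With first=True the earliest tied element is kept; with
    first=False a later tied element replaces the earlier one."""
    best = None
    for element in elements:
        if best is None or element[position] > best[position]:
            best = element
        elif element[position] == best[position] and not first:
            best = element  # later tied element wins when first=False
    return best
```

The ticket asks for exactly this second parameter in the streaming Scala API, so that users can choose the last tied element instead of the first.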
[jira] [Closed] (FLINK-2054) StreamOperator rework removed copy calls when passing output to a chained operator
[ https://issues.apache.org/jira/browse/FLINK-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aljoscha Krettek closed FLINK-2054. --- Resolution: Fixed Fix Version/s: 0.9 Resolved in https://github.com/apache/flink/commit/26304c20ab6aab33daba775736061102bd7a2409 > StreamOperator rework removed copy calls when passing output to a chained > operator > -- > > Key: FLINK-2054 > URL: https://issues.apache.org/jira/browse/FLINK-2054 > Project: Flink > Issue Type: Bug > Components: Streaming >Reporter: Gyula Fora >Assignee: Aljoscha Krettek >Priority: Blocker > Fix For: 0.9 > > > Before the recent rework of stream operators to be push based, operators held > the semantics that any input (and also output to be specific) will not be > mutated afterwards. This was achieved by simply copying records that were > passed to other chained operators. > This feature has been removed thus introducing a major break in the operator > mutability guarantees. > To make chaining viable in all cases (and to prevent hidden bugs) we need to > reintroduce the copying logic.
[jira] [Commented] (FLINK-2054) StreamOperator rework removed copy calls when passing output to a chained operator
[ https://issues.apache.org/jira/browse/FLINK-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577229#comment-14577229 ] ASF GitHub Bot commented on FLINK-2054: --- Github user aljoscha closed the pull request at: https://github.com/apache/flink/pull/803 > StreamOperator rework removed copy calls when passing output to a chained > operator > -- > > Key: FLINK-2054 > URL: https://issues.apache.org/jira/browse/FLINK-2054 > Project: Flink > Issue Type: Bug > Components: Streaming >Reporter: Gyula Fora >Assignee: Aljoscha Krettek >Priority: Blocker > > > Before the recent rework of stream operators to be push based, operators held > the semantics that any input (and also output to be specific) will not be > mutated afterwards. This was achieved by simply copying records that were > passed to other chained operators. > This feature has been removed thus introducing a major break in the operator > mutability guarantees. > To make chaining viable in all cases (and to prevent hidden bugs) we need to > reintroduce the copying logic.
[GitHub] flink pull request: [FLINK-2054] Add object-reuse switch for strea...
Github user aljoscha closed the pull request at: https://github.com/apache/flink/pull/803
[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577224#comment-14577224 ] ASF GitHub Bot commented on FLINK-2174: --- Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110005377 Ok. So I'll remove the first if-then and update this PR. > Allow comments in 'slaves' file > --- > > Key: FLINK-2174 > URL: https://issues.apache.org/jira/browse/FLINK-2174 > Project: Flink > Issue Type: Improvement > Components: Start-Stop Scripts >Reporter: Matthias J. Sax >Assignee: Matthias J. Sax >Priority: Trivial > > Currently, each line in 'slaves' is interpreted as a host name. Scripts should > skip lines starting with '#'. Also allow for comments at the end of a line > and skip empty lines.
[jira] [Closed] (FLINK-2000) Add SQL-style aggregations for Table API
[ https://issues.apache.org/jira/browse/FLINK-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aljoscha Krettek closed FLINK-2000. --- Resolution: Fixed Fix Version/s: 0.9 Implemented in https://github.com/apache/flink/commit/7805db813dd744f13776320d556e1cefa0351464 > Add SQL-style aggregations for Table API > > > Key: FLINK-2000 > URL: https://issues.apache.org/jira/browse/FLINK-2000 > Project: Flink > Issue Type: Improvement > Components: Table API >Reporter: Aljoscha Krettek >Assignee: Cheng Hao >Priority: Minor > Fix For: 0.9 > > > Right now, the syntax for aggregations is "a.count, a.min" and so on. We > could in addition offer "COUNT(a), MIN(a)" and so on.
[jira] [Commented] (FLINK-2000) Add SQL-style aggregations for Table API
[ https://issues.apache.org/jira/browse/FLINK-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577223#comment-14577223 ] ASF GitHub Bot commented on FLINK-2000: --- Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/782#issuecomment-110005322 I merged it manually but forgot to add the tag that closes this Pull Request. Could you please close it? And also thanks for your work. :+1: Are you interested in doing a more involved extension of the Table API? I have several ideas that could be pursued. > Add SQL-style aggregations for Table API > > > Key: FLINK-2000 > URL: https://issues.apache.org/jira/browse/FLINK-2000 > Project: Flink > Issue Type: Improvement > Components: Table API >Reporter: Aljoscha Krettek >Assignee: Cheng Hao >Priority: Minor > Fix For: 0.9 > > > Right now, the syntax for aggregations is "a.count, a.min" and so on. We > could in addition offer "COUNT(a), MIN(a)" and so on.
[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file
Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110005377 Ok. So I'll remove the first if-then and update this PR.
[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577221#comment-14577221 ] ASF GitHub Bot commented on FLINK-2174: --- Github user mxm commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110005201 No need for a vote. I agree with you that it is better to have only one comment character. The cut command is also ok. > Allow comments in 'slaves' file > --- > > Key: FLINK-2174 > URL: https://issues.apache.org/jira/browse/FLINK-2174 > Project: Flink > Issue Type: Improvement > Components: Start-Stop Scripts >Reporter: Matthias J. Sax >Assignee: Matthias J. Sax >Priority: Trivial > > Currently, each line in 'slaves' is interpreted as a host name. Scripts should > skip lines starting with '#'. Also allow for comments at the end of a line > and skip empty lines.
[GitHub] flink pull request: [FLINK-2000] [table] Add sql style aggregation...
Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/782#issuecomment-110005322 I merged it manually but forgot to add the tag that closes this Pull Request. Could you please close it? And also thanks for your work. :+1: Are you interested in doing a more involved extension of the Table API? I have several ideas that could be pursued.
[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file
Github user mxm commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110005201 No need for a vote. I agree with you that it is better to have only one comment character. The cut command is also ok.
[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577216#comment-14577216 ] ASF GitHub Bot commented on FLINK-2174: --- Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110004370 I see your point, but we need to come to a conclusion. - I like my single-line command better (of course ;)) -- I think it's easier to read than any regex -- and I don't think that the cut process is an issue - the last command you suggested takes the first matching token, thus there are still multiple cut-off characters. Should we start a vote? > Allow comments in 'slaves' file > --- > > Key: FLINK-2174 > URL: https://issues.apache.org/jira/browse/FLINK-2174 > Project: Flink > Issue Type: Improvement > Components: Start-Stop Scripts >Reporter: Matthias J. Sax >Assignee: Matthias J. Sax >Priority: Trivial > > Currently, each line in 'slaves' is interpreted as a host name. Scripts should > skip lines starting with '#'. Also allow for comments at the end of a line > and skip empty lines.
[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file
Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110004370 I see your point, but we need to come to a conclusion. - I like my single-line command better (of course ;)) -- I think it's easier to read than any regex -- and I don't think that the cut process is an issue - the last command you suggested takes the first matching token, thus there are still multiple cut-off characters. Should we start a vote?
[GitHub] flink pull request: [docs] Fix some typos and grammar in the Strea...
Github user mbalassi commented on the pull request: https://github.com/apache/flink/pull/806#issuecomment-110001882 Thanks, @ggevay. I am adding my changes on top of yours, documenting the state amongst other things.
[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577214#comment-14577214 ] ASF GitHub Bot commented on FLINK-2174: --- Github user mxm commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110002485 It can be a backwards-compatibility problem if users used some custom comment character. Probably negligible. Concerning the second if statement: Apparently, there was some feature along the lines of "domain/hostname". This is not documented at all and can be removed. We can create a separate JIRA for that. > Allow comments in 'slaves' file > --- > > Key: FLINK-2174 > URL: https://issues.apache.org/jira/browse/FLINK-2174 > Project: Flink > Issue Type: Improvement > Components: Start-Stop Scripts >Reporter: Matthias J. Sax >Assignee: Matthias J. Sax >Priority: Trivial > > Currently, each line in 'slaves' is interpreted as a host name. Scripts should > skip lines starting with '#'. Also allow for comments at the end of a line > and skip empty lines.
[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file
Github user mxm commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110002485 It can be a backwards-compatibility problem if users used some custom comment character. Probably negligible. Concerning the second if statement: Apparently, there was some feature along the lines of "domain/hostname". This is not documented at all and can be removed. We can create a separate JIRA for that.
[jira] [Commented] (FLINK-2161) Flink Scala Shell does not support external jars (e.g. Gelly, FlinkML)
[ https://issues.apache.org/jira/browse/FLINK-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577207#comment-14577207 ] ASF GitHub Bot commented on FLINK-2161: --- Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/805#issuecomment-110001500 I think this is only half of what is needed. The other half would be sending the jar files along when submitting the job with the modified RemoteEnvironment. > Flink Scala Shell does not support external jars (e.g. Gelly, FlinkML) > -- > > Key: FLINK-2161 > URL: https://issues.apache.org/jira/browse/FLINK-2161 > Project: Flink > Issue Type: Improvement >Reporter: Till Rohrmann >Assignee: Nikolaas Steenbergen > > Currently, there is no easy way to load and ship external libraries/jars with > Flink's Scala shell. Assume that you want to run some Gelly graph algorithms > from within the Scala shell, then you have to put the Gelly jar manually in > the lib directory and make sure that this jar is also available on your > cluster, because it is not shipped with the user code. > It would be good to have a simple mechanism to specify additional jars > upon startup of the Scala shell. These jars should then also be shipped to > the cluster.
[GitHub] flink pull request: [FLINK-2161] modified Scala shell start script...
Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/805#issuecomment-110001500 I think this is only half of what is needed. The other half would be sending the jar files along when submitting the job with the modified RemoteEnvironment.
[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577203#comment-14577203 ] ASF GitHub Bot commented on FLINK-2174: --- Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110001000 And I don't think it's a compatibility problem. '#' is not a valid character for host names, so nobody can have it in any slaves file anyway. > Allow comments in 'slaves' file > --- > > Key: FLINK-2174 > URL: https://issues.apache.org/jira/browse/FLINK-2174 > Project: Flink > Issue Type: Improvement > Components: Start-Stop Scripts >Reporter: Matthias J. Sax >Assignee: Matthias J. Sax >Priority: Trivial > > Currently, each line in 'slaves' is interpreted as a host name. Scripts should > skip lines starting with '#'. Also allow for comments at the end of a line > and skip empty lines.
[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file
Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110001000 And I don't think it's a compatibility problem. '#' is not a valid character for host names, so nobody can have it in any slaves file anyway.
[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file
Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-10868 I did not understand the purpose of the second if-then at all... But the first does not remove comment lines. My introduced code works perfectly fine. IMHO, it's cleaner code and I would remove the first if-then completely and replace it with my single-line cut command.
[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577201#comment-14577201 ] ASF GitHub Bot commented on FLINK-2174: --- Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-10868 I did not understand the purpose of the second if-then at all... But the first does not remove comment lines. My introduced code works perfectly fine. IMHO, it's cleaner code and I would remove the first if-then completely and replace it with my single-line cut command. > Allow comments in 'slaves' file > --- > > Key: FLINK-2174 > URL: https://issues.apache.org/jira/browse/FLINK-2174 > Project: Flink > Issue Type: Improvement > Components: Start-Stop Scripts >Reporter: Matthias J. Sax >Assignee: Matthias J. Sax >Priority: Trivial > > Currently, each line in 'slaves' is interpreted as a host name. Scripts should > skip lines starting with '#'. Also allow for comments at the end of a line > and skip empty lines.
[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file
Github user mxm commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-109998768 My bad. Sorry for bothering. I had slightly modified the example. We could, however, just remove the if statement and it would work without creating an external process: ```bash [[ "hello bli blub 123 353" =~ ^([0-9a-zA-Z/.-]+).*$ ]] SLAVE=${BASH_REMATCH[1]} ``` I agree that it is cleaner to have only one comment character. The only disadvantage I see is that it breaks backwards compatibility. The whole thing needs a bit of cleanup. For example, the "network topology" seems a bit weird and is probably a legacy feature.
[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577196#comment-14577196 ] ASF GitHub Bot commented on FLINK-2174: --- Github user mxm commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-109998768 My bad. Sorry for bothering. I had slightly modified the example. We could, however, just remove the if statement and it would work without creating an external process: ```bash [[ "hello bli blub 123 353" =~ ^([0-9a-zA-Z/.-]+).*$ ]] SLAVE=${BASH_REMATCH[1]} ``` I agree that it is cleaner to have only one comment character. The only disadvantage I see is that it breaks backwards compatibility. The whole thing needs a bit of cleanup. For example, the "network topology" seems a bit weird and is probably a legacy feature. > Allow comments in 'slaves' file > --- > > Key: FLINK-2174 > URL: https://issues.apache.org/jira/browse/FLINK-2174 > Project: Flink > Issue Type: Improvement > Components: Start-Stop Scripts >Reporter: Matthias J. Sax >Assignee: Matthias J. Sax >Priority: Trivial > > Currently, each line in 'slaves' is interpreted as a host name. Scripts should > skip lines starting with '#'. Also allow for comments at the end of a line > and skip empty lines.
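The behavior requested in FLINK-2174 — cut each line of the 'slaves' file at the first '#', trim whitespace, and skip lines that end up empty — can be sketched in Python for illustration (the actual scripts are bash, and the helper name `read_slaves` is hypothetical):

```python
def read_slaves(lines):
    """Parse 'slaves' file lines: '#' starts a comment (either a full
    comment line or a trailing comment), and blank lines are skipped."""
    hosts = []
    for line in lines:
        host = line.split("#", 1)[0].strip()  # cut at first '#', trim
        if host:
            hosts.append(host)
    return hosts
```

Using a single comment character keeps the rule simple: everything from the first '#' to the end of the line is ignored, which is the semantics the discussion above converged on.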
[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library
Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31914466

--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
@@ -0,0 +1,254 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.flink.ml.preprocessing
+
+import breeze.linalg
+import breeze.linalg.{max, min}
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala._
+import org.apache.flink.ml._
+import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
+import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
+import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
+
+import scala.reflect.ClassTag
+
+/** Scales observations, so that all features are in a user-specified range.
+ * By default for [[MinMaxScaler]] transformer range = [0,1].
+ *
+ * This transformer takes a subtype of [[Vector]] of values and maps it to a
+ * scaled subtype of [[Vector]] such that each feature lies between a user-specified range.
+ *
+ * This transformer can be prepended to all [[Transformer]] and
+ * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
+ * of [[Vector]].
+ *
+ * @example
+ * {{{
+ *   val trainingDS: DataSet[Vector] = env.fromCollection(data)
+ *   val transformer = MinMaxScaler().setMin(-1.0)
+ *
+ *   transformer.fit(trainingDS)
+ *   val transformedDS = transformer.transform(trainingDS)
+ * }}}
+ *
+ * =Parameters=
+ *
+ * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0
+ * - [[Max]]: The maximum value of the range of the transformed data set; by default equal to 1
+ */
+class MinMaxScaler extends Transformer[MinMaxScaler] {
+
+  var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None
--- End diff --

Not right now, so these can remain. I was mostly concerned that this parameter was user-facing, meaning the user had to provide Breeze vectors as parameters, but that is not the case.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577193#comment-14577193 ]

ASF GitHub Bot commented on FLINK-1844:

Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31914466 (the review comment on MinMaxScaler.scala quoted in full above):

Not right now, so these can remain. I was mostly concerned that this parameter was user-facing, meaning the user had to provide Breeze vectors as parameters, but that is not the case.

> Add Normaliser to ML library
> ----------------------------
>
>          Key: FLINK-1844
>          URL: https://issues.apache.org/jira/browse/FLINK-1844
>      Project: Flink
>   Issue Type: Improvement
>   Components: Machine Learning Library
>     Reporter: Faye Beligianni
>     Assignee: Faye Beligianni
>     Priority: Minor
>       Labels: ML, Starter
>
> In many algorithms in ML, the features' values would be better to lie between a given range of values, usually in the range (0,1) [1]. Therefore, a {{Transformer}} could be implemented to achieve that normalisation.
> Resources:
> [1] http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
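For readers skimming the archive: the transformation under discussion is plain min-max scaling. The sketch below is a minimal Python illustration of the idea, not the Flink ML API; the function name and the choice to map constant features to the lower bound are my own assumptions.

```python
def min_max_scale(data, new_min=0.0, new_max=1.0):
    """Scale each feature column of `data` (a list of rows) into [new_min, new_max].

    Illustrative sketch only: the real MinMaxScaler computes the per-feature
    min/max vectors in a distributed fit phase before transforming.
    """
    n_features = len(data[0])
    mins = [min(row[j] for row in data) for j in range(n_features)]
    maxs = [max(row[j] for row in data) for j in range(n_features)]
    scaled = []
    for row in data:
        new_row = []
        for j, x in enumerate(row):
            spread = maxs[j] - mins[j]
            # Constant features map to new_min here (avoids division by zero);
            # this is a simplification of this sketch, not Flink's documented behavior.
            ratio = (x - mins[j]) / spread if spread != 0 else 0.0
            new_row.append(ratio * (new_max - new_min) + new_min)
        scaled.append(new_row)
    return scaled
```

The `setMin(-1.0)` call in the scaladoc example corresponds to `min_max_scale(data, new_min=-1.0, new_max=1.0)`, which maps each feature's smallest value to -1.0 and its largest to 1.0.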
[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577192#comment-14577192 ]

ASF GitHub Bot commented on FLINK-2174:

Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-109997612

The second example was a positive example: trailing comments are removed correctly. However, many different characters can act as comment starters (which is not good, from my point of view):

mjsax@T420s-dbis-mjsax:~$ SLAVE="host|comment"; if [[ "$SLAVE" =~ ^([0-9a-zA-Z/.-]+).*$ ]] ; then SLAVE="${BASH_REMATCH[1]}"; fi; echo $SLAVE
host
mjsax@T420s-dbis-mjsax:~$ SLAVE="host@comment"; if [[ "$SLAVE" =~ ^([0-9a-zA-Z/.-]+).*$ ]] ; then SLAVE="${BASH_REMATCH[1]}"; fi; echo $SLAVE
host
mjsax@T420s-dbis-mjsax:~$ SLAVE="host comment"; if [[ "$SLAVE" =~ ^([0-9a-zA-Z/.-]+).*$ ]] ; then SLAVE="${BASH_REMATCH[1]}"; fi; echo $SLAVE
host

> Allow comments in 'slaves' file
> -------------------------------
>
>          Key: FLINK-2174
>          URL: https://issues.apache.org/jira/browse/FLINK-2174
>      Project: Flink
>   Issue Type: Improvement
>   Components: Start-Stop Scripts
>     Reporter: Matthias J. Sax
>     Assignee: Matthias J. Sax
>     Priority: Trivial
>
> Currently, each line in 'slaves' is interpreted as a host name. Scripts should skip lines starting with '#'. Also allow for comments at the end of a line and skip empty lines.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
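The concern in the shell transcript above is that the pattern `^([0-9a-zA-Z/.-]+).*$` keeps only the leading run of hostname characters, so any character outside that class (|, @, a space) effectively starts a comment, while the ticket only asks for '#' comments. A hedged Python sketch contrasting the two behaviors (function names are my own, not from the Flink scripts):

```python
import re

# The permissive pattern from the shell snippet: keep the leading run of
# hostname characters, silently drop everything after it.
PERMISSIVE = re.compile(r"^([0-9a-zA-Z/.-]+).*$")

def parse_permissive(line):
    """Mimic the bash regex: any non-hostname character truncates the line."""
    m = PERMISSIVE.match(line.strip())
    return m.group(1) if m else None

def parse_hash_only(line):
    """Stricter variant matching the ticket's intent: only '#' starts a
    comment (full-line or trailing); blank lines yield None."""
    host = line.split("#", 1)[0].strip()
    return host or None
```

With the permissive variant, `host|comment`, `host@comment`, and `host comment` all parse to `host`, reproducing the transcript; the hash-only variant would pass `host|comment` through unchanged and reject only real '#' comments and blank lines.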