[GitHub] flink issue #2337: [FLINK-3042] [FLINK-3060] [types] Define a way to let typ...
Github user aalexandrov commented on the issue: https://github.com/apache/flink/pull/2337 @greghogan I see, IMHO then the `1.0.0` tag should be dropped from the "Fix versions" field (if possible).
[GitHub] flink issue #2337: [FLINK-3042] [FLINK-3060] [types] Define a way to let typ...
Github user aalexandrov commented on the issue: https://github.com/apache/flink/pull/2337 @twalthr The issue is marked with "Fix versions: 1.0.0, 1.2.0" but I could only find the `TypeInfoFactory` in the `master` branch. Shouldn't this be visible in the `release-1.0` branch as well?
[GitHub] flink issue #1517: [FLINK-3477] [runtime] Add hash-based combine strategy fo...
Github user aalexandrov commented on the issue: https://github.com/apache/flink/pull/1517 @ggevay :+1:
[GitHub] flink pull request: [FLINK-3477] [runtime] Add hash-based combine ...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/1517#issuecomment-208083609 We've summarized the use-case around the hash aggregation experiments in [a blog post on the Peel webpage](http://peel-framework.org/2016/04/07/hash-aggregations-in-flink.html). If you follow the instructions from the **Repeatability** section you should be able to reproduce the results in other environments without too much hassle. I hope that this will be the first of many Flink-related public Peel bundles.
[GitHub] flink pull request: [FLINK-3477] [runtime] Add hash-based combine ...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/1517#issuecomment-202868272 Overall it seems that the hash-based combiner works better than the sort-based one for (a) uniform or normal key distributions, and (b) fixed-length records. For skewed key distributions (like Zipf) the two strategies are practically equal, and for variable-length records the extra effort spent compacting the records offsets the advantages of the hash-based aggregation approach.
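For intuition, here is a toy Scala sketch of the two strategies being compared (an illustration only, not Flink's managed-memory implementation): the hash-based combiner touches each record once via a table probe, while the sort-based one pays for a sort but is insensitive to key skew.

```scala
import scala.collection.mutable

// hash-based combining: one table probe per record, no sorting; favours
// uniform/normal key distributions and fixed-length records
def hashCombine(records: Iterator[(Int, Long)]): Iterator[(Int, Long)] = {
  val table = mutable.HashMap.empty[Int, Long]
  for ((k, v) <- records) table(k) = table.getOrElse(k, 0L) + v
  table.iterator
}

// sort-based combining: sort by key, then fold runs of equal keys; under
// heavy skew (e.g. Zipf keys) this is roughly on par with hashing
def sortCombine(records: Seq[(Int, Long)]): Seq[(Int, Long)] =
  records.sortBy(_._1).foldLeft(List.empty[(Int, Long)]) {
    case ((pk, pv) :: rest, (k, v)) if pk == k => (pk, pv + v) :: rest
    case (acc, kv) => kv :: acc
  }.reverse
```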
[GitHub] flink pull request: [FLINK-3477] [runtime] Add hash-based combine ...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/1517#issuecomment-202864630 @fhueske We have used the Easter break to conduct the experiments. A preliminary writeup is in the Google Doc. @ggevay will provide the results analysis later today. Cheers!
[GitHub] flink pull request: [FLINK-2237] [runtime] Add hash-based combiner...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/1517#issuecomment-187243415 I've added a [Google Doc](https://docs.google.com/document/d/12yx7olVrkooceaQPoR1nkk468lIq0xOObY5ukWuNEcM/edit?usp=sharing) where we can collaborate on the design of the experiments. Once we've settled on that, we will proceed with implementing them. The code [will be available in the `flink-hashagg-experiments` repository](https://github.com/TU-Berlin-DIMA/flink-hashagg-experiments).
[GitHub] flink pull request: [FLINK-2237] [runtime] Add hash-based combiner...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/1517#issuecomment-186257996 @fhueske We will propose three experiments based on your suggestions in a Google Doc on Monday. Once we have fixed the setup, we will prepare a Peel bundle and run them on one of the clusters in the lab. I would also be happy to promote Peel by leading a joint blog post for the Peel website together with @ggevay and you, if you are interested. I think that the hash-table makes for a perfect use-case.
[GitHub] flink pull request: [FLINK-2237] [runtime] Add hash-based combiner...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/1517#issuecomment-186226154 @ggevay I would be interested in helping you prepare a [`peel-flink-bundle`](http://peel-framework.org/getting-started.html) for the benchmarks @fhueske mentioned. It will make for a perfect use-case for what Peel is intended for and a nice first contribution to the [bundles repository](http://peel-framework.org/repository/bundles.html).
[GitHub] flink pull request: [docs] Fix documentation for building Flink wi...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/1260#issuecomment-149848186 @tillrohrmann The list of TODOs from the JIRA issue refers specifically to IntelliJ. Is there an Eclipse user who can check what the corresponding actions for Eclipse would be after I open a PR for that? Also, I think that this belongs in the part of the documentation related to development.
[GitHub] flink pull request: [docs] Fix documentation for building Flink wi...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/1260#issuecomment-149847793 :+1:
[GitHub] flink pull request: [FLINK-2809] [scala-api] Added UnitTypeInfo an...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/1217#issuecomment-148901986

> Not all methods without paremeters should translate to methods without parenthesis...

@StephanEwen I agree with that, but I cannot understand how the `UnitTypeInfo` might cause confusion here. The `TypeInformation` instances are synthesized by the macro based on the inferred collection type, which means that the meaning of `()` is resolved before that. Consider the following example:

```scala
// in the Scala REPL
case class Foo(answer: Int)
// defined class Foo

def f1(): Foo = Foo(42)
// f1: ()Foo

def f2: Foo = Foo(42)
// f2: Foo

val xs = Seq(f1(), f2) // how a literate person would write it
// xs: Seq[Foo] = List(Foo(42), Foo(42))

val xs = Seq(f1, f2)   // how a dazed & confused person would write it, but still compiles
// xs: Seq[Foo] = List(Foo(42), Foo(42))

val xs = Seq(f1, f2()) // even worse, but this breaks with a compiler error
// error: Foo does not take parameters
//        val xs = Seq(f1, f2())

val xs = Seq((), ())   // typing '()' without syntactic context resolves to Unit
// xs: Seq[Unit] = List((), ())
```

In all of the above situations `env.fromCollection(xs)` either (1) typechecks and triggers `TypeInformation` synthesis or (2) fails with the error above. Can you point to a StackOverflow conversation or something similar where the issue you mention is explained with an example?
[GitHub] flink pull request: [docs] Fix documentation for building Flink wi...
GitHub user aalexandrov opened a pull request: https://github.com/apache/flink/pull/1260

[docs] Fix documentation for building Flink with Scala 2.11 or 2.10

This fixes some dead anchor links and aligns the text in "Build Flink for a specific Scala version" with the commands required by the current master.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aalexandrov/flink doc-scalabuild-fix

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/1260.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1260

commit fd86d72394b4dedaf6724398b041655825f0c7d4
Author: Alexander Alexandrov
Date: 2015-10-15T22:21:23Z

    [docs] Fix documentation for building Flink with Scala 2.11 or 2.10
[GitHub] flink pull request: Removed broken dependency to flink-spargel.
GitHub user aalexandrov opened a pull request: https://github.com/apache/flink/pull/1259

Removed broken dependency to flink-spargel.

I think that this should have been removed as part of #1229.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aalexandrov/flink patch-2

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/1259.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1259

commit dff19c4b27f201cfcbe104073005bc9aad2cbb45
Author: Alexander Alexandrov
Date: 2015-10-15T21:18:39Z

    Removed broken dependency to flink-spargel. I think that this should have been removed as part of #1229.
[GitHub] flink pull request: [FLINK-1789] [core] [runtime] [java-api] Allow...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/593#issuecomment-145984551 It could be that the test is not executed for Scala 2.11.
[GitHub] flink pull request: [FLINK-1789] [core] [runtime] [java-api] Allow...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/593#issuecomment-145974916 @mxm The issue is probably related to Scala 2.10, since the two passing builds have `-Dscala-2.11`. I'll investigate tonight if I can get a hold of the PR commits (GitHub is somewhat slow at the moment).
[GitHub] flink pull request: [FLINK-1789] [core] [runtime] [java-api] Allow...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/593#issuecomment-142960455 @mxm The jars in which folder? The main motivation for the pull request is to add the option to specify classpaths from which code generated at runtime can be loaded.
[GitHub] flink pull request: [FLINK-1789] [core] [runtime] [java-api] Allow...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/593#issuecomment-141459804 @rmetzger Regarding the final 0.10: this and the Scala 2.11 compatibility are the two pending issues that make the current Emma master incompatible with vanilla Flink.
[GitHub] flink pull request: [FLINK-1789] [core] [runtime] [java-api] Allow...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/593#issuecomment-141455406 @twalthr @rmetzger is there a chance to include this PR in the 0.10 release?
[GitHub] flink pull request: [FLINK-1297] Added OperatorStatsAccumulator fo...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/605#issuecomment-127558961 @StephanEwen To be honest, I'm kind of puzzled and somewhat annoyed that a PR that 1) adds a feature that has been on the list for at least 2 years, 2) has been tested and evaluated thoroughly, and 3) has been lying around since April and was rebased at least three times is still open. I was also not aware that there is a strict maintainer policy for `flink-contrib` commits. IMHO this is a good place to expose non-critical / exploratory / unstable features that might be of benefit to multiple users of the project. If these turn out to be useful for enough people, I'm sure there will be an incentive to maintain them. Otherwise there is always the option to deprecate and clean up recent additions to the contrib prior to the next release if nobody has picked up on them.
[GitHub] flink pull request: [FLINK-2408] Define all maven properties outsi...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/941#issuecomment-125192692 I guess the "s" in the name is the giveaway...
[GitHub] flink pull request: [FLINK-2408] Define all maven properties outsi...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/941#issuecomment-125189988 If sbt can resolve profile-based dependencies, this should be OK.
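For reference, this is roughly what resolving the Scala-version-specific artifacts looks like from the sbt side (a sketch; the artifact names and the version are illustrative):

```scala
// build.sbt
scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
  // spell out the suffix that matches the Maven artifactId ...
  "org.apache.flink" % "flink-clients_2.11" % "0.10.0",
  // ... or let sbt append the Scala binary version automatically via %%
  "org.apache.flink" %% "flink-scala" % "0.10.0"
)
```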
[GitHub] flink pull request: [FLINK-2408] Define all maven properties outsi...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/941#issuecomment-125182364 Then it should be good to merge.
[GitHub] flink pull request: [FLINK-2408] Define all maven properties outsi...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/941#issuecomment-125180542 Is the removal of the property-based activation (`-Dscala-2.11`) also intentional? I think this might be used in the `.travis.yml`.
[GitHub] flink pull request: [FLINK-2231] Create a Serializer for Scala Enu...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/935#issuecomment-124582468 PS. The third commit fixes a compilation error in IntelliJ when the 'scala_2.11' profile is active.
[GitHub] flink pull request: [FLINK-2231] Create a Serializer for Scala Enu...
GitHub user aalexandrov opened a pull request: https://github.com/apache/flink/pull/935

[FLINK-2231] Create a Serializer for Scala Enumerations.

This closes FLINK-2231. The code should work for all objects which follow [the Enumeration idiom outlined in the ScalaDoc](http://www.scala-lang.org/api/2.11.5/index.html#scala.Enumeration).

The second commit removes the boilerplate code from the `EnumValueComparator` by delegating to an `IntComparator`; you can either discard or squash it while merging, depending on your preference.

Bear in mind the FIXME at line 368 in `TypeAnalyzer.scala`. The commented code is better, but unfortunately doesn't work with Scala 2.10, so I used the FQN workaround.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aalexandrov/flink FLINK-2231

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/935.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #935

commit fd69bda383f6771e87ded1b4b595a395519efd6e
Author: Alexander Alexandrov
Date: 2015-07-24T10:36:17Z

    [FLINK-2231] Create a Serializer for Scala Enumerations.

commit dca03720d090c383f88a57af6808fdbfd2c4ec29
Author: Alexander Alexandrov
Date: 2015-07-24T16:43:14Z

    Delegating EnumValueComparator.
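For context, the Enumeration idiom referenced above looks as follows (`WeekDay` is the illustrative example from the ScalaDoc, not code from this PR):

```scala
object WeekDay extends Enumeration {
  type WeekDay = Value
  val Mon, Tue, Wed, Thu, Fri, Sat, Sun = Value
}

// values of such an enumeration should then serialize in Flink programs, e.g.:
// val days = env.fromElements(WeekDay.Mon, WeekDay.Fri)
```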
[GitHub] flink pull request: [FLINK-2200] Add Flink with Scala 2.11 in Mave...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/885#issuecomment-120423639 Looks very neat!
[GitHub] flink pull request: [FLINK-2200] Add Flink with Scala 2.11 in Mave...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/885#issuecomment-119622710 @rmetzger if you can hack your way with regexes in IntelliJ / Eclipse, most of the work should be handled by the IDE.
[GitHub] flink pull request: [FLINK-2311] Fixes 'flink-*' dependency scope ...
Github user aalexandrov closed the pull request at: https://github.com/apache/flink/pull/880
[GitHub] flink pull request: [FLINK-2200] Add Flink with Scala 2.11 in Mave...
Github user aalexandrov commented on a diff in the pull request: https://github.com/apache/flink/pull/885#discussion_r33858475

--- Diff: docs/apis/programming_guide.md ---
@@ -187,7 +187,17 @@ that creates the type information for Flink operations.
+ Scala Dependency Versions
+Because Scala 2.10 binary is not compatible with Scala 2.11 binary, we provide multiple artifacts
+to support both Scala versions. If you want to run your program on Flink with Scala 2.11, you need
+to add a suffix `_2.11` to all Flink artifact ids in your dependencies. You should be careful with
+this difference of artifact id. All modules with Scala 2.11 have a suffix `_2.11` in artifact id.
+For example, `flink-java` should be changed to `flink-java_2.11` and `flink-clients` should be
+changed to `flink-clients_2.11`.
--- End diff --

Maybe a list of all modules indicating which require a suffix and which do not would be helpful (or you can just link to Maven Central).
[GitHub] flink pull request: [FLINK-2200] Add Flink with Scala 2.11 in Mave...
Github user aalexandrov commented on a diff in the pull request: https://github.com/apache/flink/pull/885#discussion_r33858383

--- Diff: docs/apis/programming_guide.md ---
@@ -187,7 +187,17 @@ that creates the type information for Flink operations.
+ Scala Dependency Versions
+Because Scala 2.10 binary is not compatible with Scala 2.11 binary, we provide multiple artifacts
+to support both Scala versions. If you want to run your program on Flink with Scala 2.11, you need
+to add a suffix `_2.11` to all Flink artifact ids in your dependencies. You should be careful with
+this difference of artifact id. All modules with Scala 2.11 have a suffix `_2.11` in artifact id.
+For example, `flink-java` should be changed to `flink-java_2.11` and `flink-clients` should be
+changed to `flink-clients_2.11`.
--- End diff --

Because Scala 2.10 binary is not compatible with Scala 2.11 binary, we provide multiple artifacts to support both Scala versions. Starting from the 0.10 line, we cross-build all Scala-dependent Flink modules for both 2.10 and 2.11. If you want to run your program on Flink with Scala 2.11, you need to add a `_2.11` suffix to the `artifactId` values of the Scala-dependent Flink modules in your `dependencies` section.
[GitHub] flink pull request: [FLINK-2200] Add Flink with Scala 2.11 in Mave...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/885#issuecomment-118326627

I think that a cleaner solution would be to:

1. Add a `scala.suffix` property to the Scala profiles. This should be set to the empty string for 2.10 and to `_${scala.binary.version}` for 2.11.
2. Append the suffix to the required `artifactId` elements. For example, instead of

   ```xml
   <artifactId>${flink.clients.artifactId}</artifactId>
   ```

   you would write

   ```xml
   <artifactId>flink-clients${scala.suffix}</artifactId>
   ```

IMHO this makes everything a bit easier to read and offers a straightforward path to migrate to the "best practice" of having a suffix for all profiles in the future.
[GitHub] flink pull request: [FLINK-2200] Add Flink with Scala 2.11 in Mave...
Github user aalexandrov commented on a diff in the pull request: https://github.com/apache/flink/pull/885#discussion_r33857654

--- Diff: docs/apis/programming_guide.md ---
@@ -136,7 +136,7 @@ mvn archetype:generate \
-The archetypes are working for stable releases and preview versions (`-SNAPSHOT`)
+The archetypes are working for stable releases and preview versions. (`-SNAPSHOT`)
--- End diff --

The dot should be after (`-SNAPSHOT`).
[GitHub] flink pull request: [FLINK-2311] Fixes 'flink-*' dependency scope ...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/880#issuecomment-118083599 OK, then we can close the issue as wontfix.
[GitHub] flink pull request: [FLINK-2311] Fixes 'flink-*' dependency scope ...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/880#issuecomment-118066010

OK, thanks for the remark. Although somewhat verbose, this solves my concrete issue. I wonder if the list can be exclusive, e.g.

```xml
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-tweet-inputformat</artifactId>
    <version>${project.version}</version>
</dependency>
```

However, it still leaves the question what policy to use for `flink-*` based dependencies. As an example, if I want to include `flink-streaming-contrib` in a module that is built as a "thin" jar, transitive `flink-*` dependencies that will be packaged in the `flink-dist` fat jar do not need to be included. I therefore suggest to at least make this consistent [with the quickstart poms](https://github.com/apache/flink/blob/master/flink-quickstart/flink-quickstart-java/src/main/resources/archetype-resources/pom.xml) and [optionally set the scope of already included dependencies to `provided` if the `build-jar` profile is activated](https://github.com/apache/flink/blob/master/flink-quickstart/flink-quickstart-java/src/main/resources/archetype-resources/pom.xml#L317).
[GitHub] flink pull request: [FLINK-2311] Fixes 'flink-*' dependency scope ...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/880#issuecomment-118063336 The PR is still open and can be adapted.
[GitHub] flink pull request: [FLINK-2311] Fixes 'flink-*' dependency scope ...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/880#issuecomment-118036089 @StephanEwen can you tell me what package doesn't work so I can test it out?
[GitHub] flink pull request: [FLINK-2311] Fixes 'flink-*' dependency scope ...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/880#issuecomment-118035911

@mjsax exactly. Let's say at the moment I land in the following situation:

1. I want to create a job which includes the [statistics collection facility proposed by @tammymendt](https://issues.apache.org/jira/browse/FLINK-1297).
2. In order to do that, my project needs to set `flink-contrib` with scope `compile` and use it in the final artifact (even if I am building a *thin* jar which is going to be submitted to a standalone Flink cluster).
3. The resulting jar turns out to be ~64 MB. Out of that, half of the bits are taken by code that is *guaranteed* to be on the classpath at runtime by virtue of being included in the `$FLINK_DIR/lib/flink-dist-$VERSION.jar`.
4. Setting all transitive `flink-*` dependencies via `flink-contrib` as `provided` reduces the size to ~32 MB by default (see the sketch below).
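For comparison, the same idea expressed in sbt terms (a hypothetical sketch; the artifact names and the version are illustrative): `provided` keeps a dependency on the compile classpath but out of a fat jar built with sbt-assembly, mirroring the `build-jar` profile of the quickstart poms.

```scala
// build.sbt
val flinkVersion = "0.10.0" // illustrative

libraryDependencies ++= Seq(
  // needed inside the user jar, since it is not part of flink-dist
  "org.apache.flink" % "flink-contrib" % flinkVersion,
  // guaranteed to be on the cluster classpath via flink-dist, hence provided
  "org.apache.flink" % "flink-core"    % flinkVersion % "provided",
  "org.apache.flink" % "flink-clients" % flinkVersion % "provided"
)
```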
[GitHub] flink pull request: [FLINK-2311] Fixes 'flink-*' dependency scope ...
GitHub user aalexandrov opened a pull request: https://github.com/apache/flink/pull/880

[FLINK-2311] Fixes 'flink-*' dependency scope in flink-contrib.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aalexandrov/flink FLINK-2311

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/880.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #880

commit f6e5dce4dfdd5f0cf1c847171fd48df010232790
Author: Alexander Alexandrov
Date: 2015-07-02T07:25:03Z

    [FLINK-2311] Fixes 'flink-*' dependency scope in flink-contrib.
[GitHub] flink pull request: [Flink-1999] basic TfidfTransformer
Github user aalexandrov commented on a diff in the pull request: https://github.com/apache/flink/pull/730#discussion_r32620391

--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/feature/TfIdfTransformer.scala ---
@@ -0,0 +1,128 @@
+/* (Apache License 2.0 header) */
+
+package org.apache.flink.ml.feature
+
+import org.apache.flink.api.scala.{DataSet, _}
+import org.apache.flink.ml.common.{Parameter, ParameterMap, Transformer}
+import org.apache.flink.ml.math.SparseVector
+
+import scala.collection.mutable.LinkedHashSet
+import scala.math.log;
+
+/**
+ * This transformer calcuates the term-frequency times inverse document frequency for the give DataSet of documents.
+ * The DataSet will be treated as the corpus of documents. The single words will be filtered against the regex:
+ *
+ *   (?u)\b\w\w+\b
+ *
+ * The TF is the frequence of a word inside one document
+ *
+ * The IDF for a word is calculated: log(total number of documents / documents that contain the word) + 1
--- End diff --

I think this was aligned with the behaviour observed in scikit-learn.
[GitHub] flink pull request: [FLINK-1735] Feature Hasher
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/665#issuecomment-106914608 @FelixNeutatz I think this needs to be squashed before merging as well.
[GitHub] flink pull request: [FLINK-1731] [ml] Implementation of Feature K-...
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/700#issuecomment-106911412 Can anybody with more Apache insight respond to @peedeeX21's concerns? Otherwise I suggest merging this and opening a follow-up issue that extends the current implementation to KMeans++.
[GitHub] flink pull request: basic TfidfTransformer
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/730#issuecomment-106910157 @rbraeunlich please squash the commits and prepare this for merging.
[GitHub] flink pull request: [FLINK-1979] Lossfunctions
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/656#issuecomment-106907665 This needs a PR before it is merged. @jojo19893, @mguldner please cherry-pick your changes such that you have only one or two commits and force-push to this branch, as suggested by @tillrohrmann in the JIRA.
[GitHub] flink pull request: basic TfidfTransformer
Github user aalexandrov commented on a diff in the pull request: https://github.com/apache/flink/pull/730#discussion_r31114073

--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/feature/TfIdfTransformer.scala ---
@@ -0,0 +1,107 @@
+package org.apache.flink.ml.feature
+
+import org.apache.flink.api.scala.{DataSet, _}
+import org.apache.flink.ml.common.{Parameter, ParameterMap, Transformer}
+import org.apache.flink.ml.math.SparseVector
+
+import scala.math.log
+import scala.util.hashing.MurmurHash3
+import scala.collection.mutable.LinkedHashSet;
+
+/**
+ * This transformer calcuates the term-frequency times inverse document frequency for the give DataSet of documents.
+ * The DataSet will be treated as the corpus of documents.
+ *
+ * The TF is the frequence of a word inside one document
+ *
+ * The IDF for a word is calculated: log(total number of documents / documents that contain the word) + 1
+ *
+ * This transformer returns a SparseVector where the index is the hash of the word and the value the tf-idf.
+ * @author Ronny Bräunlich
+ * @author Vassil Dimov
+ * @author Filip Perisic
+ */
+class TfIdfTransformer extends Transformer[(Int, Seq[String]), (Int, SparseVector)] {
+
+  override def transform(input: DataSet[(Int /* docId */, Seq[String] /* The document */)], transformParameters: ParameterMap): DataSet[(Int, SparseVector)] = {
+
+    val params = transformParameters.get(StopWordParameter)
+
+    // Here we will store the words in he form (docId, word, count)
+    // Count represent the occurrence of "word" in document "docId"
+    val wordCounts = input
+      // count the words
+      .flatMap(t => {
+        // create tuples docId, word, 1
+        t._2.map(s => (t._1, s, 1))
+      })
+      .filter(t => !transformParameters.apply(StopWordParameter).contains(t._2))
+      // group by document and word
+      .groupBy(0, 1)
+      // calculate the occurrence count of each word in specific document
+      .sum(2)
+
+    val dictionary = wordCounts
+      .map(t => LinkedHashSet(t._2))
+      .reduce((set1, set2) => set1 ++ set2)
+      .map(set => set.zipWithIndex.toMap)
+      .flatMap(m => m.toList)
+
+    val numberOfWords = wordCounts
+      .map(t => (t._2))
+      .distinct(t => t)
+      .map(t => 1)
+      .reduce(_ + _);
+
+    val idf: DataSet[(String, Double)] = calculateIDF(wordCounts)
+    val tf: DataSet[(Int, String, Int)] = wordCounts
+
+    // docId, word, tfIdf
+    val tfIdf = tf.join(idf).where(1).equalTo(0) {
+      (t1, t2) => (t1._1, t1._2, t1._3.toDouble * t2._2)
+    }
+
+    val res = tfIdf.crossWithTiny(numberOfWords)
+      // docId, word, tfIdf, numberOfWords
+      .map(t => (t._1._1, t._1._2, t._1._3, t._2))
+      // assign every word its position
+      .joinWithHuge(dictionary).where(1).equalTo(0)
--- End diff --

Use `.joinWithTiny` or `.join` or a broadcast variable.
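A sketch of the broadcast-variable alternative from the comment above, assuming the dictionary is small enough to be replicated to every task (the tuple shapes follow the PR; the helper name is made up):

```scala
import scala.collection.JavaConverters._
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

def withDictionary(
    tfIdf: DataSet[(Int, String, Double)],
    dictionary: DataSet[(String, Int)]): DataSet[(Int, Int, Double)] =
  tfIdf
    .map(new RichMapFunction[(Int, String, Double), (Int, Int, Double)] {
      private var dict: Map[String, Int] = _

      override def open(parameters: Configuration): Unit = {
        // the small dictionary side is shipped to every task once
        dict = getRuntimeContext
          .getBroadcastVariable[(String, Int)]("dictionary")
          .asScala.toMap
      }

      // docId, word position, tfIdf
      override def map(t: (Int, String, Double)): (Int, Int, Double) =
        (t._1, dict(t._2), t._3)
    })
    .withBroadcastSet(dictionary, "dictionary")
```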
[GitHub] flink pull request: basic TfidfTransformer
Github user aalexandrov commented on a diff in the pull request: https://github.com/apache/flink/pull/730#discussion_r31113926

--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/feature/TfIdfTransformer.scala ---
@@ -0,0 +1,107 @@
+    // ... (same file as in the previous diff, up to the dictionary pipeline)
+    val dictionary = wordCounts
+      .map(t => LinkedHashSet(t._2))
+      .reduce((set1, set2) => set1 ++ set2)
+      .map(set => set.zipWithIndex.toMap)
--- End diff --

Make sure that you have exactly one `map ; flatMap` chain.
[GitHub] flink pull request: basic TfidfTransformer
Github user aalexandrov commented on a diff in the pull request: https://github.com/apache/flink/pull/730#discussion_r31113442

--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/feature/TfIdfTransformer.scala ---
@@ -0,0 +1,107 @@
+    // ... (same file as in the first diff above)
+    val dictionary = wordCounts
+      .map(t => LinkedHashSet(t._2))
+      .reduce((set1, set2) => set1 ++ set2)
--- End diff --

Better use `groupReduce`, which will give you the materialized set right away.
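What the suggested `groupReduce` variant might look like (a sketch; `wordCounts` is the `(docId, word, count)` data set from the diff, and the single non-grouped `reduceGroup` deliberately materializes all distinct words in one task):

```scala
val dictionary: DataSet[(String, Int)] =
  wordCounts
    .map(_._2)   // keep only the word
    .distinct()
    // one group reduce over all words: materialize the set once and
    // assign every word its dictionary position
    .reduceGroup(words => words.zipWithIndex.toSeq)
    .flatMap(pairs => pairs)
```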
[GitHub] flink pull request: [FLINK-1735] Feature Hasher
Github user aalexandrov commented on the pull request: https://github.com/apache/flink/pull/665#issuecomment-100685242 Nice job! :+1:
[GitHub] flink pull request: [FLINK-1735] Feature Hasher
Github user aalexandrov commented on a diff in the pull request: https://github.com/apache/flink/pull/665#discussion_r30005092

--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/feature/extraction/FeatureHasher.scala ---
@@ -0,0 +1,142 @@
+/* (Apache License 2.0 header) */
+
+package org.apache.flink.ml.feature.extraction
+
+import java.nio.charset.Charset
+
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common.{Parameter, ParameterMap, Transformer}
+import org.apache.flink.ml.feature.extraction.FeatureHasher.{NonNegative, NumFeatures}
+import org.apache.flink.ml.math.{Vector, SparseVector}
+
+import scala.util.hashing.MurmurHash3
+
+/** This transformer turns sequences of symbolic feature names (strings) into
+  * flink.ml.math.SparseVectors, using a hash function to compute the matrix column corresponding
+  * to a name. Aka the hashing trick.
+  * The hash function employed is the signed 32-bit version of Murmurhash3.
+  *
+  * By default for [[FeatureHasher]] transformer numFeatures=2^20 and nonNegative=false.
+  *
+  * This transformer takes a [[Seq]] of strings and maps it to a feature [[Vector]].
+  *
+  * This transformer can be prepended to all [[Transformer]] and
+  * [[org.apache.flink.ml.common.Learner]] implementations which expect an input of [[Vector]].
+  *
+  * @example
+  * {{{
+  *   val trainingDS: DataSet[Seq[String]] = env.fromCollection(data)
+  *   val transformer = FeatureHasher().setNumFeatures(65536).setNonNegative(false)
+  *
+  *   transformer.transform(trainingDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[FeatureHasher.NumFeatures]]: The number of features (entries) in the output vector;
+  *   by default equal to 2^20
+  * - [[FeatureHasher.NonNegative]]: Whether output vector should contain non-negative values only.
+  *   When True, output values can be interpreted as frequencies. When False, output values will
+  *   have expected value zero; by default equal to false
+  */
+class FeatureHasher extends Transformer[Seq[String], Vector] with Serializable {
+
+  // The seed used to initialize the hasher
+  val Seed = 0
+
+  /** Sets the number of features (entries) in the output vector
+    *
+    * @param numFeatures the user-specified numFeatures value. In case the user gives a value less
+    *                    than 1, numFeatures is set to its default value: 2^20
+    * @return the FeatureHasher instance with its numFeatures value set to the user-specified value
+    */
+  def setNumFeatures(numFeatures: Int): FeatureHasher = {
+    // number of features must be greater zero
+    if (numFeatures < 1) {
+      return this
--- End diff --

This might cause a small debugging hell. Throw a `RuntimeException` or at least log a `WARN` message here.
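A sketch of the suggested fail-fast variant (`require` throws an `IllegalArgumentException`, which is a `RuntimeException`):

```scala
def setNumFeatures(numFeatures: Int): FeatureHasher = {
  // fail fast instead of silently falling back to the default
  require(numFeatures >= 1, s"numFeatures must be at least 1, but was $numFeatures")
  parameters.add(NumFeatures, numFeatures)
  this
}
```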
[GitHub] flink pull request: [FLINK-1735] Feature Hasher
Github user aalexandrov commented on a diff in the pull request: https://github.com/apache/flink/pull/665#discussion_r30005080

--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/feature/extraction/FeatureHasher.scala ---
@@ -0,0 +1,142 @@
+// ... (header, imports, and class scaladoc as in the previous diff)
+
+  def setNumFeatures(numFeatures: Int): FeatureHasher = {
+    // number of features must be greater zero
+    if (numFeatures < 1) {
+      return this
+    }
+    parameters.add(NumFeatures, numFeatures)
+    this
+  }
+
+  /** Sets whether output vector should contain non-negative values only
+    *
+    * @param nonNegative the user-specified nonNegative value.
+    * @return the FeatureHasher instance with its nonNegative value set to the user-specified value
+    */
+  def setNonNegative(nonNegative: Boolean): FeatureHasher = {
+    parameters.add(NonNegative, nonNegative)
+    this
+  }
+
+  override def transform(input: DataSet[Seq[String]], parameters: ParameterMap):
+      DataSet[Vector] = {
+    val resultingParameters = this.parameters ++ parameters
+
+    val nonNegative = resultingParameters(NonNegative)
+    val numFeatures = resultingParameters(NumFeatures)
+
+    // each item of the sequence is hashed and transformed into a tuple (index, value)
+    input.map {
+      inputSeq => {
+        val entries = inputSeq.map {
+          s => {
+            // unicode strings are converted to utf-8
+            // bytesHash is faster than arrayHash, becau... [the entry is truncated in the archive]
[GitHub] flink pull request: [FLINK-1735] Feature Hasher
Github user aalexandrov commented on a diff in the pull request:

https://github.com/apache/flink/pull/665#discussion_r30005062

--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/feature/extraction/FeatureHasher.scala ---

@@ -0,0 +1,142 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.feature.extraction
+
+import java.nio.charset.Charset
+
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common.{Parameter, ParameterMap, Transformer}
+import org.apache.flink.ml.feature.extraction.FeatureHasher.{NonNegative, NumFeatures}
+import org.apache.flink.ml.math.{Vector, SparseVector}
+
+import scala.util.hashing.MurmurHash3
+
+
+/** This transformer turns sequences of symbolic feature names (strings) into
+  * flink.ml.math.SparseVectors, using a hash function to compute the matrix column corresponding
+  * to a name. Aka the hashing trick.
+  * The hash function employed is the signed 32-bit version of MurmurHash3.
+  *
+  * By default the [[FeatureHasher]] transformer uses numFeatures=2^20 and nonNegative=false.
+  *
+  * This transformer takes a [[Seq]] of strings and maps it to a
+  * feature [[Vector]].
+  *
+  * This transformer can be prepended to all [[Transformer]] and
+  * [[org.apache.flink.ml.common.Learner]] implementations which expect an input of
+  * [[Vector]].
+  *
+  * @example
+  * {{{
+  *   val trainingDS: DataSet[Seq[String]] = env.fromCollection(data)
+  *   val transformer = FeatureHasher().setNumFeatures(65536).setNonNegative(false)
+  *
+  *   transformer.transform(trainingDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[FeatureHasher.NumFeatures]]: The number of features (entries) in the output vector;
+  *   by default equal to 2^20
+  * - [[FeatureHasher.NonNegative]]: Whether the output vector should contain non-negative values
+  *   only. When true, output values can be interpreted as frequencies. When false, output values
+  *   will have expected value zero; by default equal to false
+  */
+class FeatureHasher extends Transformer[Seq[String], Vector] with Serializable {
+
+  // The seed used to initialize the hasher
+  val Seed = 0
+
+  /** Sets the number of features (entries) in the output vector
+    *
+    * @param numFeatures the user-specified numFeatures value. In case the user gives a value less
+    *                    than 1, numFeatures is set to its default value: 2^20
+    * @return the FeatureHasher instance with its numFeatures value set to the user-specified value
+    */
+  def setNumFeatures(numFeatures: Int): FeatureHasher = {
+    // the number of features must be greater than zero
+    if (numFeatures < 1) {
+      return this
+    }
+    parameters.add(NumFeatures, numFeatures)
+    this
+  }
+
+  /** Sets whether the output vector should contain non-negative values only
+    *
+    * @param nonNegative the user-specified nonNegative value.
+    * @return the FeatureHasher instance with its nonNegative value set to the user-specified value
+    */
+  def setNonNegative(nonNegative: Boolean): FeatureHasher = {
+    parameters.add(NonNegative, nonNegative)
+    this
+  }
+
+  override def transform(input: DataSet[Seq[String]], parameters: ParameterMap):
+      DataSet[Vector] = {
+    val resultingParameters = this.parameters ++ parameters
+
+    val nonNegative = resultingParameters(NonNegative)
+    val numFeatures = resultingParameters(NumFeatures)
+
+    // each item of the sequence is hashed and transformed into a tuple (index, value)
+    input.map {
+      inputSeq => {
+        val entries = inputSeq.map {
+          s => {
+            // unicode strings are converted to utf-8
+            // bytesHash is faster than arrayHash, becau
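The diff above is cut off just as the hashing logic begins. As a reference point for readers, here is a minimal sketch of the signed hashing trick the quoted code appears to implement; the exact `(index, value)` derivation is an assumption based on the scikit-learn convention the scaladoc alludes to, not the literal expressions from the PR:

```scala
import java.nio.charset.Charset

import scala.util.hashing.MurmurHash3

// Maps one feature name to an (index, value) pair.
def hashFeature(s: String, numFeatures: Int, seed: Int = 0): (Int, Double) = {
  // unicode strings are converted to utf-8; bytesHash is cheaper than hashing a char array
  val hash = MurmurHash3.bytesHash(s.getBytes(Charset.forName("UTF-8")), seed)
  // a non-negative modulo picks the output column
  val index = ((hash % numFeatures) + numFeatures) % numFeatures
  // the sign of the hash determines the value, so colliding features partially cancel out
  val value = if (hash >= 0) 1.0 else -1.0
  (index, value)
}
```

Entries that collide on the same index are then summed into the resulting `SparseVector`; presumably, when `NonNegative` is set, absolute values are taken afterwards so the outputs can be read as frequencies.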
[GitHub] flink pull request: [FLINK-1735] Feature Hasher
Github user aalexandrov commented on a diff in the pull request:

https://github.com/apache/flink/pull/665#discussion_r30004988

--- Diff: flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/feature/extraction/FeatureHasherSuite.scala ---

@@ -0,0 +1,245 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.feature.extraction
+
+import org.apache.flink.api.scala.{ExecutionEnvironment, _}
+import org.apache.flink.ml.math.SparseVector
+import org.apache.flink.test.util.FlinkTestBase
+import org.scalatest.{FlatSpec, Matchers}
+
+class FeatureHasherSuite
+  extends FlatSpec
+  with Matchers
+  with FlinkTestBase {
+
+  behavior of "Flink's Feature Hasher"
+
+  import FeatureHasherData._
+
+  it should "transform a sequence of strings into a sparse feature vector of given size" in {
+    val env = ExecutionEnvironment.getExecutionEnvironment
+
+    env.setParallelism(2)
+
+    for (numFeatures <- numFeaturesTest) {
+      val inputDS = env.fromCollection(input)
+
+      val transformer = FeatureHasher()
+        .setNumFeatures(numFeatures)
+
+      val transformedDS = transformer.transform(inputDS)
+      val results = transformedDS.collect()
+
+      for ((result, expectedResult) <- results zip expectedResults(numFeatures)) {
+        result.equalsVector(expectedResult) should be(true)
+      }
+    }
+  }
+
+  it should "transform a sequence of strings into a sparse feature vector of given size," +
+    "with non negative entries" in {
+    val env = ExecutionEnvironment.getExecutionEnvironment
+
+    env.setParallelism(2)
+
+    for (numFeatures <- numFeaturesTest) {
+      val inputDS = env.fromCollection(input)
+
+      val transformer = FeatureHasher()
+        .setNumFeatures(numFeatures).setNonNegative(true)
+
+      val transformedDS = transformer.transform(inputDS)
+      val results = transformedDS.collect()
+
+      for ((result, expectedResult) <- results zip expectedResultsNonNegative(numFeatures)) {
+        result.equalsVector(expectedResult) should be(true)
+      }
+    }
+  }
+
+  it should "transform a sequence of strings into a sparse feature vector of default size," +
+    " when parameter is less than 1" in {
+    val env = ExecutionEnvironment.getExecutionEnvironment
+
+    env.setParallelism(2)
+
+    val inputDS = env.fromCollection(input)
+
+    val numFeatures = 0
+
+    val transformer = FeatureHasher()
+      .setNumFeatures(numFeatures).setNonNegative(false)
+
+    val transformedDS = transformer.transform(inputDS)
+    val results = transformedDS.collect()
+
+    for (result <- results) {
+      result.size should equal(Math.pow(2, 20).toInt)
+    }
+  }
+}
+
+object FeatureHasherData {
+
+  val input = Seq(
+    "Two households both alike in dignity".split(" ").toSeq,
+    "In fair Verona where we lay our scene".split(" ").toSeq,
+    "From ancient grudge break to new mutiny".split(" ").toSeq,
+    "Where civil blood makes civil hands unclean".split(" ").toSeq,
+    "From forth the fatal loins of these two foes".split(" ").toSeq
+  )
+
+  /* 2^30 features can't be tested right now because the implementation of Vector.equalsVector
+     performs an index-wise comparison on the two vectors, which takes approx. forever */
+  val numFeaturesTest = Seq(Math.pow(2, 4).toInt, Math.pow(2, 5).toInt, 1234,
+    Math.pow(2, 16).toInt, Math.pow(2, 20).toInt) //, Math.pow(2, 30).toInt)
+
+  val expectedResults = List(
+    16 -> List(
+      SparseVector.fromCOO(16, Map((0, 1.0), (1, 1.0), (2, -1.0), (14, -1.0))),
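A small aside on the `2^30` caveat in the quoted test data: comparing the COO structure of the two vectors directly would be O(nnz) rather than O(size), which would make the large case testable. A hedged sketch, where the `size`/`indices`/`data` member names are assumptions about `flink.ml.math.SparseVector` rather than verified against this PR:

```scala
import java.util.Arrays

import org.apache.flink.ml.math.SparseVector

// Structural equality on the sparse representation: cost proportional to the
// number of stored entries, independent of the nominal vector size.
def structurallyEqual(a: SparseVector, b: SparseVector): Boolean =
  a.size == b.size &&
    Arrays.equals(a.indices, b.indices) &&
    Arrays.equals(a.data, b.data)
```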
[GitHub] flink pull request: [FLINK-1789] [core] [runtime] [java-api] Allow...
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/593#issuecomment-95559261

@aljoscha I'm sorry, I cannot follow. Can you elaborate?

The idea is to add support for proper folders next to jars when opening an execution environment. Since folders cannot be handled the same way as jars during execution (i.e., serialized and shipped around by the blob manager), the assumption for folder paths is that they are accessible from all agents in the distributed runtime (JobManager + TaskManagers), e.g., via shared NFS folders.
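To make the intended usage concrete, a hedged sketch from the client side; the host, port, and paths below are hypothetical placeholders, and passing a folder path among the jar files is precisely what this PR proposes:

```scala
import org.apache.flink.api.scala._

// The jar is serialized and shipped by the blob manager; the folder is only
// registered and must already be visible to the JobManager and all TaskManagers,
// e.g., on a shared NFS mount.
val env = ExecutionEnvironment.createRemoteEnvironment(
  "jobmanager-host",      // hypothetical JobManager host
  6123,                   // JobManager RPC port
  "target/my-job.jar",    // shipped via the blob manager
  "/mnt/nfs/shared-libs"  // folder: assumed accessible on all agents
)
```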
[GitHub] flink pull request: [FLINK-1297] Added OperatorStatsAccumulator fo...
Github user aalexandrov commented on a diff in the pull request:

https://github.com/apache/flink/pull/605#discussion_r28818529

--- Diff: flink-contrib/src/test/java/org/apache/flink/contrib/operatorstatistics/OperatorStatsAccumulatorsTest.java ---

@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.contrib.operatorstatistics;
+
+import org.apache.flink.api.common.JobExecutionResult;
+import org.apache.flink.api.common.accumulators.Accumulator;
+import org.apache.flink.api.common.functions.RichFlatMapFunction;
+import org.apache.flink.api.java.ExecutionEnvironment;
+import org.apache.flink.api.java.io.DiscardingOutputFormat;
+import org.apache.flink.api.java.tuple.Tuple1;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.test.util.AbstractTestBase;
+import org.apache.flink.util.Collector;
+import org.junit.Assert;
+import org.junit.Test;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.Serializable;
+import java.util.Map;
+import java.util.Random;
+
+public class OperatorStatsAccumulatorsTest extends AbstractTestBase {
+
+    private static final Logger LOG = LoggerFactory.getLogger(OperatorStatsAccumulatorsTest.class);
+
+    private static final String ACCUMULATOR_NAME = "op-stats";
+
+    public OperatorStatsAccumulatorsTest() {
+        super(new Configuration());
+    }
+
+    @Test
+    public void testAccumulator() throws Exception {
+
+        String input = "";
+
+        Random rand = new Random();
+
+        for (int i = 1; i < 1000; i++) {
+            if (rand.nextDouble() < 0.2) {
+                input += String.valueOf(rand.nextInt(5)) + "\n";
+            } else {
+                input += String.valueOf(rand.nextInt(100)) + "\n";
+            }
+        }
+
+        String inputFile = createTempFile("datapoints.txt", input);
+
+        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
+
+        env.readTextFile(inputFile).
+                flatMap(new StringToInt()).
+                output(new DiscardingOutputFormat<Tuple1<Integer>>());
+
+        JobExecutionResult result = env.execute();
+
+        OperatorStatistics globalStats = result.getAccumulatorResult(ACCUMULATOR_NAME);
+        LOG.debug("Global Stats");
+        LOG.debug(globalStats.toString());
+
+        OperatorStatistics merged = null;
+
+        Map<String, Object> accResults = result.getAllAccumulatorResults();
+        for (String accumulatorName : accResults.keySet()) {
+            if (accumulatorName.contains(ACCUMULATOR_NAME + "-")) {
+                OperatorStatistics localStats = (OperatorStatistics) accResults.get(accumulatorName);
+                if (merged == null) {
+                    merged = localStats.clone();
+                } else {
+                    merged.merge(localStats);
+                }
+                LOG.debug("Local Stats: " + accumulatorName);
+                LOG.debug(localStats.toString());
+            }
+        }
+
+        Assert.assertEquals(globalStats.cardinality, 999);
+        Assert.assertEquals(globalStats.estimateCountDistinct(), 100);
+        Assert.assertTrue(globalStats.getHeavyHitters().size() > 0 && globalStats.getHeavyHitters().size() <= 5);
+        Assert.assertEquals(merged.getMin(), globalStats.getMin());
+        Assert.assertEquals(merged.getMax(), globalStats.getMax());
+        Assert.assertEquals(merged.estimateCountDistinct(), globalStats.estima
[GitHub] flink pull request: [FLINK-1297] Added OperatorStatsAccumulator fo...
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/605#issuecomment-93719337

:+1: thanks for the great work! I'll review that (probably over the weekend) and would appreciate it if some of the core Flink committers (@sewen, @rmetzger, @fhueske) could also make a pass over the code.

One more caveat from me: this implements only the runtime aspect of the statistics-collecting logic. A second PR should follow which allows configuring, programmatically as part of the DataBag API, the points where statistics should be tracked. @tammymendt and I were discussing a syntax along the lines of:

```scala
A = ... // some dataflow assembly code
A.withStatistics("statsForX", keySelectorFn)

env.execute()

// grab the statistics after the execution is done
env.getAccumulator("statsForX")
```

Once this is in place we will play around and implement some ideas on incremental optimization.
[GitHub] flink pull request: [FLINK-1789] [core] [runtime] [java-api] Allow...
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/593#issuecomment-92841182

:clap:
[GitHub] flink pull request: Fixup ReusingKeyGroupedIterator value iterator...
GitHub user aalexandrov opened a pull request:

https://github.com/apache/flink/pull/559

Fixup ReusingKeyGroupedIterator value iterators in NonReusingSortMergeCoGroupIterator.

Correct me if I'm wrong, but I think that the NonReusingSortMergeCoGroupIterator should construct value iterators which are non-reusing as well; otherwise you get unexpected behavior like this:

```java
public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    env.getConfig().disableObjectReuse();

    in1.coGroup(in2).where("name").equalTo("name").with(new SampleCoGroup());

    env.execute();
}

public static class SampleCoGroup
        implements CoGroupFunction<StudentInfo, StudentInfo, Tuple2<StudentInfo, StudentInfo>> {

    @Override
    public void coGroup(Iterable<StudentInfo> iterable, Iterable<StudentInfo> iterable1,
            Collector<Tuple2<StudentInfo, StudentInfo>> collector) throws Exception {
        for (StudentInfo info : iterable1) {
            infos.add(info); // faulty behavior
        }
        // ...
    }
}
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aalexandrov/flink fix_reusing_iterator

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/559.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #559

commit c90d15cbedb1e5b587ac527e65659a9b85710d9f
Author: Alexander Alexandrov
Date: 2015-04-01T16:08:26Z

    Fixup ReusingKeyGroupedIterator value iterators in NonReusingSortMergeCoGroupIterator.
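As an aside, the sketch below illustrates the defensive copy that user code would otherwise need when collecting elements from a (wrongly) reusing value iterator. Here `deepCopy` is a placeholder for any real copy mechanism (e.g., a `TypeSerializer`'s copy), not an API from this PR:

```scala
import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer

// With a reusing iterator the runtime keeps mutating one instance, so appending
// the raw reference would leave the buffer holding N pointers to the last element.
def collectSafely[T](values: java.lang.Iterable[T], deepCopy: T => T): ArrayBuffer[T] = {
  val out = ArrayBuffer.empty[T]
  for (v <- values.asScala) {
    out += deepCopy(v) // without the copy, this reproduces the faulty behavior above
  }
  out
}
```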
[GitHub] flink pull request: Add support for building Flink with Scala 2.11
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/477#issuecomment-83989884

I changed it as @StephanEwen suggested. Can somebody with admin access please cancel all previous pending builds for this PR here:

https://travis-ci.org/apache/flink/pull_requests

We can save some energy and time :)
[GitHub] flink pull request: Add support for building Flink with Scala 2.11
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/477#issuecomment-83985616

All but one of the profiles (including the two Scala 2.11 profiles) build with the latest rebase: https://travis-ci.org/stratosphere/flink/builds/55147681

The one that fails stalls on a `flink-streaming-core` test case which I think is not related to the Scala 2.11 migration (it is actually a build that uses 2.10). I suggest adapting the `.travis.yml` as follows in order to keep the number of builds <= 5:

* add `-Dscala-2.11` to one of the hadoop1 profiles
* add `-Dscala-2.11` to one of the hadoop2 profiles

and then merge.
[GitHub] flink pull request: Add support for building Flink with Scala 2.11
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/477#issuecomment-83614068

> If I understand correctly, this does not affect the default mode of building against Scala 2.10

Correct. It just adds a profile that (optionally) builds with 2.11 and two Travis profiles that use the option.
[GitHub] flink pull request: Add support for building Flink with Scala 2.11
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/477#issuecomment-83613932

> Can we "retrofit" some of the other builds to use Scala 2.11?

I guess @rmetzger should answer this.
[GitHub] flink pull request: Add support for building Flink with Scala 2.11
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/477#issuecomment-83521096

Rebased against the current master (798e59523f0599096a214c24045a8f011b53ddbc). Can we please proceed with merging? The POMs are changing on a daily basis, and I have to manually adapt them after each change. The current PR at least adds a profile and runs a Travis build that will complain if the code or the POMs are not 2.11 compatible.
[GitHub] flink pull request: Add support for building Flink with Scala 2.11
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/477#issuecomment-83491250

One more thing. If we agree to add suffixes and ship multi-build artifacts, and we are going to restructure the Maven projects anyway, why not add suffixes to both sets of artifacts?

I understand the appeal of adding a suffix only for 2.11 for backward-compatibility reasons, but with the envisioned package restructure everybody will have to touch their POMs when migrating client code anyway. Adding a suffix consistently for all versions makes dependency management easier on the client side. Users can then configure their dependency entries like this

```xml
<properties>
  <scala.tools.version>2.11</scala.tools.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-core_${scala.tools.version}</artifactId>
    <version>0.9</version>
  </dependency>
  <dependency>
    <groupId>org.scalanlp</groupId>
    <artifactId>breeze_${scala.tools.version}</artifactId>
    <version>0.10</version>
  </dependency>
</dependencies>
```

and switch from 2.10 to 2.11 by just changing the property value.
[GitHub] flink pull request: Add support for building Flink with Scala 2.11
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/477#issuecomment-82930555

Can we merge this before handling [FLINK-1712](https://issues.apache.org/jira/browse/FLINK-1712)? I can open a separate PR that adds the suffixes and automates the deployment of two sets of artifacts later if we agree on that. The current PR at least introduces scala-2.11 and makes sure that newly contributed Scala code is both 2.10 and 2.11 compatible.
[GitHub] flink pull request: Add support for building Flink with Scala 2.11
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/477#issuecomment-82266786

If we go with the suffix, we basically have two options:

1. Add the suffix only to modules that use Scala
2. Add the suffix to all Maven modules, regardless of whether they use Scala or not

The downside of option (1) is that we might end up renaming modules incrementally if more of them gain Scala code in the future. The downside of option (2) is the larger number of lines that need to be adapted in the POMs. My two cents are for (2).
[GitHub] flink pull request: Add support for building Flink with Scala 2.11
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/477#issuecomment-8159

H... now it passes. This is weird...
[GitHub] flink pull request: Add support for building Flink with Scala 2.11
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/477#issuecomment-79048001

The scala-2.11 build

```bash
mvn -Dscala-2.11 clean install
```

fails on travis when trying to build the quickstart (@rmetzger reported the same problem above). It seems that the `flink-quickstart-scala` package [downloads and uses](https://travis-ci.org/stratosphere/flink/jobs/54233598#L8055) the current nightlies of `flink-scala` from Sonatype instead of using the package-local version. On my local machine, the same command executes without a problem.
[GitHub] flink pull request: Add support for building Flink with Scala 2.11
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/477#issuecomment-78942601

It seems that I accidentally removed the `maven-shade-plugin` section from the new `pom.xml` while rebasing. Let's see whether the tests pass now.
[GitHub] flink pull request: Add support for building Flink with Scala 2.11
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/477#issuecomment-78502251

[Some of the tests failed](https://travis-ci.org/apache/flink/builds/53879571) but I'm not sure whether this is related to the changes or not. I get the following error in the three failed Travis jobs:

```
[ERROR] Failed to execute goal on project XXX: Could not resolve dependencies for project org.apache.flink.archetypetest:testArtifact:jar:0.1: Could not find artifact org.apache.flink:flink-shaded-include-yarn:jar:0.9-SNAPSHOT in sonatype-snapshots (https://oss.sonatype.org/content/repositories/snapshots/)
```

Could it be that the new artifact has not yet made it to Sonatype?
[GitHub] flink pull request: Add support for building Flink with Scala 2.11
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/477#issuecomment-78307840

Any remarks on my reply to the general comments from @rmetzger (Scala version, suffix)?
[GitHub] flink pull request: Add support for building Flink with Scala 2.11
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/477#issuecomment-78262466

> Why are you using Scala 2.11.4? The mentioned bug seems to be fixed in 2.11.6.

2.11.6 will be out in April, so the choice is between 2.11.5 (which is known to have issues) and 2.11.4.

> Why aren't we adding a _2.11 suffix to the Scala 2.11 Flink builds?

We can do this, and it certainly makes sense if we want to ship pre-builds of both versions. With the current setup, if you want to use Flink with 2.11 you have to build and install the Maven projects yourself (I'm blindly following the Spark model here; let me know if you prefer the other option).
[GitHub] flink pull request: Add support for building Flink with Scala 2.11
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/477#issuecomment-78348901

I rebased on the current master (includes the changes from PR #454 merged today). I'll take a look at the errors thrown on building later.
[GitHub] flink pull request: Add support for building Flink with Scala 2.11
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/477#issuecomment-78333617

> The reason why I asked regarding Scala 2.11.6 was because this version is shown on the scala-lang website next to the download button.

Snap, I guess Christmas came early this year :)
[GitHub] flink pull request: Add support for building Flink with Scala 2.11
Github user aalexandrov commented on a diff in the pull request:

https://github.com/apache/flink/pull/477#discussion_r26210582

--- Diff: pom.xml ---

@@ -599,6 +599,38 @@ under the License.
+			scala-2.10
+
+
+
+				!scala-2.11.profile
+
+
+
+
+				2.10.4
+				2.10
+				2.0.1
+				2.3.7
--- End diff --

OK.
[GitHub] flink pull request: Add support for building Flink with Scala 2.11
Github user aalexandrov commented on a diff in the pull request:

https://github.com/apache/flink/pull/477#discussion_r26210554

--- Diff: flink-scala/pom.xml ---

@@ -236,4 +230,23 @@ under the License.
+
+
+			scala-2.10
+
+
+
+				!scala-2.11.profile
--- End diff --

I have created two profiles, `scala-2.10` and `scala-2.11`, and configured their activation to be mutually exclusive based on a dedicated environment variable (`scala-2.11.profile`, but it could be changed to `scala-2.11`). If you want to build with 2.10 or 2.11, you do:

```bash
mvn package                          # for 2.10
mvn package -Dscala-2.11.profile     # for 2.11
```

The 2.10 profile is (implicitly) activated by default at the moment because the activation environment variable `scala-2.11.profile` is not set by default. I can rewrite it as you suggested (explicit activation based on profile names), but then the syntax for building with 2.11 becomes somewhat more cumbersome:

```bash
mvn package                           # for 2.10
mvn package -P!scala-2.10,scala-2.11  # for 2.11
```

Bear in mind that the current setup does not prohibit you from forcefully activating / deactivating profiles from the IDE or based on their names. The second set of commands should still work (I will verify this in a minute).
[GitHub] flink pull request: Add support for building Flink with Scala 2.11
GitHub user aalexandrov opened a pull request:

https://github.com/apache/flink/pull/477

Add support for building Flink with Scala 2.11

This adds support for building Flink with Scala 2.11, as [per the mailing list discussion](http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Offer-Flink-with-Scala-2-11-td4159.html).

I fixed the 2.11 version to 2.11.4 because 2.11.5 contains a [somewhat nasty bug](https://issues.scala-lang.org/browse/SI-9089) that might cause issues in some situations. I also added a Travis profile that builds with the `scala-2.11` profile enabled and `oraclejdk7` -- you might wish to change that to `oraclejdk8` or something else.

Once this is merged I can take a look at the deploy scripts and extend them so we ship and deploy pre-builds for both 2.10 and 2.11.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/stratosphere/flink scala_2.11

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/477.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #477

commit e3b021cb5ebbc3ef6250fd6c1777b4fa55f34952
Author: Alexander Alexandrov
Date: 2015-02-27T21:53:32Z

    Migrated Scala version to 2.11.4 in the project POMs.

commit f655761586df24d9263f211121437f933838fd69
Author: Alexander Alexandrov
Date: 2015-02-27T21:53:54Z

    Migrated Scala code to 2.11.4.

commit 896d3a4185962c30384f690021e193615fc66f98
Author: Alexander Alexandrov
Date: 2015-03-10T23:35:18Z

    Added a travis profile for a scala-2.11 build.
[GitHub] flink pull request: Remove 'incubator-' prefix from README.md.
GitHub user aalexandrov opened a pull request:

https://github.com/apache/flink/pull/371

Remove 'incubator-' prefix from README.md.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aalexandrov/flink patch-1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/371.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #371

commit f8da7b89b4fa74c621983c7d0273d840349cf095
Author: Alexander Alexandrov
Date: 2015-02-06T14:23:02Z

    Remove 'incubator-' prefix from README.md.
[GitHub] flink pull request: Allow KeySelectors to implement ResultTypeQuer...
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/354#issuecomment-72857627

I think that this only happens for the primitive types, and I think this is by design. If you want to inspect generic parameters, Scala forces you to use the Scala reflection API. There is no way around that.
[GitHub] flink pull request: Allow KeySelectors to implement ResultTypeQuer...
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/354#issuecomment-72660810

I think that [this StackOverflow article explains my problem](http://stackoverflow.com/questions/11586944/how-to-obtain-the-raw-datatype-of-a-parameter-of-a-field-that-is-specialized-in).
[GitHub] flink pull request: Allow KeySelectors to implement ResultTypeQuer...
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/354#issuecomment-72660318

I would advocate adding this one as well, as a fallback option. I have a situation where I want to use a KeySelector that might return Java TupleXX instances parameterized with Scala types, e.g.:

```
class SelectFoo extends KeySelector[Tuple3[Int, Int, String], Tuple2[Int, String]] {
  override def getKey(v: Tuple3[Int, Int, String]) = new Tuple2(v.f0, v.f2)
}
```

Even though in this case the generic parameters are available, the Scala types cannot be inferred, because the actual type parameters of the fields are erased by Scala and are seen only as java.lang.Object from the Java reflection API.
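For what it's worth, a hedged sketch of what the proposed fallback would let such a selector do; the `TupleTypeInfo` wiring below is my assumption of how one would spell out the produced type by hand, not code from this PR:

```scala
import org.apache.flink.api.common.typeinfo.{BasicTypeInfo, TypeInformation}
import org.apache.flink.api.java.functions.KeySelector
import org.apache.flink.api.java.tuple.{Tuple2, Tuple3}
import org.apache.flink.api.java.typeutils.{ResultTypeQueryable, TupleTypeInfo}

// If KeySelectors may implement ResultTypeQueryable, the selector can declare
// its produced type explicitly instead of relying on reflection-based extraction.
class SelectFoo extends KeySelector[Tuple3[Int, Int, String], Tuple2[Int, String]]
    with ResultTypeQueryable[Tuple2[Int, String]] {

  override def getKey(v: Tuple3[Int, Int, String]) = new Tuple2(v.f0, v.f2)

  override def getProducedType: TypeInformation[Tuple2[Int, String]] =
    new TupleTypeInfo[Tuple2[Int, String]](
      BasicTypeInfo.INT_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO)
}
```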
[GitHub] flink pull request: Added ResultTypeQueryable interface to TypeSer...
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/349#issuecomment-72284550

@hsaputra: I just did: [FLINK-1464](https://issues.apache.org/jira/browse/FLINK-1464).
[GitHub] flink pull request: Added ResultTypeQueryable interface to TypeSer...
GitHub user aalexandrov opened a pull request:

https://github.com/apache/flink/pull/349

Added ResultTypeQueryable interface to TypeSerializerInputFormat.

It is currently impossible to use the `TypeSerializerInputFormat` with generic Tuple types. For example, [this example gist](https://gist.github.com/aalexandrov/90bf21f66bf604676f37) fails with the following exception:

```
Exception in thread "main" org.apache.flink.api.common.InvalidProgramException: The type returned by the input format could not be automatically determined. Please specify the TypeInformation of the produced type explicitly.
	at org.apache.flink.api.java.ExecutionEnvironment.readFile(ExecutionEnvironment.java:341)
	at SerializedFormatExample$.main(SerializedFormatExample.scala:48)
	at SerializedFormatExample.main(SerializedFormatExample.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
```

To fix the issue, I changed the constructor to take a `TypeInformation` instead of a `TypeSerializer` argument. If this is indeed a bug, I think that this is a good solution. Unfortunately the fix breaks the API. Feel free to change it if you find a more elegant solution compatible with the 0.8 branch.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aalexandrov/flink typeserializerinputformat_fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/349.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #349

commit 2c1816640574c182f3b9f556ec5bab5df6f56f2c
Author: Alexander Alexandrov
Date: 2015-01-29T14:41:01Z

    Added ResultTypeQueryable interface implementation to TypeSerializerInputFormat.
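The shape of the change is essentially the `ResultTypeQueryable` pattern. A minimal hedged sketch (the class name below is hypothetical; the actual patch modifies `TypeSerializerInputFormat` itself):

```scala
import org.apache.flink.api.common.io.BinaryInputFormat
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.java.typeutils.ResultTypeQueryable

// An input format that can be asked for its produced type, so that
// ExecutionEnvironment.readFile no longer needs to guess it via reflection.
abstract class TypeQueryableInputFormat[T](typeInfo: TypeInformation[T])
    extends BinaryInputFormat[T] with ResultTypeQueryable[T] {

  override def getProducedType: TypeInformation[T] = typeInfo
}
```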
[GitHub] flink commit comment: 9ee74f3cf7b04740d6b00edf45d9413ff40322cf
Github user aalexandrov commented on commit 9ee74f3cf7b04740d6b00edf45d9413ff40322cf:

https://github.com/apache/flink/commit/9ee74f3cf7b04740d6b00edf45d9413ff40322cf#commitcomment-9297124

In .gitignore on line 1: Is removing all these lines a standard procedure for the release branches?
[GitHub] flink pull request: [FLINK-986] [FLINK-25] [Distributed runtime] A...
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/254#issuecomment-69539129

YES!!!
[GitHub] flink pull request: [FLINK-986] [FLINK-25] [Distributed runtime] A...
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/254#issuecomment-69343812

I am excited!