[GitHub] flink pull request: [FLINK-1319][core] Add static code analysis fo...
Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/729#issuecomment-105840358 The tests are failing with checkstyle violations. Can you fix those? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560695#comment-14560695 ] ASF GitHub Bot commented on FLINK-1319: --- Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/729#issuecomment-105840358 The tests are failing with checkstyle violations. Can you fix those? Add static code analysis for UDFs - Key: FLINK-1319 URL: https://issues.apache.org/jira/browse/FLINK-1319 Project: Flink Issue Type: New Feature Components: Java API, Scala API Reporter: Stephan Ewen Assignee: Timo Walther Priority: Minor Flink's Optimizer takes information that tells it for UDFs which fields of the input elements are accessed, modified, or forwarded/copied. This information frequently helps to reuse partitionings, sorts, etc. It may speed up programs significantly, as it can frequently eliminate sorts and shuffles, which are costly. Right now, users can add lightweight annotations to UDFs to provide this information (such as adding {{@ConstantFields(0->3, 1, 2->1)}}). We worked with static code analysis of UDFs before, to determine this information automatically. This is an incredible feature, as it magically makes programs faster. For record-at-a-time operations (Map, Reduce, FlatMap, Join, Cross), this works surprisingly well in many cases. We used the Soot toolkit for the static code analysis. Unfortunately, Soot is LGPL licensed and thus we did not include any of the code so far. I propose to add this functionality to Flink, in the form of a drop-in addition, to work around the LGPL incompatibility with ASL 2.0. Users could simply download a special flink-code-analysis.jar and drop it into the lib folder to enable this functionality. We may even add a script to tools that downloads that library automatically into the lib folder. 
This should be legally fine, since we do not redistribute LGPL code and only dynamically link it (the incompatibility with ASL 2.0 is mainly in the patentability, if I remember correctly). Prior work on this has been done by [~aljoscha] and [~skunert], who could provide a code base to start with. *Appendix* Homepage of the Soot static analysis toolkit: http://www.sable.mcgill.ca/soot/ Papers on static analysis and optimization: http://stratosphere.eu/assets/papers/EnablingOperatorReorderingSCA_12.pdf and http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf Quick introduction to the Optimizer: http://stratosphere.eu/assets/papers/2014-VLDBJ_Stratosphere_Overview.pdf (Section 6) Optimizer for Iterations: http://stratosphere.eu/assets/papers/spinningFastIterativeDataFlows_12.pdf (Sections 4.3 and 5.3) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
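The semantic-annotation mechanism described above can be sketched without Flink on the classpath. The annotation and UDF below are simplified, hypothetical stand-ins for Flink's actual annotations, only meant to show the idea: the annotation declares that a field survives the function unchanged, so an optimizer can read that metadata instead of analyzing bytecode.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.Arrays;

// Simplified stand-in for Flink's semantic annotations (hypothetical, not Flink's real API).
@Retention(RetentionPolicy.RUNTIME)
@interface ConstantFields {
    String[] value();
}

// A UDF-like class declaring that input field 0 reaches the output unchanged,
// so a partitioning or sort on field 0 survives this operation.
@ConstantFields("0")
class ScaleSecondField {
    long[] map(long[] record) {
        return new long[] {record[0], record[1] * 2};
    }
}

public class ForwardedFieldsSketch {
    public static void main(String[] args) {
        // An optimizer would read this metadata instead of statically analyzing the bytecode.
        ConstantFields info = ScaleSecondField.class.getAnnotation(ConstantFields.class);
        System.out.println(Arrays.toString(info.value()));
    }
}
```

The proposed code-analysis add-on would derive exactly this kind of metadata automatically, so users no longer have to write the annotations by hand.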
[GitHub] flink pull request: Implemented TwitterSourceFilter and adapted Tw...
Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/695#issuecomment-105845451 I tried to merge this, but it is quite unmergeable. Before we can add this, we need you to rebase this pull request and squash all commits into one commit.
[GitHub] flink pull request: Flink Storm compatibility
Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/573#issuecomment-105849950 Anything blocking this pull request, or can it be merged?
[GitHub] flink pull request: [FLINK-1319][core] Add static code analysis fo...
Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/729#issuecomment-105814322 Indeed, this looks like an impressive addition. Let's get it into 0.9 as a beta feature!
[GitHub] flink pull request: basic TfidfTransformer
Github user aalexandrov commented on a diff in the pull request: https://github.com/apache/flink/pull/730#discussion_r31113442 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/feature/TfIdfTransformer.scala --- @@ -0,0 +1,107 @@ +package org.apache.flink.ml.feature + +import org.apache.flink.api.scala.{DataSet, _} +import org.apache.flink.ml.common.{Parameter, ParameterMap, Transformer} +import org.apache.flink.ml.math.SparseVector + +import scala.math.log +import scala.util.hashing.MurmurHash3 +import scala.collection.mutable.LinkedHashSet; + +/** + * This transformer calculates the term-frequency times inverse document frequency for the given DataSet of documents. + * The DataSet will be treated as the corpus of documents. + * <p> + * The TF is the frequency of a word inside one document + * <p> + * The IDF for a word is calculated: log(total number of documents / documents that contain the word) + 1 + * <p> + * This transformer returns a SparseVector where the index is the hash of the word and the value the tf-idf. 
+ * @author Ronny Bräunlich + * @author Vassil Dimov + * @author Filip Perisic + */ +class TfIdfTransformer extends Transformer[(Int, Seq[String]), (Int, SparseVector)] { + + override def transform(input: DataSet[(Int /* docId */ , Seq[String] /*The document */ )], transformParameters: ParameterMap): DataSet[(Int, SparseVector)] = { + +val params = transformParameters.get(StopWordParameter) + +// Here we will store the words in the form (docId, word, count) +// Count represents the occurrence of word in document docId +val wordCounts = input + //count the words + .flatMap(t => { + //create tuples docId, word, 1 + t._2.map(s => (t._1, s, 1)) +}) + .filter(t => !transformParameters.apply(StopWordParameter).contains(t._2)) + //group by document and word + .groupBy(0, 1) + // calculate the occurrence count of each word in specific document + .sum(2) + +val dictionary = wordCounts + .map(t => LinkedHashSet(t._2)) + .reduce((set1, set2) => set1 ++ set2) --- End diff -- better use `groupReduce` which will give you the materialized set right away.
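The formula in the transformer's Scaladoc can be checked in isolation. A minimal, self-contained sketch (plain Java, independent of the Flink transformer and its hashing scheme):

```java
public class TfIdfSketch {
    // idf(word) = log(total documents / documents containing the word) + 1,
    // matching the formula stated in the transformer's Scaladoc.
    static double idf(int totalDocs, int docsWithWord) {
        return Math.log((double) totalDocs / docsWithWord) + 1.0;
    }

    // tf is the raw count of the word inside one document; tf-idf is their product.
    static double tfIdf(int countInDoc, int totalDocs, int docsWithWord) {
        return countInDoc * idf(totalDocs, docsWithWord);
    }

    public static void main(String[] args) {
        // Two-document corpus: a word in both docs gets idf = log(2/2) + 1 = 1.0,
        // a word in only one doc gets idf = log(2/1) + 1 ≈ 1.693.
        System.out.printf("%.3f%n", idf(2, 2));
        System.out.printf("%.3f%n", idf(2, 1));
    }
}
```

Note that with this variant a word occurring in every document still gets a positive weight (idf = 1), which is why the "+ 1" is part of the formula.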
[GitHub] flink pull request: [core] cleanup tests for FileInputFormat
Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/732#issuecomment-105840172 Seems reasonable. +1 to merge
[GitHub] flink pull request: [FLINK-2047] [ml] Rename CoCoA to SVM
GitHub user thvasilo opened a pull request: https://github.com/apache/flink/pull/733 [FLINK-2047] [ml] Rename CoCoA to SVM The CoCoA algorithm as implemented functions as an SVM classifier. As CoCoA mostly concerns the optimization process and not the actual learning algorithm, it makes sense to rename the learner to SVM which users are more familiar with. You can merge this pull request into a Git repository by running: $ git pull https://github.com/thvasilo/flink cocoa-svm Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/733.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #733 commit 5768235ca73de8b03e9a523b7853a57f058806f1 Author: Theodore Vasiloudis t...@sics.se Date: 2015-05-27T10:00:17Z Renaming CoCoA to SVM
[jira] [Commented] (FLINK-2047) Rename CoCoA to SVM
[ https://issues.apache.org/jira/browse/FLINK-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560719#comment-14560719 ] ASF GitHub Bot commented on FLINK-2047: --- GitHub user thvasilo opened a pull request: https://github.com/apache/flink/pull/733 [FLINK-2047] [ml] Rename CoCoA to SVM The CoCoA algorithm as implemented functions as an SVM classifier. As CoCoA mostly concerns the optimization process and not the actual learning algorithm, it makes sense to rename the learner to SVM which users are more familiar with. You can merge this pull request into a Git repository by running: $ git pull https://github.com/thvasilo/flink cocoa-svm Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/733.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #733 commit 5768235ca73de8b03e9a523b7853a57f058806f1 Author: Theodore Vasiloudis t...@sics.se Date: 2015-05-27T10:00:17Z Renaming CoCoA to SVM Rename CoCoA to SVM --- Key: FLINK-2047 URL: https://issues.apache.org/jira/browse/FLINK-2047 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Theodore Vasiloudis Assignee: Theodore Vasiloudis Priority: Trivial Fix For: 0.9 The CoCoA algorithm as implemented functions as an SVM classifier. As CoCoA mostly concerns the optimization process and not the actual learning algorithm, it makes sense to rename the learner to SVM which users are more familiar with. In the future we would like to use the CoCoA algorithm to solve more large scale optimization problems for other learning algorithms.
[jira] [Commented] (FLINK-1952) Cannot run ConnectedComponents example: Could not allocate a slot on instance
[ https://issues.apache.org/jira/browse/FLINK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560726#comment-14560726 ] ASF GitHub Bot commented on FLINK-1952: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/731#discussion_r31119327 --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/instance/Slot.java --- @@ -24,27 +24,40 @@ import java.util.concurrent.atomic.AtomicIntegerFieldUpdater; /** - * Base class for task slots. TaskManagers offer one or more task slots, they define how many - * parallel tasks or task groups a TaskManager executes. + * Base class for task slots. TaskManagers offer one or more task slots, which define a slice of + * their resources. + * + * <p>In the simplest case, a slot holds a single task ({@link SimpleSlot}). In the more complex + * case, a slot is shared ({@link SharedSlot}) abd contains a set of tasks. Shared slots may contain --- End diff -- typo: abd -> and Cannot run ConnectedComponents example: Could not allocate a slot on instance - Key: FLINK-1952 URL: https://issues.apache.org/jira/browse/FLINK-1952 Project: Flink Issue Type: Bug Components: Scheduler Affects Versions: 0.9 Reporter: Robert Metzger Priority: Blocker Steps to reproduce {code} ./bin/yarn-session.sh -n 350 {code} ... wait until they are connected ... {code} Number of connected TaskManagers changed to 266. Slots available: 266 Number of connected TaskManagers changed to 323. Slots available: 323 Number of connected TaskManagers changed to 334. Slots available: 334 Number of connected TaskManagers changed to 343. Slots available: 343 Number of connected TaskManagers changed to 350. 
Slots available: 350 {code} Start CC {code} ./bin/flink run -p 350 ./examples/flink-java-examples-0.9-SNAPSHOT-ConnectedComponents.jar {code} --- it runs Run KMeans, let it fail with {code} Failed to deploy the task Map (Map at main(KMeans.java:100)) (1/350) - execution #0 to slot SimpleSlot (2)(2)(0) - 182b7661ca9547a84591de940c47a200 - ALLOCATED/ALIVE: java.io.IOException: Insufficient number of network buffers: required 350, but only 254 available. The total number of network buffers is currently set to 2048. You can increase this number by setting the configuration key 'taskmanager.network.numberOfBuffers'. {code} ... as expected. (I've waited for 10 minutes between the two submissions) Starting CC now will fail: {code} ./bin/flink run -p 350 ./examples/flink-java-examples-0.9-SNAPSHOT-ConnectedComponents.jar {code} Error message(s): {code} Caused by: java.lang.IllegalStateException: Could not schedule consumer vertex IterationHead(WorksetIteration (Unnamed Delta Iteration)) (19/350) at org.apache.flink.runtime.executiongraph.Execution$3.call(Execution.java:479) at org.apache.flink.runtime.executiongraph.Execution$3.call(Execution.java:469) at akka.dispatch.Futures$$anonfun$future$1.apply(Future.scala:94) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107) ... 4 more Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate a slot on instance 4a6d761cb084c32310ece1f849556faf @ cloud-19 - 1 slots - URL: akka.tcp://flink@130.149.21.23:51400/user/taskmanager, as required by the co-location constraint. 
at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:247) at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:110) at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:262) at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:436) at org.apache.flink.runtime.executiongraph.Execution$3.call(Execution.java:475) ... 9 more {code}
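The first failure above names the configuration key to change. A sketch of the corresponding `flink-conf.yaml` entry (the value 4096 is illustrative; the right number depends on the parallelism and slot count of the setup):

```yaml
# flink-conf.yaml -- raise the global network buffer pool so that each of the
# 350 parallel tasks can obtain the buffers it requests (value is illustrative).
taskmanager.network.numberOfBuffers: 4096
```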
[GitHub] flink pull request: [FLINK-1952] [jobmanager] Rework and fix slot ...
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/731#discussion_r31119327 --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/instance/Slot.java --- @@ -24,27 +24,40 @@ import java.util.concurrent.atomic.AtomicIntegerFieldUpdater; /** - * Base class for task slots. TaskManagers offer one or more task slots, they define how many - * parallel tasks or task groups a TaskManager executes. + * Base class for task slots. TaskManagers offer one or more task slots, which define a slice of + * their resources. + * + * <p>In the simplest case, a slot holds a single task ({@link SimpleSlot}). In the more complex + * case, a slot is shared ({@link SharedSlot}) abd contains a set of tasks. Shared slots may contain --- End diff -- typo: abd -> and
[jira] [Created] (FLINK-2097) Add support for JobSessions
Stephan Ewen created FLINK-2097: --- Summary: Add support for JobSessions Key: FLINK-2097 URL: https://issues.apache.org/jira/browse/FLINK-2097 Project: Flink Issue Type: New Feature Components: JobManager Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Maximilian Michels Fix For: 0.9 Sessions make sure that the JobManager does not immediately discard a JobGraph after execution, but keeps it around for further operations to be attached to the graph. By keeping the JobGraph around, the cached streams (intermediate data) are also kept. That is the way of realizing interactive sessions on top of a streaming dataflow abstraction. ExecutionGraphs should be kept as long as - no timeout occurred or - the session has not been explicitly ended
[jira] [Commented] (FLINK-2073) Add contribution guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560708#comment-14560708 ] ASF GitHub Bot commented on FLINK-2073: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/727#discussion_r31118088 --- Diff: docs/libs/ml/contribution_guide.md --- @@ -20,7 +21,329 @@ specific language governing permissions and limitations under the License. -- +The Flink community highly appreciates all sorts of contributions to FlinkML. +FlinkML offers people interested in machine learning the opportunity to work on a highly active open source project which makes scalable ML a reality. +The following document describes how to contribute to FlinkML. + * This will be replaced by the TOC {:toc} -Coming soon. In the meantime, check our list of [open issues on JIRA](https://issues.apache.org/jira/browse/FLINK-1748?jql=component%20%3D%20%22Machine%20Learning%20Library%22%20AND%20project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC) +## Getting Started + +In order to get started, first read Flink's [contribution guide](http://flink.apache.org/how-to-contribute.html). +Everything from this guide also applies to FlinkML. + +## Pick a Topic + +If you are looking for some new ideas, then you should check out the list of [unresolved issues on JIRA](https://issues.apache.org/jira/issues/?jql=component%20%3D%20%22Machine%20Learning%20Library%22%20AND%20project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC). +Once you decide to contribute to one of these issues, you should take ownership of it and track your progress with this issue. +That way, the other contributors know the state of the different issues and redundant work is avoided. + +If you already know what you want to contribute to FlinkML, all the better. 
+It is still advisable to create a JIRA issue for your idea to tell the Flink community what you want to do, though. + +## Testing + +New contributions should come with tests to verify the correct behavior of the algorithm. +The tests help to maintain the algorithm's correctness throughout code changes, e.g. refactorings. + +We distinguish between unit tests, which are executed during maven's test phase, and integration tests, which are executed during maven's verify phase. +Maven automatically makes this distinction by using the following naming rules: +All test cases whose class name ends with a suffix fulfilling the regular expression `(IT|Integration)(Test|Suite|Case)`, are considered integration tests. +The rest are considered unit tests and should only test behavior which is local to the component under test. + +An integration test is a test which requires the full Flink system to be started. +In order to do that properly, all integration test cases have to mix in the trait `FlinkTestBase`. +This trait will set the right `ExecutionEnvironment` so that the test will be executed on a special `FlinkMiniCluster` designated for testing purposes. +Thus, an integration test could look like the following: + +{% highlight scala %} +class ExampleITSuite extends FlatSpec with FlinkTestBase { + behavior of "An example algorithm" + + it should "do something" in { +... + } +} +{% endhighlight %} + +The test style does not have to be `FlatSpec` but can be any other scalatest `Suite` subclass. + +## Documentation + +When contributing new algorithms, it is required to add code comments describing the functioning of the algorithm and its parameters with which the user can control its behavior. +Additionally, we would like to encourage contributors to add this information to the online documentation. +The online documentation for FlinkML's components can be found in the directory `docs/libs/ml`. + +Every new algorithm is described by a single markdown file. 
+This file should contain at least the following points: + +1. What does the algorithm do +2. How does the algorithm work (or reference to description) +3. Parameter description with default values +4. Code snippet showing how the algorithm is used + +In order to use latex syntax in the markdown file, you have to include `mathjax: include` in the YAML front matter. + +{% highlight java %} +--- +mathjax: include +title: Example title +--- +{% endhighlight %} + +In order to use displayed mathematics, you have to put your latex code in `$$ ... $$`. +For in-line mathematics, use `$ ... $`. +Additionally some predefined latex commands are included into the scope of your markdown file. +See
[GitHub] flink pull request: [FLINK-2073] [ml] [docs] Adds contribution gui...
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/727#discussion_r31118088 --- Diff: docs/libs/ml/contribution_guide.md --- @@ -20,7 +21,329 @@ specific language governing permissions and limitations under the License. -- +The Flink community highly appreciates all sorts of contributions to FlinkML. +FlinkML offers people interested in machine learning the opportunity to work on a highly active open source project which makes scalable ML a reality. +The following document describes how to contribute to FlinkML. + * This will be replaced by the TOC {:toc} -Coming soon. In the meantime, check our list of [open issues on JIRA](https://issues.apache.org/jira/browse/FLINK-1748?jql=component%20%3D%20%22Machine%20Learning%20Library%22%20AND%20project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC) +## Getting Started + +In order to get started, first read Flink's [contribution guide](http://flink.apache.org/how-to-contribute.html). +Everything from this guide also applies to FlinkML. + +## Pick a Topic + +If you are looking for some new ideas, then you should check out the list of [unresolved issues on JIRA](https://issues.apache.org/jira/issues/?jql=component%20%3D%20%22Machine%20Learning%20Library%22%20AND%20project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC). +Once you decide to contribute to one of these issues, you should take ownership of it and track your progress with this issue. +That way, the other contributors know the state of the different issues and redundant work is avoided. + +If you already know what you want to contribute to FlinkML, all the better. +It is still advisable to create a JIRA issue for your idea to tell the Flink community what you want to do, though. + +## Testing + +New contributions should come with tests to verify the correct behavior of the algorithm. 
+The tests help to maintain the algorithm's correctness throughout code changes, e.g. refactorings. + +We distinguish between unit tests, which are executed during maven's test phase, and integration tests, which are executed during maven's verify phase. +Maven automatically makes this distinction by using the following naming rules: +All test cases whose class name ends with a suffix fulfilling the regular expression `(IT|Integration)(Test|Suite|Case)`, are considered integration tests. +The rest are considered unit tests and should only test behavior which is local to the component under test. + +An integration test is a test which requires the full Flink system to be started. +In order to do that properly, all integration test cases have to mix in the trait `FlinkTestBase`. +This trait will set the right `ExecutionEnvironment` so that the test will be executed on a special `FlinkMiniCluster` designated for testing purposes. +Thus, an integration test could look like the following: + +{% highlight scala %} +class ExampleITSuite extends FlatSpec with FlinkTestBase { + behavior of "An example algorithm" + + it should "do something" in { +... + } +} +{% endhighlight %} + +The test style does not have to be `FlatSpec` but can be any other scalatest `Suite` subclass. + +## Documentation + +When contributing new algorithms, it is required to add code comments describing the functioning of the algorithm and its parameters with which the user can control its behavior. +Additionally, we would like to encourage contributors to add this information to the online documentation. +The online documentation for FlinkML's components can be found in the directory `docs/libs/ml`. + +Every new algorithm is described by a single markdown file. +This file should contain at least the following points: + +1. What does the algorithm do +2. How does the algorithm work (or reference to description) +3. Parameter description with default values +4. 
Code snippet showing how the algorithm is used + +In order to use latex syntax in the markdown file, you have to include `mathjax: include` in the YAML front matter. + +{% highlight java %} +--- +mathjax: include +title: Example title +--- +{% endhighlight %} + +In order to use displayed mathematics, you have to put your latex code in `$$ ... $$`. +For in-line mathematics, use `$ ... $`. +Additionally some predefined latex commands are included into the scope of your markdown file. +See `docs/_include/latex_commands.html` for the complete list of predefined latex commands. + +## Contributing + +Once you have implemented the algorithm with adequate test coverage and added documentation, you are ready to open a pull request.
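The naming rule quoted in the guide ("class name ends with a suffix fulfilling `(IT|Integration)(Test|Suite|Case)`") can be demonstrated directly; a small sketch of that suffix check using `java.util.regex` (the class names tested are illustrative):

```java
import java.util.regex.Pattern;

public class TestNamingRule {
    // Integration tests: the class name must END with a suffix matching
    // (IT|Integration)(Test|Suite|Case); everything else is a unit test.
    static final Pattern IT_SUFFIX = Pattern.compile(".*(IT|Integration)(Test|Suite|Case)$");

    static boolean isIntegrationTest(String className) {
        return IT_SUFFIX.matcher(className).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIntegrationTest("ExampleITSuite"));            // "IT" + "Suite" -> true
        System.out.println(isIntegrationTest("ConnectedComponentsITCase")); // "IT" + "Case"  -> true
        System.out.println(isIntegrationTest("SparseVectorTest"));          // plain "Test"   -> false
    }
}
```

Maven's Surefire/Failsafe split relies on such patterns, which is why the guide stresses the suffix, not just the word "Test".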
[GitHub] flink pull request: [FLINK-2073] [ml] [docs] Adds contribution gui...
Github user tillrohrmann commented on the pull request: https://github.com/apache/flink/pull/727#issuecomment-105843688 I addressed your comments.
[jira] [Commented] (FLINK-2073) Add contribution guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560709#comment-14560709 ] ASF GitHub Bot commented on FLINK-2073: --- Github user tillrohrmann commented on the pull request: https://github.com/apache/flink/pull/727#issuecomment-105843688 I addressed your comments. Add contribution guide for FlinkML -- Key: FLINK-2073 URL: https://issues.apache.org/jira/browse/FLINK-2073 Project: Flink Issue Type: New Feature Components: Documentation, Machine Learning Library Reporter: Theodore Vasiloudis Assignee: Till Rohrmann Fix For: 0.9 We need a guide for contributions to FlinkML in order to encourage the extension of the library, and provide guidelines for developers. One thing that should be included is a step-by-step guide to create a transformer, or other Estimator
[GitHub] flink pull request: [core] cleanup tests for FileInputFormat
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/732
[GitHub] flink pull request: [FLINK-2073] [ml] [docs] Adds contribution gui...
Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/727#issuecomment-105861380 LGTM, only one minor comment. I also like the addition of descriptions for the fit and predict methods to each algorithm.
[GitHub] flink pull request: basic TfidfTransformer
Github user aalexandrov commented on a diff in the pull request: https://github.com/apache/flink/pull/730#discussion_r31113926 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/feature/TfIdfTransformer.scala --- @@ -0,0 +1,107 @@ +package org.apache.flink.ml.feature + +import org.apache.flink.api.scala.{DataSet, _} +import org.apache.flink.ml.common.{Parameter, ParameterMap, Transformer} +import org.apache.flink.ml.math.SparseVector + +import scala.math.log +import scala.util.hashing.MurmurHash3 +import scala.collection.mutable.LinkedHashSet; + +/** + * This transformer calculates the term-frequency times inverse document frequency for the given DataSet of documents. + * The DataSet will be treated as the corpus of documents. + * <p> + * The TF is the frequency of a word inside one document + * <p> + * The IDF for a word is calculated: log(total number of documents / documents that contain the word) + 1 + * <p> + * This transformer returns a SparseVector where the index is the hash of the word and the value the tf-idf. 
+ * @author Ronny Bräunlich + * @author Vassil Dimov + * @author Filip Perisic + */ +class TfIdfTransformer extends Transformer[(Int, Seq[String]), (Int, SparseVector)] { + + override def transform(input: DataSet[(Int /* docId */ , Seq[String] /*The document */ )], transformParameters: ParameterMap): DataSet[(Int, SparseVector)] = { + +val params = transformParameters.get(StopWordParameter) + +// Here we will store the words in the form (docId, word, count) +// Count represents the occurrence of word in document docId +val wordCounts = input + //count the words + .flatMap(t => { + //create tuples docId, word, 1 + t._2.map(s => (t._1, s, 1)) +}) + .filter(t => !transformParameters.apply(StopWordParameter).contains(t._2)) + //group by document and word + .groupBy(0, 1) + // calculate the occurrence count of each word in specific document + .sum(2) + +val dictionary = wordCounts + .map(t => LinkedHashSet(t._2)) + .reduce((set1, set2) => set1 ++ set2) + .map(set => set.zipWithIndex.toMap) --- End diff -- Make sure that you have exactly one `map ; flatMap` chain.
[jira] [Commented] (FLINK-2073) Add contribution guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560752#comment-14560752 ] ASF GitHub Bot commented on FLINK-2073: --- Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/727#issuecomment-105861380 LGTM, only one minor comment. I also like the addition of descriptions for the fit and predict methods to each algorithm. Add contribution guide for FlinkML -- Key: FLINK-2073 URL: https://issues.apache.org/jira/browse/FLINK-2073 Project: Flink Issue Type: New Feature Components: Documentation, Machine Learning Library Reporter: Theodore Vasiloudis Assignee: Till Rohrmann Fix For: 0.9 We need a guide for contributions to FlinkML in order to encourage the extension of the library, and provide guidelines for developers. One thing that should be included is a step-by-step guide to create a transformer, or other Estimator -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560609#comment-14560609 ] ASF GitHub Bot commented on FLINK-1319: --- Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/729#issuecomment-105814322 Indeed, this looks like an impressive addition. Let's get it into 0.9 as a beta feature!

Add static code analysis for UDFs
-
Key: FLINK-1319 URL: https://issues.apache.org/jira/browse/FLINK-1319 Project: Flink Issue Type: New Feature Components: Java API, Scala API Reporter: Stephan Ewen Assignee: Timo Walther Priority: Minor

Flink's Optimizer takes information that tells it for UDFs which fields of the input elements are accessed, modified, or forwarded/copied. This information frequently helps to reuse partitionings, sorts, etc. It may speed up programs significantly, as it can frequently eliminate sorts and shuffles, which are costly.

Right now, users can add lightweight annotations to UDFs to provide this information (such as adding {{@ConstantFields(0->3, 1, 2->1)}}).

We worked with static code analysis of UDFs before, to determine this information automatically. This is an incredible feature, as it magically makes programs faster. For record-at-a-time operations (Map, Reduce, FlatMap, Join, Cross), this works surprisingly well in many cases. We used the Soot toolkit for the static code analysis. Unfortunately, Soot is LGPL licensed and thus we did not include any of the code so far.

I propose to add this functionality to Flink, in the form of a drop-in addition, to work around the LGPL incompatibility with ASL 2.0. Users could simply download a special flink-code-analysis.jar and drop it into the lib folder to enable this functionality. We may even add a script to tools that downloads that library automatically into the lib folder.

This should be legally fine, since we do not redistribute LGPL code and only dynamically link it (the incompatibility with ASL 2.0 is mainly in the patentability, if I remember correctly). Prior work on this has been done by [~aljoscha] and [~skunert], which could provide a code base to start with.

*Appendix*

Homepage of the Soot static analysis toolkit: http://www.sable.mcgill.ca/soot/
Papers on static analysis and optimization: http://stratosphere.eu/assets/papers/EnablingOperatorReorderingSCA_12.pdf and http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf
Quick introduction to the Optimizer: http://stratosphere.eu/assets/papers/2014-VLDBJ_Stratosphere_Overview.pdf (Section 6)
Optimizer for Iterations: http://stratosphere.eu/assets/papers/spinningFastIterativeDataFlows_12.pdf (Sections 4.3 and 5.3)

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
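The field-forwarding information described above can be illustrated with a plain-Java sketch (no Flink dependency; class and method names are made up for illustration). A UDF that copies input field 0 through unchanged lets the optimizer keep an existing partitioning or sort on that field and skip a re-shuffle:

```java
public class ForwardedFieldsSketch {
    // A UDF on 3-field integer tuples that copies field 0 through unchanged
    // and rewrites field 2. Static analysis (or a manual annotation in the
    // style of the @ConstantFields mentioned above) would record that
    // field 0 is forwarded, so a partitioning on field 0 survives this map.
    static int[] map(int[] tuple) {
        return new int[] { tuple[0], tuple[1], tuple[2] * 2 };
    }

    public static void main(String[] args) {
        int[] out = map(new int[] {7, 1, 3});
        System.out.println(out[0]); // field 0 is forwarded unchanged
    }
}
```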
[jira] [Commented] (FLINK-2096) Remove implicit conversions in Streaming Scala API
[ https://issues.apache.org/jira/browse/FLINK-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560700#comment-14560700 ] ASF GitHub Bot commented on FLINK-2096: --- Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/726#issuecomment-105841951 +1 from my side. I think it is important that the core APIs let you see clearly what is happening, otherwise new users usually get thrown off. Remove implicit conversions in Streaming Scala API -- Key: FLINK-2096 URL: https://issues.apache.org/jira/browse/FLINK-2096 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.9 Reporter: Aljoscha Krettek Assignee: Aljoscha Krettek At least one user ran into a problem with this: http://stackoverflow.com/questions/30461809/in-flink-stream-windowing-does-not-seem-to-work Also, the implicit conversions from Java DataStreams to Scala Streams could be problematic. I think we should at least make them private to the flink package. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560614#comment-14560614 ] ASF GitHub Bot commented on FLINK-1319: --- Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/729#issuecomment-105816818 Looks very nice, seems to have a good test coverage as well. How well does it work with bytecode generated by the Scala Compiler?

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-1319][core] Add static code analysis fo...
Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/729#issuecomment-105816818 Looks very nice, seems to have a good test coverage as well. How well does it work with bytecode generated by the Scala Compiler?
[GitHub] flink pull request: basic TfidfTransformer
Github user aalexandrov commented on a diff in the pull request: https://github.com/apache/flink/pull/730#discussion_r31114073

--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/feature/TfIdfTransformer.scala ---
@@ -0,0 +1,107 @@
+package org.apache.flink.ml.feature
+
+import org.apache.flink.api.scala.{DataSet, _}
+import org.apache.flink.ml.common.{Parameter, ParameterMap, Transformer}
+import org.apache.flink.ml.math.SparseVector
+
+import scala.math.log
+import scala.util.hashing.MurmurHash3
+import scala.collection.mutable.LinkedHashSet
+
+/**
+ * This transformer calculates the term frequency times inverse document frequency for the given DataSet of documents.
+ * The DataSet will be treated as the corpus of documents.
+ * <p>
+ * The TF is the frequency of a word inside one document.
+ * <p>
+ * The IDF for a word is calculated as: log(total number of documents / documents that contain the word) + 1
+ * <p>
+ * This transformer returns a SparseVector where the index is the hash of the word and the value is the tf-idf.
+ * @author Ronny Bräunlich
+ * @author Vassil Dimov
+ * @author Filip Perisic
+ */
+class TfIdfTransformer extends Transformer[(Int, Seq[String]), (Int, SparseVector)] {
+
+  override def transform(input: DataSet[(Int /* docId */, Seq[String] /* the document */)], transformParameters: ParameterMap): DataSet[(Int, SparseVector)] = {
+
+    val params = transformParameters.get(StopWordParameter)
+
+    // Here we will store the words in the form (docId, word, count).
+    // Count represents the occurrences of word in document docId.
+    val wordCounts = input
+      // count the words
+      .flatMap(t => {
+        // create tuples (docId, word, 1)
+        t._2.map(s => (t._1, s, 1))
+      })
+      .filter(t => !transformParameters.apply(StopWordParameter).contains(t._2))
+      // group by document and word
+      .groupBy(0, 1)
+      // calculate the occurrence count of each word in a specific document
+      .sum(2)
+
+    val dictionary = wordCounts
+      .map(t => LinkedHashSet(t._2))
+      .reduce((set1, set2) => set1 ++ set2)
+      .map(set => set.zipWithIndex.toMap)
+      .flatMap(m => m.toList)
+
+    val numberOfWords = wordCounts
+      .map(t => (t._2))
+      .distinct(t => t)
+      .map(t => 1)
+      .reduce(_ + _)
+
+    val idf: DataSet[(String, Double)] = calculateIDF(wordCounts)
+    val tf: DataSet[(Int, String, Int)] = wordCounts
+
+    // docId, word, tfIdf
+    val tfIdf = tf.join(idf).where(1).equalTo(0) {
+      (t1, t2) => (t1._1, t1._2, t1._3.toDouble * t2._2)
+    }
+
+    val res = tfIdf.crossWithTiny(numberOfWords)
+      // docId, word, tfIdf, numberOfWords
+      .map(t => (t._1._1, t._1._2, t._1._3, t._2))
+      // assign every word its position
+      .joinWithHuge(dictionary).where(1).equalTo(0)
--- End diff --

Use `.joinWithTiny` or `.join` or a broadcast variable.
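The reviewer's suggestion (`.joinWithTiny`, `.join`, or a broadcast variable) boils down to shipping the small dictionary to every worker and probing it locally, instead of shuffling the large word-count dataset as `.joinWithHuge` would. A plain-Java sketch of that map-side-join idea (illustrative names, no Flink dependency):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MapSideJoinSketch {
    // Every worker holds the full (small) dictionary in memory and probes it
    // locally while streaming over the large word list. No shuffle of the
    // large side is needed, which is the point of joinWithTiny / broadcast
    // variables when one input is tiny.
    static List<Integer> assignPositions(List<String> words, Map<String, Integer> dict) {
        List<Integer> out = new ArrayList<>();
        for (String w : words) {
            Integer pos = dict.get(w);   // local hash probe
            if (pos != null) {
                out.add(pos);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> dict = Map.of("flink", 0, "streaming", 1);
        System.out.println(assignPositions(List.of("flink", "spark", "streaming"), dict));
    }
}
```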
[GitHub] flink pull request: [core] cleanup tests for FileInputFormat
GitHub user mxm opened a pull request: https://github.com/apache/flink/pull/732 [core] cleanup tests for FileInputFormat, a follow-up of f2891ab857e00bc70eb025bb430f46f4f58355a5. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mxm/flink FileInputFormat Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/732.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #732 commit bf7a90f52aac59751ca325ab3f08cac3585b1ad7 Author: Maximilian Michels m...@apache.org Date: 2015-05-26T17:59:23Z [core] cleanup tests for FileInputFormat followup of f2891ab857e00bc70eb025bb430f46f4f58355a5
[GitHub] flink pull request: [FLINK-2004] Fix memory leak in presence of fa...
Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/674#issuecomment-105846281 What is the status of this pull request? Any pending changes, or is it blocked by anything?
[jira] [Commented] (FLINK-2004) Memory leak in presence of failed checkpoints in KafkaSource
[ https://issues.apache.org/jira/browse/FLINK-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560713#comment-14560713 ] ASF GitHub Bot commented on FLINK-2004: --- Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/674#issuecomment-105846281 What is the status of this pull request? Any pending changes, or is it blocked by anything? Memory leak in presence of failed checkpoints in KafkaSource Key: FLINK-2004 URL: https://issues.apache.org/jira/browse/FLINK-2004 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Robert Metzger Priority: Critical Fix For: 0.9 Checkpoints that fail never send a commit message to the tasks. Maintaining a map of all pending checkpoints introduces a memory leak, as entries for failed checkpoints will never be removed. Approaches to fix this: - The source cleans up entries from older checkpoints once a checkpoint is committed (simple implementation in a linked hash map) - The commit message could include the optional state handle (source needs not maintain the map) - The checkpoint coordinator could send messages for failed checkpoints? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
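The first approach listed in the issue, cleaning up entries of older checkpoints once a newer one is committed, can be sketched with a `LinkedHashMap`. This is a sketch under the assumption that checkpoint ids are inserted in increasing order, not Flink's actual implementation:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;

public class PendingCheckpointsSketch {
    // Checkpoint ids arrive in increasing order, so the LinkedHashMap
    // iterates oldest-first. When checkpoint N is committed, every entry
    // with id <= N is dropped, including entries left behind by checkpoints
    // that failed and will never receive a commit message.
    private final LinkedHashMap<Long, String> pending = new LinkedHashMap<>();

    public void onCheckpointStarted(long id, String stateHandle) {
        pending.put(id, stateHandle);
    }

    public void onCheckpointCommitted(long id) {
        Iterator<Long> it = pending.keySet().iterator();
        while (it.hasNext()) {
            if (it.next() <= id) {
                it.remove();   // frees committed and stale/failed entries
            } else {
                break;         // insertion order: the rest is strictly newer
            }
        }
    }

    public int size() {
        return pending.size();
    }

    public static void main(String[] args) {
        PendingCheckpointsSketch s = new PendingCheckpointsSketch();
        s.onCheckpointStarted(1, "handle-1");  // suppose this checkpoint fails
        s.onCheckpointStarted(2, "handle-2");
        s.onCheckpointCommitted(2);            // drops both 1 and 2
        System.out.println(s.size());
    }
}
```

This matches the issue's observation: without such cleanup, the failed checkpoint's entry would stay in the map forever.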
[GitHub] flink pull request: Implemented TwitterSourceFilter and adapted Tw...
Github user HilmiYildirim commented on the pull request: https://github.com/apache/flink/pull/695#issuecomment-105884995 Done. I hope everything works fine.
[jira] [Assigned] (FLINK-687) Add support for outer-joins
[ https://issues.apache.org/jira/browse/FLINK-687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chiwan Park reassigned FLINK-687: - Assignee: Chiwan Park Add support for outer-joins --- Key: FLINK-687 URL: https://issues.apache.org/jira/browse/FLINK-687 Project: Flink Issue Type: New Feature Reporter: GitHub Import Assignee: Chiwan Park Priority: Minor Labels: github-import Fix For: pre-apache There are three types of outer-joins: - left outer, - right outer, and - full outer joins. An outer-join does not filter tuples of the outer-side that do not find a matching tuple on the other side. Instead, it is joined with a NULL value. Supporting outer-joins requires some modifications in the join execution strategies. Imported from GitHub Url: https://github.com/stratosphere/stratosphere/issues/687 Created by: [fhueske|https://github.com/fhueske] Labels: enhancement, java api, runtime, Created at: Mon Apr 14 12:09:00 CEST 2014 State: open -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-687) Add support for outer-joins
[ https://issues.apache.org/jira/browse/FLINK-687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560864#comment-14560864 ] Chiwan Park commented on FLINK-687: --- Hi. I want to contribute a patch for this issue. I am looking into the current equi-join implementation. I think implementing outer joins requires two pieces of work: one is adding the APIs to DataSet (called {{leftOuterJoin}}, {{rightOuterJoin}}, {{fullOuterJoin}}) and the other is updating the classes derived from {{JoinTaskIterator}}. Is there anything I should be careful about?

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
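The outer-join semantics described in the issue, where an unmatched tuple on the outer side is joined with NULL instead of being filtered out, can be sketched as a hash-based left outer join in plain Java (illustrative only; the real work involves Flink's join execution strategies):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LeftOuterJoinSketch {
    // Hash-based left outer join on integer keys: build a hash table from
    // the right side, probe with the left side, and pad probe misses with
    // "null" instead of dropping the left tuple. The null padding is the
    // only difference to an inner join here.
    static List<String> leftOuterJoin(List<int[]> left, List<int[]> right) {
        Map<Integer, Integer> buildSide = new HashMap<>();
        for (int[] r : right) {
            buildSide.put(r[0], r[1]);
        }
        List<String> joined = new ArrayList<>();
        for (int[] l : left) {
            Integer match = buildSide.get(l[0]);
            joined.add(l[0] + ":" + l[1] + "," + (match == null ? "null" : match));
        }
        return joined;
    }

    public static void main(String[] args) {
        List<int[]> left = List.of(new int[]{1, 10}, new int[]{2, 20});
        List<int[]> right = List.of(new int[]{1, 100});
        System.out.println(leftOuterJoin(left, right)); // key 2 is padded, not dropped
    }
}
```

A right outer join swaps the roles of the two sides; a full outer join additionally emits build-side entries that were never probed.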
[GitHub] flink pull request: [FLINK-2073] [ml] [docs] Adds contribution gui...
Github user tillrohrmann commented on the pull request: https://github.com/apache/flink/pull/727#issuecomment-105891978 Will merge it then.
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560921#comment-14560921 ] ASF GitHub Bot commented on FLINK-1319: --- Github user twalthr commented on the pull request: https://github.com/apache/flink/pull/729#issuecomment-105897738 The bytecode generated by the Scala compiler is the same; that is not a problem for the analyzer. But Scala is not fully supported yet, because of the different Java/Scala tuples (fields starting with _ instead of f, etc.). I will add support for that in the near future. If any exceptions are thrown, they are only visible in the debug log, so we could merge it into the 0.9 release without any disadvantages. It builds now ;)

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (FLINK-2061) CSVReader: quotedStringParsing and includeFields yields ParseException
[ https://issues.apache.org/jira/browse/FLINK-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chiwan Park reassigned FLINK-2061: -- Assignee: Chiwan Park

CSVReader: quotedStringParsing and includeFields yields ParseException
-
Key: FLINK-2061 URL: https://issues.apache.org/jira/browse/FLINK-2061 Project: Flink Issue Type: Bug Components: Core Affects Versions: 0.9 Reporter: Fabian Hueske Assignee: Chiwan Park

Fields in a CSV file with quoted Strings cannot be skipped. Parsing a line such as:

{code}
20:41:52-1-3-2015|Re: Taskmanager memory error in Eclipse|Stephan Ewen se...@apache.org|bla|blubb
{code}

with a CSVReader configured as:

{code}
DataSet<Tuple2<String, String>> data = env.readCsvFile("/path/to/my/data")
    .lineDelimiter("\n")
    .fieldDelimiter("|")
    .parseQuotedStrings('"')
    .includeFields("101")
    .types(String.class, String.class);
{code}

gives a {{ParseException}}.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2061) CSVReader: quotedStringParsing and includeFields yields ParseException
[ https://issues.apache.org/jira/browse/FLINK-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560962#comment-14560962 ] ASF GitHub Bot commented on FLINK-2061: --- GitHub user chiwanpark opened a pull request: https://github.com/apache/flink/pull/734 [FLINK-2061] CSVReader: quotedStringParsing and includeFields yields ParseException Fixes the bug in `GenericCsvInputFormat` when a skipped field is a quoted string. I also added a unit test for this case. You can merge this pull request into a Git repository by running: $ git pull https://github.com/chiwanpark/flink FLINK-2061 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/734.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #734 commit 99fb79beda88c73d80e630aa5e22e9ee401538ed Author: Chiwan Park chiwanp...@icloud.com Date: 2015-05-27T13:24:59Z [FLINK-2061] [java api] Fix GenericCsvInputFormat skipping fields error with quoted string

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
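The root cause of this class of bug is that a skipped field must still be scanned with quote handling; otherwise a field delimiter inside quotes is mistaken for a field boundary and all following fields shift. A hypothetical splitter sketch in plain Java ('|' delimiter, '"' quote character; not the actual `GenericCsvInputFormat` code):

```java
import java.util.ArrayList;
import java.util.List;

public class QuotedFieldSkipSketch {
    // Splits a line into fields, honoring quotes even for fields that the
    // include mask drops. Skipped fields are scanned (to find the real
    // field boundary) but never added to the result.
    static List<String> parse(String line, boolean[] include) {
        List<String> out = new ArrayList<>();
        int i = 0;
        for (int field = 0; field < include.length; field++) {
            StringBuilder sb = new StringBuilder();
            boolean quoted = i < line.length() && line.charAt(i) == '"';
            if (quoted) {
                i++;  // consume the opening quote
            }
            while (i < line.length()) {
                char c = line.charAt(i++);
                if (quoted && c == '"') { quoted = false; continue; }  // closing quote
                if (!quoted && c == '|') break;                        // field boundary
                sb.append(c);
            }
            if (include[field]) {
                out.add(sb.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // the middle field is quoted, contains the delimiter, and is skipped
        System.out.println(parse("a|\"b|c\"|d", new boolean[] {true, false, true}));
    }
}
```

If the skipped field were scanned without quote handling, the embedded '|' would end the field early and 'd' would never be reached correctly.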
[GitHub] flink pull request: [FLINK-1874] [streaming] Connector breakup
Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/719#issuecomment-105914769 I'm not so sure about the module names. The parent is called `flink-streaming-connectors-parent`. The modules themselves are called:

```diff
+  <modules>
+    <module>flink-flume-connector</module>
+    <module>flink-kafka-connector</module>
+    <module>flink-rabbitmq-connector</module>
+    <module>flink-twitter-connector</module>
+  </modules>
```

It would be more logical to call them `flink-streaming-connectors-flume`, `flink-streaming-connectors-kafka`, etc. Including `streaming` in the name would make it clear that these connectors are not usable with the batch system. For `kafka` we might want to provide a batch source at some point (we do already through the Kafka Hadoop input format). I think I'm slightly in favor of calling them `flink-streaming-connectors-flume`. Any other opinions?
[jira] [Created] (FLINK-2100) Add ITCases for all Table API examples
Timo Walther created FLINK-2100: --- Summary: Add ITCases for all Table API examples Key: FLINK-2100 URL: https://issues.apache.org/jira/browse/FLINK-2100 Project: Flink Issue Type: New Feature Components: Table API Reporter: Timo Walther Priority: Minor Not all examples in the Table API are tested with ITCases. They should be added. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1874) Break up streaming connectors into submodules
[ https://issues.apache.org/jira/browse/FLINK-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560975#comment-14560975 ] ASF GitHub Bot commented on FLINK-1874: --- Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/719#issuecomment-105914769 I'm not so sure about the module names. The parent is called `flink-streaming-connectors-parent`. The modules themselves are called:

```diff
+  <modules>
+    <module>flink-flume-connector</module>
+    <module>flink-kafka-connector</module>
+    <module>flink-rabbitmq-connector</module>
+    <module>flink-twitter-connector</module>
+  </modules>
```

It would be more logical to call them `flink-streaming-connectors-flume`, `flink-streaming-connectors-kafka`, etc. Including `streaming` in the name would make it clear that these connectors are not usable with the batch system. For `kafka` we might want to provide a batch source at some point (we do already through the Kafka Hadoop input format). I think I'm slightly in favor of calling them `flink-streaming-connectors-flume`. Any other opinions?

Break up streaming connectors into submodules
-
Key: FLINK-1874 URL: https://issues.apache.org/jira/browse/FLINK-1874 Project: Flink Issue Type: Task Components: Build System, Streaming Affects Versions: 0.9 Reporter: Robert Metzger Assignee: Márton Balassi Labels: starter

As per: http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Break-up-streaming-connectors-into-subprojects-td5001.html

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-2100) Add ITCases for all Table API examples
[ https://issues.apache.org/jira/browse/FLINK-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timo Walther updated FLINK-2100: Labels: starter (was: ) Add ITCases for all Table API examples -- Key: FLINK-2100 URL: https://issues.apache.org/jira/browse/FLINK-2100 Project: Flink Issue Type: Test Components: Table API Reporter: Timo Walther Priority: Minor Labels: starter Not all examples in the Table API are tested with ITCases. They should be added. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-2100) Add ITCases for all Table API examples
[ https://issues.apache.org/jira/browse/FLINK-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timo Walther updated FLINK-2100: Issue Type: Test (was: New Feature) Add ITCases for all Table API examples -- Key: FLINK-2100 URL: https://issues.apache.org/jira/browse/FLINK-2100 Project: Flink Issue Type: Test Components: Table API Reporter: Timo Walther Priority: Minor Labels: starter Not all examples in the Table API are tested with ITCases. They should be added. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-1905) Add tests for parallelism 1 for Kafka sources
[ https://issues.apache.org/jira/browse/FLINK-1905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Metzger resolved FLINK-1905. --- Resolution: Fixed Resolved with the recently added tests. Add tests for parallelism 1 for Kafka sources --- Key: FLINK-1905 URL: https://issues.apache.org/jira/browse/FLINK-1905 Project: Flink Issue Type: Sub-task Components: Kafka Connector, Streaming Affects Versions: 0.9 Reporter: Robert Metzger Assignee: Robert Metzger Also, test with multiple partitions -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-687) Add support for outer-joins
[ https://issues.apache.org/jira/browse/FLINK-687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561009#comment-14561009 ] Chiwan Park commented on FLINK-687: --- Yes, splitting this issue into two sub-issues is reasonable. :) I think the hash-based outer join algorithm should be implemented as a class extending {{HashMatchIteratorBase}} and implementing {{JoinTaskIterator}}. Is that right? I have another question: there are many match iterators in the {{org.apache.flink.runtime.operators.hash}} package. I know the difference between {{First}} and {{Second}}, but I don't know the difference between {{Reusing}} and {{NonReusing}}, or the meaning of {{ReOpenable}}. Could you explain these keywords to me? If my approach is wrong, please advise me. Thank you :)

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
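To my understanding (an assumption worth confirming on the list): the `Reusing` variants deserialize every record into one shared mutable object to avoid allocations, the `NonReusing` variants create a fresh object per record, and `ReOpenable` iterators can be reset and re-run, which iterative jobs need. A plain-Java sketch of the first two contracts (illustrative classes, not the Flink iterators):

```java
import java.util.Iterator;

public class ReuseSketch {
    static class Record { int value; }

    // "Reusing" contract: every next() fills the same mutable instance.
    // Cheap on allocations, but callers must copy any record they want to
    // keep beyond the next call.
    static Iterator<Record> reusing(int[] values) {
        final Record shared = new Record();
        return new Iterator<Record>() {
            int i = 0;
            public boolean hasNext() { return i < values.length; }
            public Record next() { shared.value = values[i++]; return shared; }
        };
    }

    // "NonReusing" contract: a fresh object per element. Safe to store
    // references, at the cost of more garbage-collection pressure.
    static Iterator<Record> nonReusing(int[] values) {
        return new Iterator<Record>() {
            int i = 0;
            public boolean hasNext() { return i < values.length; }
            public Record next() { Record r = new Record(); r.value = values[i++]; return r; }
        };
    }

    public static void main(String[] args) {
        Iterator<Record> it = reusing(new int[] {1, 2});
        Record first = it.next();
        Record second = it.next();
        System.out.println(first == second);   // same instance, now holding 2
    }
}
```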
[jira] [Commented] (FLINK-1962) Add Gelly Scala API
[ https://issues.apache.org/jira/browse/FLINK-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560828#comment-14560828 ] Aljoscha Krettek commented on FLINK-1962: - I'm afraid I don't have a better solution than the one you came up with. As long as there is this difference between Scala Tuples and Java tuples the only other solution is a complete reimplementation. Add Gelly Scala API --- Key: FLINK-1962 URL: https://issues.apache.org/jira/browse/FLINK-1962 Project: Flink Issue Type: Task Components: Gelly, Scala API Affects Versions: 0.9 Reporter: Vasia Kalavri Assignee: PJ Van Aeken -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-1952] [jobmanager] Rework and fix slot ...
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/731#discussion_r31128102 --- Diff: flink-tests/src/test/resources/log4j.properties --- @@ -0,0 +1,32 @@ + +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# License); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an AS IS BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +# Set root logger level to OFF to not flood build logs +# set manually to INFO for debugging purposes +log4j.rootLogger=INFO, testlogger + +# A1 is set to be a ConsoleAppender. +log4j.appender.testlogger=org.apache.log4j.ConsoleAppender +log4j.appender.testlogger.target = System.err +log4j.appender.testlogger.layout=org.apache.log4j.PatternLayout +log4j.appender.testlogger.layout.ConversionPattern=%-4r [%t] %-5p %c %x - %m%n + +log4j.logger.org.apache.flink.runtime.client.JobClient=OFF + +# suppress the irrelevant (wrong) warnings from the netty channel handler +log4j.logger.org.jboss.netty.channel.DefaultChannelPipeline=ERROR, testlogger --- End diff -- Was this file added for debugging purposes? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-1952) Cannot run ConnectedComponents example: Could not allocate a slot on instance
[ https://issues.apache.org/jira/browse/FLINK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560877#comment-14560877 ] ASF GitHub Bot commented on FLINK-1952: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/731#discussion_r31128102 --- Diff: flink-tests/src/test/resources/log4j.properties --- @@ -0,0 +1,32 @@ + +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# License); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an AS IS BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +# Set root logger level to OFF to not flood build logs +# set manually to INFO for debugging purposes +log4j.rootLogger=INFO, testlogger + +# A1 is set to be a ConsoleAppender. +log4j.appender.testlogger=org.apache.log4j.ConsoleAppender +log4j.appender.testlogger.target = System.err +log4j.appender.testlogger.layout=org.apache.log4j.PatternLayout +log4j.appender.testlogger.layout.ConversionPattern=%-4r [%t] %-5p %c %x - %m%n + +log4j.logger.org.apache.flink.runtime.client.JobClient=OFF + +# suppress the irrelevant (wrong) warnings from the netty channel handler +log4j.logger.org.jboss.netty.channel.DefaultChannelPipeline=ERROR, testlogger --- End diff -- Was this file added for debugging purposes? 
Cannot run ConnectedComponents example: Could not allocate a slot on instance - Key: FLINK-1952 URL: https://issues.apache.org/jira/browse/FLINK-1952 Project: Flink Issue Type: Bug Components: Scheduler Affects Versions: 0.9 Reporter: Robert Metzger Priority: Blocker Steps to reproduce {code} ./bin/yarn-session.sh -n 350 {code} ... wait until they are connected ... {code} Number of connected TaskManagers changed to 266. Slots available: 266 Number of connected TaskManagers changed to 323. Slots available: 323 Number of connected TaskManagers changed to 334. Slots available: 334 Number of connected TaskManagers changed to 343. Slots available: 343 Number of connected TaskManagers changed to 350. Slots available: 350 {code} Start CC {code} ./bin/flink run -p 350 ./examples/flink-java-examples-0.9-SNAPSHOT-ConnectedComponents.jar {code} --- it runs Run KMeans, let it fail with {code} Failed to deploy the task Map (Map at main(KMeans.java:100)) (1/350) - execution #0 to slot SimpleSlot (2)(2)(0) - 182b7661ca9547a84591de940c47a200 - ALLOCATED/ALIVE: java.io.IOException: Insufficient number of network buffers: required 350, but only 254 available. The total number of network buffers is currently set to 2048. You can increase this number by setting the configuration key 'taskmanager.network.numberOfBuffers'. {code} ... as expected. 
(I've waited for 10 minutes between the two submissions) Starting CC now will fail: {code} ./bin/flink run -p 350 ./examples/flink-java-examples-0.9-SNAPSHOT-ConnectedComponents.jar {code} Error message(s): {code} Caused by: java.lang.IllegalStateException: Could not schedule consumer vertex IterationHead(WorksetIteration (Unnamed Delta Iteration)) (19/350) at org.apache.flink.runtime.executiongraph.Execution$3.call(Execution.java:479) at org.apache.flink.runtime.executiongraph.Execution$3.call(Execution.java:469) at akka.dispatch.Futures$$anonfun$future$1.apply(Future.scala:94) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107) ... 4 more Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate a slot on instance 4a6d761cb084c32310ece1f849556faf @ cloud-19 - 1 slots - URL:
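The "Insufficient number of network buffers" failure above points directly at the configuration key named in the error message. A hedged sketch of the change in {{conf/flink-conf.yaml}} follows; the value 4096 is only an illustration, since the right number depends on the job parallelism, the slots per TaskManager, and how many shuffles run concurrently:

```yaml
# conf/flink-conf.yaml
# The error reported "required 350, but only 254 available" against the
# default pool of 2048 buffers. Raising the pool size avoids the deploy
# failure; the exact value here (4096) is an example, not a recommendation.
taskmanager.network.numberOfBuffers: 4096
```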
[jira] [Commented] (FLINK-2003) Building on some encrypted filesystems leads to File name too long error
[ https://issues.apache.org/jira/browse/FLINK-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560871#comment-14560871 ] ASF GitHub Bot commented on FLINK-2003: --- Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/690#issuecomment-105890348 OK, I'll close this PR and open up another one with the doc fixes. Building on some encrypted filesystems leads to File name too long error -- Key: FLINK-2003 URL: https://issues.apache.org/jira/browse/FLINK-2003 Project: Flink Issue Type: Bug Components: Build System Reporter: Theodore Vasiloudis Priority: Minor Labels: build, starter The classnames generated by the build system can be too long. Creating very long filenames is not possible on some encrypted filesystems, including encfs, which is what Ubuntu uses. This is the same as this [Spark issue|https://issues.apache.org/jira/browse/SPARK-4820] The workaround (taken from the linked issue) is to add in Maven under the compile options: {code} + <arg>-Xmax-classfile-name</arg> + <arg>128</arg> {code} And in SBT add: {code} +scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"), {code}
[jira] [Commented] (FLINK-2003) Building on some encrypted filesystems leads to File name too long error
[ https://issues.apache.org/jira/browse/FLINK-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560872#comment-14560872 ] ASF GitHub Bot commented on FLINK-2003: --- Github user thvasilo closed the pull request at: https://github.com/apache/flink/pull/690 Building on some encrypted filesystems leads to File name too long error -- Key: FLINK-2003 URL: https://issues.apache.org/jira/browse/FLINK-2003 Project: Flink Issue Type: Bug Components: Build System Reporter: Theodore Vasiloudis Priority: Minor Labels: build, starter The classnames generated by the build system can be too long. Creating very long filenames is not possible on some encrypted filesystems, including encfs, which is what Ubuntu uses. This is the same as this [Spark issue|https://issues.apache.org/jira/browse/SPARK-4820] The workaround (taken from the linked issue) is to add in Maven under the compile options: {code} + <arg>-Xmax-classfile-name</arg> + <arg>128</arg> {code} And in SBT add: {code} +scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"), {code}
[GitHub] flink pull request: [FLINK-2003] Building on some encrypted filesy...
Github user thvasilo closed the pull request at: https://github.com/apache/flink/pull/690
[GitHub] flink pull request: [FLINK-2003] Building on some encrypted filesy...
Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/690#issuecomment-105890348 OK, I'll close this PR and open up another one with the doc fixes.
[jira] [Commented] (FLINK-687) Add support for outer-joins
[ https://issues.apache.org/jira/browse/FLINK-687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560892#comment-14560892 ] Fabian Hueske commented on FLINK-687: - Hi [~chiwanpark], this would be a great addition to Flink! Although this is not an easy thing, I am confident that we can do it :-) Adding outer joins touches many parts of the system's core: - Join Algorithms - Execution strategies - Optimizer - API I would suggest splitting this issue into at least two sub-issues. 1. implement a HashJoin-based join algorithm for left/right outer joins (no full outer join yet) and back it by unit tests. 2. add outer joins to the API, go through the optimizer, and finally configure the correct runtime code. Maybe this issue can be broken down even further. What do you think? Add support for outer-joins --- Key: FLINK-687 URL: https://issues.apache.org/jira/browse/FLINK-687 Project: Flink Issue Type: New Feature Reporter: GitHub Import Assignee: Chiwan Park Priority: Minor Labels: github-import Fix For: pre-apache There are three types of outer-joins: - left outer, - right outer, and - full outer joins. An outer-join does not filter tuples of the outer side that do not find a matching tuple on the other side. Instead, it is joined with a NULL value. Supporting outer-joins requires some modifications in the join execution strategies. Imported from GitHub Url: https://github.com/stratosphere/stratosphere/issues/687 Created by: [fhueske|https://github.com/fhueske] Labels: enhancement, java api, runtime, Created at: Mon Apr 14 12:09:00 CEST 2014 State: open
[GitHub] flink pull request: [FLINK-1319][core] Add static code analysis fo...
Github user twalthr commented on the pull request: https://github.com/apache/flink/pull/729#issuecomment-105897738 The bytecode generated by the Scala compiler is the same. That is not a problem for the analyzer. But Scala is not fully supported yet, because of the different Java/Scala Tuples (fields starting with _ instead of f etc.). I will add support for that in the near future. If any exceptions are thrown, they are only visible in the debug log. So we could merge it into the 0.9 release without any disadvantages. It builds now ;)
[GitHub] flink pull request: [scheduling] implement backtracking of interme...
Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/640#discussion_r31130185 --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/ResultPartitionManager.java --- @@ -40,9 +44,18 @@ private static final Logger LOG = LoggerFactory.getLogger(ResultPartitionManager.class); + /** +* Table of running/finished ResultPartitions with corresponding IDs +*/ public final Table<ExecutionAttemptID, IntermediateResultPartitionID, ResultPartition> registeredPartitions = HashBasedTable.create(); + /** +* Cached ResultPartitions which are used to resume/recover from --- End diff -- Let's add a comment that the LinkedHashMap implements an LRU policy. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well.
[GitHub] flink pull request: [scheduling] implement backtracking of interme...
Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/640#discussion_r31136509 --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ExecutionGraph.java --- @@ -264,21 +266,21 @@ public void attachJobGraph(List<AbstractJobVertex> topologiallySorted) throws Jo LOG.debug(String.format(Attaching %d topologically sorted vertices to existing job graph with %d --- End diff -- We have to check that the parallelism of both vertices is the same, because the runtime results are already partitioned. I think a good error msg is the best solution at this point.
[jira] [Commented] (FLINK-1952) Cannot run ConnectedComponents example: Could not allocate a slot on instance
[ https://issues.apache.org/jira/browse/FLINK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560860#comment-14560860 ] ASF GitHub Bot commented on FLINK-1952: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/731#discussion_r31127537 --- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/instance/SharedSlotsTest.java --- @@ -0,0 +1,665 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.flink.runtime.instance; + +import org.apache.flink.api.common.JobID; +import org.apache.flink.runtime.jobgraph.AbstractJobVertex; +import org.apache.flink.runtime.jobgraph.JobVertexID; +import org.apache.flink.runtime.jobmanager.scheduler.CoLocationConstraint; +import org.apache.flink.runtime.jobmanager.scheduler.CoLocationGroup; +import org.apache.flink.runtime.jobmanager.scheduler.Locality; +import org.apache.flink.runtime.jobmanager.scheduler.SchedulerTestUtils; +import org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroup; +import org.apache.flink.util.AbstractID; + +import org.junit.Test; + +import java.util.Collections; + +import static org.junit.Assert.*; + +/** + * Tests for the allocation, properties, and release of shared slots. + */ +public class SharedSlotsTest { + + @Test + public void allocateAndReleaseEmptySlot() { + try { + JobID jobId = new JobID(); + JobVertexID vertexId = new JobVertexID(); + + SlotSharingGroup sharingGroup = new SlotSharingGroup(vertexId); + SlotSharingGroupAssignment assignment = sharingGroup.getTaskAssignment(); + + assertEquals(0, assignment.getNumberOfSlots()); + assertEquals(0, assignment.getNumberOfAvailableSlotsForGroup(vertexId)); + + Instance instance = SchedulerTestUtils.getRandomInstance(2); + + assertEquals(2, instance.getTotalNumberOfSlots()); + assertEquals(0, instance.getNumberOfAllocatedSlots()); + assertEquals(2, instance.getNumberOfAvailableSlots()); + + // allocate a shared slot + SharedSlot slot = instance.allocateSharedSlot(jobId, assignment); + assertEquals(2, instance.getTotalNumberOfSlots()); + assertEquals(1, instance.getNumberOfAllocatedSlots()); + assertEquals(1, instance.getNumberOfAvailableSlots()); + + // check that the new slot is fresh + assertTrue(slot.isAlive()); + assertFalse(slot.isCanceled()); + assertFalse(slot.isReleased()); + assertEquals(0, slot.getNumberLeaves()); + assertFalse(slot.hasChildren()); + assertTrue(slot.isRootAndEmpty()); + 
assertNotNull(slot.toString()); + assertTrue(slot.getSubSlots().isEmpty()); + assertEquals(0, slot.getSlotNumber()); + assertEquals(0, slot.getRootSlotNumber()); + + // release the slot immediately. + slot.releaseSlot(); + + assertTrue(slot.isCanceled()); + assertTrue(slot.isReleased()); + + // the slot sharing group and instance should not + assertEquals(2, instance.getTotalNumberOfSlots()); + assertEquals(0, instance.getNumberOfAllocatedSlots()); + assertEquals(2, instance.getNumberOfAvailableSlots()); + +
[GitHub] flink pull request: [FLINK-1952] [jobmanager] Rework and fix slot ...
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/731#discussion_r31127537 --- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/instance/SharedSlotsTest.java --- @@ -0,0 +1,665 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.runtime.instance; + +import org.apache.flink.api.common.JobID; +import org.apache.flink.runtime.jobgraph.AbstractJobVertex; +import org.apache.flink.runtime.jobgraph.JobVertexID; +import org.apache.flink.runtime.jobmanager.scheduler.CoLocationConstraint; +import org.apache.flink.runtime.jobmanager.scheduler.CoLocationGroup; +import org.apache.flink.runtime.jobmanager.scheduler.Locality; +import org.apache.flink.runtime.jobmanager.scheduler.SchedulerTestUtils; +import org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroup; +import org.apache.flink.util.AbstractID; + +import org.junit.Test; + +import java.util.Collections; + +import static org.junit.Assert.*; + +/** + * Tests for the allocation, properties, and release of shared slots. 
+ */ +public class SharedSlotsTest { + + @Test + public void allocateAndReleaseEmptySlot() { + try { + JobID jobId = new JobID(); + JobVertexID vertexId = new JobVertexID(); + + SlotSharingGroup sharingGroup = new SlotSharingGroup(vertexId); + SlotSharingGroupAssignment assignment = sharingGroup.getTaskAssignment(); + + assertEquals(0, assignment.getNumberOfSlots()); + assertEquals(0, assignment.getNumberOfAvailableSlotsForGroup(vertexId)); + + Instance instance = SchedulerTestUtils.getRandomInstance(2); + + assertEquals(2, instance.getTotalNumberOfSlots()); + assertEquals(0, instance.getNumberOfAllocatedSlots()); + assertEquals(2, instance.getNumberOfAvailableSlots()); + + // allocate a shared slot + SharedSlot slot = instance.allocateSharedSlot(jobId, assignment); + assertEquals(2, instance.getTotalNumberOfSlots()); + assertEquals(1, instance.getNumberOfAllocatedSlots()); + assertEquals(1, instance.getNumberOfAvailableSlots()); + + // check that the new slot is fresh + assertTrue(slot.isAlive()); + assertFalse(slot.isCanceled()); + assertFalse(slot.isReleased()); + assertEquals(0, slot.getNumberLeaves()); + assertFalse(slot.hasChildren()); + assertTrue(slot.isRootAndEmpty()); + assertNotNull(slot.toString()); + assertTrue(slot.getSubSlots().isEmpty()); + assertEquals(0, slot.getSlotNumber()); + assertEquals(0, slot.getRootSlotNumber()); + + // release the slot immediately. + slot.releaseSlot(); + + assertTrue(slot.isCanceled()); + assertTrue(slot.isReleased()); + + // the slot sharing group and instance should not + assertEquals(2, instance.getTotalNumberOfSlots()); + assertEquals(0, instance.getNumberOfAllocatedSlots()); + assertEquals(2, instance.getNumberOfAvailableSlots()); + + assertEquals(0, assignment.getNumberOfSlots()); + assertEquals(0, assignment.getNumberOfAvailableSlotsForGroup(vertexId)); + + // we should not be able to allocate any children from this released
[jira] [Commented] (FLINK-2098) Checkpoint barrier initiation at source is not aligned with snapshotting
[ https://issues.apache.org/jira/browse/FLINK-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560901#comment-14560901 ] Aljoscha Krettek commented on FLINK-2098: - Is this the reason for the non-deterministic test cases we observed? Checkpoint barrier initiation at source is not aligned with snapshotting Key: FLINK-2098 URL: https://issues.apache.org/jira/browse/FLINK-2098 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Priority: Blocker Fix For: 0.9 The stream source does not properly align the emission of checkpoint barriers with the drawing of snapshots. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [hotfix] Remove execute() after print() in Tab...
Github user twalthr commented on the pull request: https://github.com/apache/flink/pull/735#issuecomment-105913011 I will create an issue and assign it to me. I think I will heavily work on the Table API anyway the next time ;)
[GitHub] flink pull request: [scheduling] implement backtracking of interme...
Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/640#discussion_r31136136 --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/ResultPartitionManager.java --- @@ -119,10 +153,50 @@ public void shutdown() { } } + /** +* Registers and pins a cached ResultPartition that holds the data for an IntermediateResultPartition. +* @param partitionID The IntermediateResultPartitionID to find a corresponding ResultPartition for. +* @param numConsumers The number of consumers that want to access the ResultPartition +* @return true if the registering/pinning succeeded, false otherwise. +*/ + public boolean pinCachedResultPartition(IntermediateResultPartitionID partitionID, int numConsumers) { + synchronized (cachedResultPartitions) { + ResultPartition resultPartition = cachedResultPartitions.get(partitionID); + if (resultPartition != null) { + try { + // update its least recently used value + updateIntermediateResultPartitionCache(resultPartition); + + synchronized (registeredPartitions) { + if (!registeredPartitions.containsValue(resultPartition)) { + LOG.debug("Registered previously cached ResultPartition {}.", resultPartition); + registerResultPartition(resultPartition); + } + } + + for (int i = 0; i < numConsumers; i++) { + resultPartition.pin(); --- End diff -- Pinning a result partition increases the pending references count by the total number of subpartitions. We need to adjust it to just increment it for each call.
[jira] [Assigned] (FLINK-2072) Add a quickstart guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Theodore Vasiloudis reassigned FLINK-2072: -- Assignee: Theodore Vasiloudis Add a quickstart guide for FlinkML -- Key: FLINK-2072 URL: https://issues.apache.org/jira/browse/FLINK-2072 Project: Flink Issue Type: New Feature Components: Documentation, Machine Learning Library Reporter: Theodore Vasiloudis Assignee: Theodore Vasiloudis Fix For: 0.9 We need a quickstart guide that introduces users to the core concepts of FlinkML to get them up and running quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (FLINK-2073) Add contribution guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann closed FLINK-2073. Resolution: Fixed Added with c77947e94767565cc4f947a4ba86a960258a052a Add contribution guide for FlinkML -- Key: FLINK-2073 URL: https://issues.apache.org/jira/browse/FLINK-2073 Project: Flink Issue Type: New Feature Components: Documentation, Machine Learning Library Reporter: Theodore Vasiloudis Assignee: Till Rohrmann Fix For: 0.9 We need a guide for contributions to FlinkML in order to encourage the extension of the library, and provide guidelines for developers. One thing that should be included is a step-by-step guide to create a transformer, or other Estimator -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [scheduling] implement backtracking of interme...
Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/640#discussion_r31130488 --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/ResultPartitionManager.java --- @@ -40,9 +44,18 @@ private static final Logger LOG = LoggerFactory.getLogger(ResultPartitionManager.class); + /** +* Table of running/finished ResultPartitions with corresponding IDs +*/ public final Table<ExecutionAttemptID, IntermediateResultPartitionID, ResultPartition> registeredPartitions = HashBasedTable.create(); + /** +* Cached ResultPartitions which are used to resume/recover from --- End diff -- There are two unused classes LRUCache and LRUCacheMap, which can be removed. I think this solution is fine. :)
[GitHub] flink pull request: [scheduling] implement backtracking of interme...
Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/640#discussion_r31131135 --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/taskmanager/TaskManager.scala --- @@ -334,6 +335,14 @@ extends Actor with ActorLogMessages with ActorSynchronousLogging { } } + /** + * Pin a ResultPartition corresponding to an IntermediateResultPartition + */ + case LockResultPartition(partitionID, numConsumers) => +val partitionManager: ResultPartitionManager = this.network.getPartitionManager --- End diff -- @StephanEwen after a disconnect from the JM, the TM will clear the network stack components and the partition manager might be null. Is it possible that a msg is received after the disconnect, i.e. do we have to check for null here?
[GitHub] flink pull request: [hotfix] Remove execute() after print() in Tab...
Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/735#issuecomment-105910200 I thought we have ITCases for our examples to ensure that they are working? Maybe we should add those to these examples as well
[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561024#comment-14561024 ] Theodore Vasiloudis commented on FLINK-2072: So for this I was thinking a guide similar to the one that [sklearn|http://scikit-learn.org/stable/tutorial/basic/tutorial.html] has. We would have a small intro of basic concepts, describe a basic classification problem, load a dataset using the SVM loading utilities, and perform the classification using SVM. As an extra step we can first do some feature scaling and then perform the classification using a pipeline. Ideally we would like the whole tutorial to be done interactively in the Scala shell. Add a quickstart guide for FlinkML -- Key: FLINK-2072 URL: https://issues.apache.org/jira/browse/FLINK-2072 Project: Flink Issue Type: New Feature Components: Documentation, Machine Learning Library Reporter: Theodore Vasiloudis Fix For: 0.9 We need a quickstart guide that introduces users to the core concepts of FlinkML to get them up and running quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2098) Checkpoint barrier initiation at source is not aligned with snapshotting
Stephan Ewen created FLINK-2098: --- Summary: Checkpoint barrier initiation at source is not aligned with snapshotting Key: FLINK-2098 URL: https://issues.apache.org/jira/browse/FLINK-2098 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Priority: Blocker Fix For: 0.9 The stream source does not properly align the emission of checkpoint barriers with the drawing of snapshots. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2099) Add a SQL API
Timo Walther created FLINK-2099: --- Summary: Add a SQL API Key: FLINK-2099 URL: https://issues.apache.org/jira/browse/FLINK-2099 Project: Flink Issue Type: New Feature Components: Table API Reporter: Timo Walther Assignee: Timo Walther From the mailing list: Fabian: Flink's Table API is pretty close to what SQL provides. IMO, the best approach would be to leverage that and build a SQL parser (maybe together with a logical optimizer) on top of the Table API. Parser (and optimizer) could be built using Apache Calcite which is providing exactly this. Since the Table API is still a fairly new component and not very feature rich, it might make sense to extend and strengthen it before putting something major on top. Ted: It would also be relatively simple (I think) to retarget drill to Flink if Flink doesn't provide enough typing meta-data to do traditional SQL.
[jira] [Commented] (FLINK-1952) Cannot run ConnectedComponents example: Could not allocate a slot on instance
[ https://issues.apache.org/jira/browse/FLINK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560879#comment-14560879 ] ASF GitHub Bot commented on FLINK-1952: --- Github user tillrohrmann commented on the pull request: https://github.com/apache/flink/pull/731#issuecomment-105890755 LGTM. Cannot run ConnectedComponents example: Could not allocate a slot on instance - Key: FLINK-1952 URL: https://issues.apache.org/jira/browse/FLINK-1952 Project: Flink Issue Type: Bug Components: Scheduler Affects Versions: 0.9 Reporter: Robert Metzger Priority: Blocker Steps to reproduce {code} ./bin/yarn-session.sh -n 350 {code} ... wait until they are connected ... {code} Number of connected TaskManagers changed to 266. Slots available: 266 Number of connected TaskManagers changed to 323. Slots available: 323 Number of connected TaskManagers changed to 334. Slots available: 334 Number of connected TaskManagers changed to 343. Slots available: 343 Number of connected TaskManagers changed to 350. Slots available: 350 {code} Start CC {code} ./bin/flink run -p 350 ./examples/flink-java-examples-0.9-SNAPSHOT-ConnectedComponents.jar {code} --- it runs Run KMeans, let it fail with {code} Failed to deploy the task Map (Map at main(KMeans.java:100)) (1/350) - execution #0 to slot SimpleSlot (2)(2)(0) - 182b7661ca9547a84591de940c47a200 - ALLOCATED/ALIVE: java.io.IOException: Insufficient number of network buffers: required 350, but only 254 available. The total number of network buffers is currently set to 2048. You can increase this number by setting the configuration key 'taskmanager.network.numberOfBuffers'. {code} ... as expected. 
(I've waited for 10 minutes between the two submissions) Starting CC now will fail: {code} ./bin/flink run -p 350 ./examples/flink-java-examples-0.9-SNAPSHOT-ConnectedComponents.jar {code} Error message(s): {code} Caused by: java.lang.IllegalStateException: Could not schedule consumer vertex IterationHead(WorksetIteration (Unnamed Delta Iteration)) (19/350) at org.apache.flink.runtime.executiongraph.Execution$3.call(Execution.java:479) at org.apache.flink.runtime.executiongraph.Execution$3.call(Execution.java:469) at akka.dispatch.Futures$$anonfun$future$1.apply(Future.scala:94) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107) ... 4 more Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate a slot on instance 4a6d761cb084c32310ece1f849556faf @ cloud-19 - 1 slots - URL: akka.tcp://flink@130.149.21.23:51400/user/taskmanager, as required by the co-location constraint. at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:247) at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:110) at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:262) at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:436) at org.apache.flink.runtime.executiongraph.Execution$3.call(Execution.java:475) ... 9 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-1952] [jobmanager] Rework and fix slot ...
Github user tillrohrmann commented on the pull request: https://github.com/apache/flink/pull/731#issuecomment-105890755 LGTM.
[GitHub] flink pull request: [scheduling] implement backtracking of interme...
Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/640#discussion_r31131305 --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/ResultPartitionManager.java --- @@ -137,9 +211,46 @@ void onConsumedPartition(ResultPartition partition) { // Release the partition if it was successfully removed if (partition == previous) { - partition.release(); + // move to cache if cachable + updateIntermediateResultPartitionCache(partition); - LOG.debug("Released {}.", partition); + LOG.debug("Cached {}.", partition); --- End diff -- Let's adjust this log message to reflect that a partition might also be released after the update call.
[jira] [Commented] (FLINK-2047) Rename CoCoA to SVM
[ https://issues.apache.org/jira/browse/FLINK-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560988#comment-14560988 ] ASF GitHub Bot commented on FLINK-2047: --- Github user tillrohrmann commented on the pull request: https://github.com/apache/flink/pull/733#issuecomment-105918301 LGTM. Will merge it. Rename CoCoA to SVM --- Key: FLINK-2047 URL: https://issues.apache.org/jira/browse/FLINK-2047 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Theodore Vasiloudis Assignee: Theodore Vasiloudis Priority: Trivial Fix For: 0.9 The CoCoA algorithm as implemented functions as an SVM classifier. As CoCoA mostly concerns the optimization process and not the actual learning algorithm, it makes sense to rename the learner to SVM which users are more familiar with. In the future we would like to use the CoCoA algorithm to solve more large scale optimization problems for other learning algorithms.
[GitHub] flink pull request: [scheduling] implement backtracking of interme...
Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/640#discussion_r31134168 --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/ResultPartitionManager.java --- @@ -119,10 +153,50 @@ public void shutdown() { } } + /** +* Registers and pins a cached ResultPartition that holds the data for an IntermediateResultPartition. +* @param partitionID The IntermediateResultPartitionID to find a corresponding ResultPartition for. +* @param numConsumers The number of consumers that want to access the ResultPartition +* @return true if the registering/pinning succeeded, false otherwise. +*/ + public boolean pinCachedResultPartition(IntermediateResultPartitionID partitionID, int numConsumers) { + synchronized (cachedResultPartitions) { + ResultPartition resultPartition = cachedResultPartitions.get(partitionID); + if (resultPartition != null) { + try { + // update its least recently used value + updateIntermediateResultPartitionCache(resultPartition); + + synchronized (registeredPartitions) { + if (!registeredPartitions.containsValue(resultPartition)) { + LOG.debug("Registered previously cached ResultPartition {}.", resultPartition); + registerResultPartition(resultPartition); --- End diff -- I think there is a race between pinning and releasing. It could happen that a pinned result is removed by a concurrent release (in the onConsumed callback).
[jira] [Commented] (FLINK-2073) Add contribution guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560980#comment-14560980 ] ASF GitHub Bot commented on FLINK-2073: --- Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/727 Add contribution guide for FlinkML -- Key: FLINK-2073 URL: https://issues.apache.org/jira/browse/FLINK-2073 Project: Flink Issue Type: New Feature Components: Documentation, Machine Learning Library Reporter: Theodore Vasiloudis Assignee: Till Rohrmann Fix For: 0.9 We need a guide for contributions to FlinkML in order to encourage the extension of the library, and provide guidelines for developers. One thing that should be included is a step-by-step guide to create a transformer, or other Estimator
[jira] [Commented] (FLINK-2094) Implement Word2Vec
[ https://issues.apache.org/jira/browse/FLINK-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561006#comment-14561006 ] Johannes commented on FLINK-2094: - Some information below, as I looked into this algorithm before. There is a pretty performant implementation available for python https://radimrehurek.com/gensim/models/word2vec.html Which has a better annotated source code, as the original source code is very hard to read. Also along similar lines, there is an implementation within the nd4j framework in Java, which might be interesting to look at. http://deeplearning4j.org/word2vec.html There is also a very good Tutorial by the author of word2vec that describes how neural networks can be used http://www.coling-2014.org/COLING%202014%20Tutorial-fix%20-%20Tomas%20Mikolov.pdf Implement Word2Vec -- Key: FLINK-2094 URL: https://issues.apache.org/jira/browse/FLINK-2094 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Nikolaas Steenbergen Assignee: Nikolaas Steenbergen Priority: Minor implement Word2Vec http://arxiv.org/pdf/1402.3722v1.pdf
[GitHub] flink pull request: [FLINK-2061] CSVReader: quotedStringParsing an...
GitHub user chiwanpark opened a pull request: https://github.com/apache/flink/pull/734 [FLINK-2061] CSVReader: quotedStringParsing and includeFields yields ParseException Fix the bug in `GenericCsvInputFormat` when a skipped field is a quoted string. I also added a unit test for this case. You can merge this pull request into a Git repository by running: $ git pull https://github.com/chiwanpark/flink FLINK-2061 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/734.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #734 commit 99fb79beda88c73d80e630aa5e22e9ee401538ed Author: Chiwan Park chiwanp...@icloud.com Date: 2015-05-27T13:24:59Z [FLINK-2061] [java api] Fix GenericCsvInputFormat skipping fields error with quoted string
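The pitfall behind this fix is generic to CSV handling: even a field that will be skipped must be scanned quote-aware, because a naive delimiter scan mistakes delimiters inside quotes for field boundaries. The sketch below is plain Java illustrating that principle, not Flink's actual `GenericCsvInputFormat` code; the `tokenize` helper is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

public class QuotedCsv {
    // Naive splitting breaks on quoted fields: "a,\"b,c\",d".split(",")
    // yields 4 tokens instead of 3. A correct scan must track whether the
    // cursor is inside a quoted section, even for fields it plans to skip.
    static List<String> tokenize(String line, char delimiter, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == quote) {
                inQuotes = !inQuotes;            // toggle quoted state, drop the quote char
            } else if (c == delimiter && !inQuotes) {
                fields.add(current.toString());  // field boundary only outside quotes
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());          // last field has no trailing delimiter
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a,\"b,c\",d", ',', '"'));  // [a, b,c, d]
        System.out.println("a,\"b,c\",d".split(",").length);    // 4 (the naive way)
    }
}
```

If a reader skips fields by counting delimiters without tracking the quoted state, the comma inside `"b,c"` shifts every subsequent field by one, which matches the ParseException symptom described in the issue title.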
[GitHub] flink pull request: [hotfix] Remove execute() after print() in Tab...
Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/735#issuecomment-105910800 Sorry for the rude comment. Thank you for fixing the issue. Very good catch ;) Maybe we should file a JIRA for adding ITCases for these examples ;)
[GitHub] flink pull request: [FLINK-2047] [ml] Rename CoCoA to SVM
Github user tillrohrmann commented on the pull request: https://github.com/apache/flink/pull/733#issuecomment-105918301 LGTM. Will merge it.
[GitHub] flink pull request: [FLINK-2004] Fix memory leak in presence of fa...
Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/674#issuecomment-105926083 Sorry for not reacting. I was busy with other stuff but I'll now address your comments.
[jira] [Commented] (FLINK-2073) Add contribution guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560885#comment-14560885 ] ASF GitHub Bot commented on FLINK-2073: --- Github user tillrohrmann commented on the pull request: https://github.com/apache/flink/pull/727#issuecomment-105891978 Will merge it then.
[GitHub] flink pull request: [hotfix] Remove execute() after print() in Tab...
GitHub user twalthr opened a pull request: https://github.com/apache/flink/pull/735 [hotfix] Remove execute() after print() in Table API examples You can merge this pull request into a Git repository by running: $ git pull https://github.com/twalthr/flink tableApiExampleFix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/735.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #735 commit e43cf3f7d92dc4e9d80850e3a80501c4a6e3773e Author: twalthr twal...@apache.org Date: 2015-05-27T13:32:11Z [hotfix] Remove execute() after print() in Table API examples
[jira] [Commented] (FLINK-1422) Missing usage example for withParameters
[ https://issues.apache.org/jira/browse/FLINK-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561181#comment-14561181 ] Theodore Vasiloudis commented on FLINK-1422: Ran into this by chance, shouldn't it be closed? Missing usage example for withParameters -- Key: FLINK-1422 URL: https://issues.apache.org/jira/browse/FLINK-1422 Project: Flink Issue Type: Improvement Components: Documentation Affects Versions: 0.8.0 Reporter: Alexander Alexandrov Priority: Trivial Fix For: 0.8.2 Original Estimate: 1h Remaining Estimate: 1h I am struggling to find a usage example of the withParameters method in the documentation. At the moment I only see this note: {quote} Note: As the content of broadcast variables is kept in-memory on each node, it should not become too large. For simpler things like scalar values you can simply make parameters part of the closure of a function, or use the withParameters(...) method to pass in a configuration. {quote}
[GitHub] flink pull request: implement a simple session management
Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/681#issuecomment-105982832 Looks very nice now. Few comments: - I suggest to keep the JobID in the JobGraph final and force it to be passed via the constructor. The JobGraphGenerator would need to accept the JobID as a parameter then. - Let's keep the actor system in the `Client` alive. The startup of the actor system is the longest delay in all client commands. For proper interactive behavior, we need to keep it running. Start it on first use, and stop it when the client is shut down. We need to make sure the client is always shut down, in the remote executor, and in the CLI frontend. - We can probably remove the lastJobID from the client, it seems redundant.
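The "start on first use, stop on shutdown" lifecycle described in the second point can be sketched as a lazily initialized resource holder. This is plain Java with hypothetical names (`LazyService`, `startupCount`), not the actual `Client` or actor-system API:

```java
// Sketch of lazy startup with explicit shutdown: the expensive resource
// (standing in here for the client's actor system) is created once on
// first use, reused by later calls, and released when the holder closes.
final class LazyService implements AutoCloseable {
    private Object system;   // placeholder for the expensive actor system
    private int startups;    // counts how often startup actually ran

    synchronized Object get() {
        if (system == null) {          // first use: pay the startup cost once
            system = new Object();
            startups++;
        }
        return system;                 // later calls reuse the live instance
    }

    synchronized int startupCount() {
        return startups;
    }

    @Override
    public synchronized void close() {
        system = null;                 // "client is shut down": release it
    }
}
```

The design point in the review comment maps onto this shape: interactive commands call `get()` repeatedly and only the first call pays the startup delay, while the remote executor and CLI frontend must guarantee `close()` runs, e.g. via try-with-resources.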
[jira] [Created] (FLINK-2102) Add predict operation for LabeledVector
Theodore Vasiloudis created FLINK-2102: -- Summary: Add predict operation for LabeledVector Key: FLINK-2102 URL: https://issues.apache.org/jira/browse/FLINK-2102 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Theodore Vasiloudis Priority: Minor Fix For: 0.9 Currently we can only call predict on DataSet[V <: Vector]. A lot of times though we have a DataSet[LabeledVector] that we split into a train and test set. We should be able to make predictions on the test DataSet[LabeledVector] without having to transform it into a DataSet[Vector]
[GitHub] flink pull request: implement a simple session management
Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/681#issuecomment-105983100 After a chat with @mxm , we decided it would be good to merge this sooner and fix point 2 (keeping the actor system alive) in a separate patch. I'll try and address remarks (1) and (3) in the merging process.
[jira] [Commented] (FLINK-2101) Scheme Inference doesn't work for Tuple5
[ https://issues.apache.org/jira/browse/FLINK-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561147#comment-14561147 ] Rico Bergmann commented on FLINK-2101: -- Oops. You are right. Sorry for the confusion..
[jira] [Created] (FLINK-2101) Scheme Inference doesn't work for Tuple5
Rico Bergmann created FLINK-2101: Summary: Scheme Inference doesn't work for Tuple5 Key: FLINK-2101 URL: https://issues.apache.org/jira/browse/FLINK-2101 Project: Flink Issue Type: Bug Components: Kafka Connector Affects Versions: master Reporter: Rico Bergmann Calling addSink(new KafkaSink<Tuple5<String, String, String, Long, Double>>("localhost:9092", "webtrends.ec1601", new Utils.TypeInformationSerializationSchema<Tuple5<String, String, String, Long, Double>>(new Tuple5<String, String, String, Long, Double>(), env.getConfig()))); gives me an exception stating that the generic type infos are not given. Exception in thread "main" org.apache.flink.api.common.functions.InvalidTypesException: Tuple needs to be parameterized by using generics. at org.apache.flink.api.java.typeutils.TypeExtractor.createTypeInfoWithTypeHierarchy(TypeExtractor.java:396) at org.apache.flink.api.java.typeutils.TypeExtractor.privateCreateTypeInfo(TypeExtractor.java:339) at org.apache.flink.api.java.typeutils.TypeExtractor.createTypeInfo(TypeExtractor.java:318) at org.apache.flink.api.java.typeutils.ObjectArrayTypeInfo.<init>(ObjectArrayTypeInfo.java:45) at org.apache.flink.api.java.typeutils.ObjectArrayTypeInfo.getInfoFor(ObjectArrayTypeInfo.java:167) at org.apache.flink.api.java.typeutils.TypeExtractor.privateGetForClass(TypeExtractor.java:1150) at org.apache.flink.api.java.typeutils.TypeExtractor.privateGetForClass(TypeExtractor.java:1122) at org.apache.flink.api.java.typeutils.TypeExtractor.privateGetForObject(TypeExtractor.java:1476) at org.apache.flink.api.java.typeutils.TypeExtractor.getForObject(TypeExtractor.java:1446) at org.apache.flink.streaming.connectors.kafka.Utils$TypeInformationSerializationSchema.<init>(Utils.java:37) at de.otto.streamexample.WCExample.main(WCExample.java:132)
[jira] [Commented] (FLINK-2101) Scheme Inference doesn't work for Tuple5
[ https://issues.apache.org/jira/browse/FLINK-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561063#comment-14561063 ] Robert Metzger commented on FLINK-2101: --- Thank you for reporting the issue. I'll look into it.
[jira] [Commented] (FLINK-2101) Scheme Inference doesn't work for Tuple5
[ https://issues.apache.org/jira/browse/FLINK-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561068#comment-14561068 ] Stephan Ewen commented on FLINK-2101: - Type inference cannot work on generics, unfortunately, due to Java's generic type erasure. You have to either pass an instance of the Tuple that actually has non-null fields, or use a schema parser ({{TypeInfoParser}}). The exception could be better at this point, though ;-)
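The erasure point Stephan raises can be seen with plain Java reflection: an instance created via `new Tuple5<...>()` only carries its raw class at runtime, so there is nothing for a type extractor to inspect unless the fields hold actual values, whereas a *declared* generic type survives in class metadata. A minimal demo (pure JDK, not Flink code; `ErasureDemo` and `typedField` are illustrative names):

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

public class ErasureDemo {
    // The generic signature of a declaration is kept in class metadata...
    static List<String> typedField = new ArrayList<>();

    public static void main(String[] args) throws Exception {
        // ...but an instance only knows its raw class at runtime:
        List<String> instance = new ArrayList<>();
        System.out.println(instance.getClass());   // class java.util.ArrayList (no String anywhere)

        // The declared field still exposes List<String> via reflection:
        Field f = ErasureDemo.class.getDeclaredField("typedField");
        System.out.println(f.getGenericType());    // java.util.List<java.lang.String>
    }
}
```

This is also why passing a Tuple instance with non-null fields helps: the extractor can fall back to the concrete runtime classes of the field values, which erasure does not remove.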
[jira] [Commented] (FLINK-2081) Change order of restore state and open for Streaming Operators
[ https://issues.apache.org/jira/browse/FLINK-2081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561084#comment-14561084 ] Robert Metzger commented on FLINK-2081: --- Can we come to an agreement here? I think both approaches are equal. It's a matter of estimating which case is more frequent. I'll now implement the {{PersistentKafkaSource}} assuming that {{restoreState()}} is called before {{open()}}. Whoever changes the order as part of this issue has to change that in the KafkaSource as well. Change order of restore state and open for Streaming Operators -- Key: FLINK-2081 URL: https://issues.apache.org/jira/browse/FLINK-2081 Project: Flink Issue Type: Improvement Components: Streaming Affects Versions: 0.9 Reporter: Aljoscha Krettek Assignee: Aljoscha Krettek Right now, the order is restore state -> open. Users often set internal state in the open method; this would overwrite state that was restored from a checkpoint. If we change the order to open -> restore, this should not be a problem anymore.
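The ordering problem under discussion is easy to reproduce in miniature: if a user's open() assigns a default *after* restoreState() has run, the restored value is silently lost; running open() first avoids that. A toy model (hypothetical names, not the actual streaming operator API):

```java
// Toy operator: open() sets a user default, restoreState() installs a
// value recovered from a checkpoint.
class ToyOperator {
    int counter;
    void open() { counter = 0; }                    // user initializes state here
    void restoreState(int checkpointed) { counter = checkpointed; }
}

public class LifecycleOrder {
    public static void main(String[] args) {
        // Current order (restore -> open): open() clobbers the restored value.
        ToyOperator a = new ToyOperator();
        a.restoreState(42);
        a.open();
        System.out.println(a.counter);  // 0 -- checkpointed state lost

        // Proposed order (open -> restore): the restored value survives.
        ToyOperator b = new ToyOperator();
        b.open();
        b.restoreState(42);
        System.out.println(b.counter);  // 42
    }
}
```

Either convention works as long as every operator agrees on it, which is exactly the coordination concern in the comment above: a source written against restore-before-open breaks the moment the framework flips the order.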
[jira] [Commented] (FLINK-2101) Scheme Inference doesn't work for Tuple5
[ https://issues.apache.org/jira/browse/FLINK-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561072#comment-14561072 ] Robert Metzger commented on FLINK-2101: --- The {{KafkaITCase}} is the only code which is currently using the {{TypeInformationSerializationSchema}}. There, you can see how we are passing the Tuple object to the TypeExtractor: https://github.com/apache/flink/blob/3ef4e68bf8e7367c81b420d919e2172aebbcd507/flink-staging/flink-streaming/flink-streaming-connectors/src/test/java/org/apache/flink/streaming/connectors/kafka/KafkaITCase.java#L265 I'll improve the exception message.
[GitHub] flink pull request: [FLINK-1952] [jobmanager] Rework and fix slot ...
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/731 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-1952) Cannot run ConnectedComponents example: Could not allocate a slot on instance
[ https://issues.apache.org/jira/browse/FLINK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561106#comment-14561106 ] ASF GitHub Bot commented on FLINK-1952:
---
Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/731#issuecomment-105945508
Thanks! I'll address the comments, remove the accidentally added `log4j.properties` file and merge this.

Cannot run ConnectedComponents example: Could not allocate a slot on instance
-
Key: FLINK-1952
URL: https://issues.apache.org/jira/browse/FLINK-1952
Project: Flink
Issue Type: Bug
Components: Scheduler
Affects Versions: 0.9
Reporter: Robert Metzger
Priority: Blocker

Steps to reproduce:
{code}
./bin/yarn-session.sh -n 350
{code}
... wait until they are connected ...
{code}
Number of connected TaskManagers changed to 266. Slots available: 266
Number of connected TaskManagers changed to 323. Slots available: 323
Number of connected TaskManagers changed to 334. Slots available: 334
Number of connected TaskManagers changed to 343. Slots available: 343
Number of connected TaskManagers changed to 350. Slots available: 350
{code}
Start CC:
{code}
./bin/flink run -p 350 ./examples/flink-java-examples-0.9-SNAPSHOT-ConnectedComponents.jar
{code}
--- it runs.
Run KMeans, let it fail with:
{code}
Failed to deploy the task Map (Map at main(KMeans.java:100)) (1/350) - execution #0 to slot SimpleSlot (2)(2)(0) - 182b7661ca9547a84591de940c47a200 - ALLOCATED/ALIVE:
java.io.IOException: Insufficient number of network buffers: required 350, but only 254 available. The total number of network buffers is currently set to 2048. You can increase this number by setting the configuration key 'taskmanager.network.numberOfBuffers'.
{code}
... as expected.
(I've waited for 10 minutes between the two submissions.)
Starting CC now will fail:
{code}
./bin/flink run -p 350 ./examples/flink-java-examples-0.9-SNAPSHOT-ConnectedComponents.jar
{code}
Error message(s):
{code}
Caused by: java.lang.IllegalStateException: Could not schedule consumer vertex IterationHead(WorksetIteration (Unnamed Delta Iteration)) (19/350)
	at org.apache.flink.runtime.executiongraph.Execution$3.call(Execution.java:479)
	at org.apache.flink.runtime.executiongraph.Execution$3.call(Execution.java:469)
	at akka.dispatch.Futures$$anonfun$future$1.apply(Future.scala:94)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
	... 4 more
Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate a slot on instance 4a6d761cb084c32310ece1f849556faf @ cloud-19 - 1 slots - URL: akka.tcp://flink@130.149.21.23:51400/user/taskmanager, as required by the co-location constraint.
	at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:247)
	at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:110)
	at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:262)
	at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:436)
	at org.apache.flink.runtime.executiongraph.Execution$3.call(Execution.java:475)
	... 9 more
{code}
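The `IOException` from the KMeans run names the relevant knob itself: the network buffer pool is a cluster-wide setting in `flink-conf.yaml`. A hedged fragment follows; the value is illustrative, not taken from the ticket, and should be sized per the Flink configuration documentation (roughly slots-per-TaskManager squared, times number of TaskManagers, times 4, for all-to-all shuffles):

```yaml
# flink-conf.yaml -- illustrative value, not from the ticket
taskmanager.network.numberOfBuffers: 4096
```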
[GitHub] flink pull request: [FLINK-1952] [jobmanager] Rework and fix slot ...
Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/731#issuecomment-105945508 Thanks! I'll address the comments, remove the accidentally added `log4j.properties` file and merge this.
[GitHub] flink pull request: [FLINK-2004] Fix memory leak in presence of fa...
Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/674#issuecomment-105951860 I have those unit test style tests all in the KafkaITCase because they depend on the testing clusters started for the test. They all need at least a running Zookeeper instance. I can of course put them into a different class, but this will further slow down our tests because we spend more time starting and stopping Zookeeper. I've reworked the restore to reflect the open() / restoreState() order. The PR has been updated.
[jira] [Commented] (FLINK-2004) Memory leak in presence of failed checkpoints in KafkaSource
[ https://issues.apache.org/jira/browse/FLINK-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561122#comment-14561122 ] ASF GitHub Bot commented on FLINK-2004:
---
Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/674#issuecomment-105951860
I have those unit test style tests all in the KafkaITCase because they depend on the testing clusters started for the test. They all need at least a running Zookeeper instance. I can of course put them into a different class, but this will further slow down our tests because we spend more time starting and stopping Zookeeper. I've reworked the restore to reflect the open() / restoreState() order. The PR has been updated.

Memory leak in presence of failed checkpoints in KafkaSource
-
Key: FLINK-2004
URL: https://issues.apache.org/jira/browse/FLINK-2004
Project: Flink
Issue Type: Bug
Components: Streaming
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Robert Metzger
Priority: Critical
Fix For: 0.9

Checkpoints that fail never send a commit message to the tasks. Maintaining a map of all pending checkpoints introduces a memory leak, as entries for failed checkpoints will never be removed. Approaches to fix this:
- The source cleans up entries from older checkpoints once a checkpoint is committed (simple implementation in a linked hash map)
- The commit message could include the optional state handle (source needs not maintain the map)
- The checkpoint coordinator could send messages for failed checkpoints?
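The first approach listed in the ticket relies on checkpoint IDs increasing monotonically: when checkpoint N commits, every pending entry with ID <= N (including failed ones) can be dropped. A minimal sketch of that idea with an ordered map; all names are illustrative, not Flink's actual KafkaSource code:

```java
import java.util.TreeMap;

/**
 * Sketch of the "prune on commit" fix: committing a checkpoint removes
 * its state AND the state of every older (possibly failed) checkpoint,
 * so entries for failed checkpoints cannot accumulate forever.
 */
class PendingCheckpoints<S> {
    private final TreeMap<Long, S> pending = new TreeMap<>();

    /** Called when a checkpoint barrier passes and state is snapshotted. */
    void onCheckpoint(long checkpointId, S state) {
        pending.put(checkpointId, state);
    }

    /** Called on commit; returns the committed state and prunes older entries. */
    S onCommit(long checkpointId) {
        S state = pending.get(checkpointId);
        // headMap(id, true) views the committed id and all older ids;
        // clearing it also drops leaked entries of failed checkpoints.
        pending.headMap(checkpointId, true).clear();
        return state;
    }

    int size() {
        return pending.size();
    }
}
```

With this structure, a failed checkpoint's entry lives only until the next successful commit, bounding the map's size.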
[jira] [Commented] (FLINK-1874) Break up streaming connectors into submodules
[ https://issues.apache.org/jira/browse/FLINK-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561038#comment-14561038 ] ASF GitHub Bot commented on FLINK-1874:
---
Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/719#issuecomment-105926196
Should we so clearly differentiate between the streaming and the batch connector projects? Why not have a `flink-connector-kafka` that contains both a batch and a streaming source? They will probably share a lot of code anyways. Long term, when the streaming API supports batch API data sources (as bounded streams), this will come as a natural fit.

Break up streaming connectors into submodules
-
Key: FLINK-1874
URL: https://issues.apache.org/jira/browse/FLINK-1874
Project: Flink
Issue Type: Task
Components: Build System, Streaming
Affects Versions: 0.9
Reporter: Robert Metzger
Assignee: Márton Balassi
Labels: starter

As per: http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Break-up-streaming-connectors-into-subprojects-td5001.html
[GitHub] flink pull request: [scheduling] implement backtracking of interme...
Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/640#discussion_r31137592
--- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/ResultPartitionManager.java ---
@@ -137,9 +211,46 @@ void onConsumedPartition(ResultPartition partition) {
 		// Release the partition if it was successfully removed
 		if (partition == previous) {
-			partition.release();
+			// move to cache if cachable
+			updateIntermediateResultPartitionCache(partition);
 
-			LOG.debug("Released {}.", partition);
+			LOG.debug("Cached {}.", partition);
 		}
 	}
 
+	/**
+	 * Triggered by {@link NetworkBufferPool} when network buffers should be freed.
+	 * @param requiredBuffers The number of buffers that should be cleared.
+	 */
+	public boolean releaseLeastRecentlyUsedCachedPartitions(int requiredBuffers) {
+		synchronized (cachedResultPartitions) {
+			// make a list of ResultPartitions to release
+			List<ResultPartition> toBeReleased = new ArrayList<ResultPartition>();
+			int numBuffersToBeFreed = 0;
+
+			// traverse from least recently used cached ResultPartition
+			for (Map.Entry<IntermediateResultPartitionID, ResultPartition> entry : cachedResultPartitions.entrySet()) {
+				ResultPartition cachedResult = entry.getValue();
+
+				synchronized (registeredPartitions) {
+					if (!registeredPartitions.containsValue(cachedResult)) {
+						if (numBuffersToBeFreed < requiredBuffers) {
+							toBeReleased.add(cachedResult);
+							numBuffersToBeFreed += cachedResult.getTotalNumberOfBuffers();
--- End diff --
Sorry for the bad naming, but this is the total number of produced buffers. We need the number of buffers which are currently in memory.
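The traversal being reviewed (walk cached partitions from least recently used and collect them until enough buffers would be freed) can be sketched with a `LinkedHashMap` in access order. Everything below is illustrative and plain JDK, not the PR's actual code:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;

/** Illustrative LRU cache of result partitions; names are made up. */
class PartitionCache {
    static final class CachedPartition {
        final String id;
        final int inMemoryBuffers; // buffers this partition currently holds in memory
        CachedPartition(String id, int inMemoryBuffers) {
            this.id = id;
            this.inMemoryBuffers = inMemoryBuffers;
        }
    }

    // accessOrder=true: iteration goes from least to most recently used
    private final LinkedHashMap<String, CachedPartition> cache =
            new LinkedHashMap<>(16, 0.75f, true);

    void put(CachedPartition p) {
        cache.put(p.id, p);
    }

    /**
     * Selects partitions to release, starting from the least recently
     * used one, until at least requiredBuffers would become available.
     */
    List<CachedPartition> selectForRelease(int requiredBuffers) {
        List<CachedPartition> toBeReleased = new ArrayList<>();
        int freed = 0;
        Iterator<CachedPartition> it = cache.values().iterator();
        while (it.hasNext() && freed < requiredBuffers) {
            CachedPartition p = it.next();
            toBeReleased.add(p);
            freed += p.inMemoryBuffers;
        }
        return toBeReleased;
    }
}
```

Note the count that matters here is the in-memory buffer count per partition, which is exactly the naming issue raised in the review comment.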
[GitHub] flink pull request: [FLINK-1874] [streaming] Connector breakup
Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/719#issuecomment-105926196 Should we so clearly differentiate between the streaming and the batch connector projects? Why not have a `flink-connector-kafka` that contains both a batch and a streaming source? They will probably share a lot of code anyways. Long term, when the streaming API supports batch API data sources (as bounded streams), this will come as a natural fit.
[jira] [Commented] (FLINK-2004) Memory leak in presence of failed checkpoints in KafkaSource
[ https://issues.apache.org/jira/browse/FLINK-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561037#comment-14561037 ] ASF GitHub Bot commented on FLINK-2004:
---
Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/674#issuecomment-105926083
Sorry for not reacting. I was busy with other stuff but I'll now address your comments.

Memory leak in presence of failed checkpoints in KafkaSource
-
Key: FLINK-2004
URL: https://issues.apache.org/jira/browse/FLINK-2004
Project: Flink
Issue Type: Bug
Components: Streaming
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Robert Metzger
Priority: Critical
Fix For: 0.9

Checkpoints that fail never send a commit message to the tasks. Maintaining a map of all pending checkpoints introduces a memory leak, as entries for failed checkpoints will never be removed. Approaches to fix this:
- The source cleans up entries from older checkpoints once a checkpoint is committed (simple implementation in a linked hash map)
- The commit message could include the optional state handle (source needs not maintain the map)
- The checkpoint coordinator could send messages for failed checkpoints?
[GitHub] flink pull request: [scheduling] implement backtracking of interme...
Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/640#discussion_r31138771
--- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/ResultSubpartition.java ---
@@ -71,6 +71,11 @@ protected void onConsumedSubpartition() {
 	abstract public void finish() throws IOException;
 
+	/**
--- End diff --
Sorry for the missing comment. This actually discards the result partition.
[jira] [Closed] (FLINK-1952) Cannot run ConnectedComponents example: Could not allocate a slot on instance
[ https://issues.apache.org/jira/browse/FLINK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen closed FLINK-1952.
---
Cannot run ConnectedComponents example: Could not allocate a slot on instance
-
Key: FLINK-1952
URL: https://issues.apache.org/jira/browse/FLINK-1952
Project: Flink
Issue Type: Bug
Components: Scheduler
Affects Versions: 0.9
Reporter: Robert Metzger
Assignee: Stephan Ewen
Priority: Blocker
Fix For: 0.9

Steps to reproduce:
{code}
./bin/yarn-session.sh -n 350
{code}
... wait until they are connected ...
{code}
Number of connected TaskManagers changed to 266. Slots available: 266
Number of connected TaskManagers changed to 323. Slots available: 323
Number of connected TaskManagers changed to 334. Slots available: 334
Number of connected TaskManagers changed to 343. Slots available: 343
Number of connected TaskManagers changed to 350. Slots available: 350
{code}
Start CC:
{code}
./bin/flink run -p 350 ./examples/flink-java-examples-0.9-SNAPSHOT-ConnectedComponents.jar
{code}
--- it runs.
Run KMeans, let it fail with:
{code}
Failed to deploy the task Map (Map at main(KMeans.java:100)) (1/350) - execution #0 to slot SimpleSlot (2)(2)(0) - 182b7661ca9547a84591de940c47a200 - ALLOCATED/ALIVE:
java.io.IOException: Insufficient number of network buffers: required 350, but only 254 available. The total number of network buffers is currently set to 2048. You can increase this number by setting the configuration key 'taskmanager.network.numberOfBuffers'.
{code}
... as expected.
(I've waited for 10 minutes between the two submissions.)
Starting CC now will fail:
{code}
./bin/flink run -p 350 ./examples/flink-java-examples-0.9-SNAPSHOT-ConnectedComponents.jar
{code}
Error message(s):
{code}
Caused by: java.lang.IllegalStateException: Could not schedule consumer vertex IterationHead(WorksetIteration (Unnamed Delta Iteration)) (19/350)
	at org.apache.flink.runtime.executiongraph.Execution$3.call(Execution.java:479)
	at org.apache.flink.runtime.executiongraph.Execution$3.call(Execution.java:469)
	at akka.dispatch.Futures$$anonfun$future$1.apply(Future.scala:94)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
	... 4 more
Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate a slot on instance 4a6d761cb084c32310ece1f849556faf @ cloud-19 - 1 slots - URL: akka.tcp://flink@130.149.21.23:51400/user/taskmanager, as required by the co-location constraint.
	at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:247)
	at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:110)
	at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:262)
	at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:436)
	at org.apache.flink.runtime.executiongraph.Execution$3.call(Execution.java:475)
	... 9 more
{code}