http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/apis/zip_elements_guide.md ---------------------------------------------------------------------- diff --git a/docs/apis/zip_elements_guide.md b/docs/apis/zip_elements_guide.md deleted file mode 100644 index b636fe4..0000000 --- a/docs/apis/zip_elements_guide.md +++ /dev/null @@ -1,106 +0,0 @@ ---- -title: "Zipping Elements in a DataSet" ---- -<!-- -Licensed to the Apache Software Foundation (ASF) under one -or more contributor license agreements. See the NOTICE file -distributed with this work for additional information -regarding copyright ownership. The ASF licenses this file -to you under the Apache License, Version 2.0 (the -"License"); you may not use this file except in compliance -with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, -software distributed under the License is distributed on an -"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -KIND, either express or implied. See the License for the -specific language governing permissions and limitations -under the License. ---> - -In certain algorithms, one may need to assign unique identifiers to data set elements. -This document shows how {% gh_link /flink-java/src/main/java/org/apache/flink/api/java/utils/DataSetUtils.java "DataSetUtils" %} can be used for that purpose. - -* This will be replaced by the TOC -{:toc} - -### Zip with a Dense Index -For assigning consecutive labels to the elements, the `zipWithIndex` method should be called. It receives a data set as input and returns a new data set of unique id, initial value tuples. 
-For example, the following code: - -<div class="codetabs" markdown="1"> -<div data-lang="java" markdown="1"> -{% highlight java %} -ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); -env.setParallelism(1); -DataSet<String> in = env.fromElements("A", "B", "C", "D", "E", "F"); - -DataSet<Tuple2<Long, String>> result = DataSetUtils.zipWithIndex(in); - -result.writeAsCsv(resultPath, "\n", ","); -env.execute(); -{% endhighlight %} -</div> - -<div data-lang="scala" markdown="1"> -{% highlight scala %} -import org.apache.flink.api.scala._ - -val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment -env.setParallelism(1) -val input: DataSet[String] = env.fromElements("A", "B", "C", "D", "E", "F") - -val result: DataSet[(Long, String)] = input.zipWithIndex - -result.writeAsCsv(resultPath, "\n", ",") -env.execute() -{% endhighlight %} -</div> - -</div> - -will yield the tuples: (0,A), (1,B), (2,C), (3,D), (4,E), (5,F) - -[Back to top](#top) - -### Zip with an Unique Identifier -In many cases, one may not need to assign consecutive labels. -`zipWIthUniqueId` works in a pipelined fashion, speeding up the label assignment process. This method receives a data set as input and returns a new data set of unique id, initial value tuples. 
-For example, the following code: - -<div class="codetabs" markdown="1"> -<div data-lang="java" markdown="1"> -{% highlight java %} -ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); -env.setParallelism(1); -DataSet<String> in = env.fromElements("A", "B", "C", "D", "E", "F"); - -DataSet<Tuple2<Long, String>> result = DataSetUtils.zipWithUniqueId(in); - -result.writeAsCsv(resultPath, "\n", ","); -env.execute(); -{% endhighlight %} -</div> - -<div data-lang="scala" markdown="1"> -{% highlight scala %} -import org.apache.flink.api.scala._ - -val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment -env.setParallelism(1) -val input: DataSet[String] = env.fromElements("A", "B", "C", "D", "E", "F") - -val result: DataSet[(Long, String)] = input.zipWithUniqueId - -result.writeAsCsv(resultPath, "\n", ",") -env.execute() -{% endhighlight %} -</div> - -</div> - -will yield the tuples: (0,A), (2,B), (4,C), (6,D), (8,E), (10,F) - -[Back to top](#top) \ No newline at end of file
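Note on the removed guide above: the two labelling strategies it describes can be illustrated without a Flink dependency. The following is a minimal plain-Java sketch, not the code in `DataSetUtils`. The class name `ZipSketch`, the explicit list-of-partitions input, the two-pass counting scheme for the dense index, and the `counter * parallelism + task` formula for unique ids are all assumptions made for illustration. The concrete ids Flink produces (for example those listed in the removed guide) may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ZipSketch {

    // Dense index (assumed two-pass scheme): first count the elements of
    // every partition, then label each partition starting at the sum of
    // the counts of all earlier partitions.
    public static List<String> zipWithDenseIndex(List<List<String>> partitions) {
        long[] offsets = new long[partitions.size()];
        long running = 0;
        for (int p = 0; p < partitions.size(); p++) {
            offsets[p] = running;
            running += partitions.get(p).size();
        }
        List<String> out = new ArrayList<>();
        for (int p = 0; p < partitions.size(); p++) {
            long next = offsets[p];
            for (String value : partitions.get(p)) {
                out.add("(" + next++ + "," + value + ")");
            }
        }
        return out;
    }

    // Unique but non-consecutive ids in a single pass (hypothetical formula):
    // each task interleaves its local counter with its own task index, so no
    // task needs to know how many elements the other tasks hold.
    public static List<Long> uniqueIds(List<List<String>> partitions) {
        int parallelism = partitions.size();
        List<Long> ids = new ArrayList<>();
        for (int task = 0; task < parallelism; task++) {
            long counter = 0;
            for (String ignored : partitions.get(task)) {
                ids.add(counter++ * parallelism + task);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("A", "B"),
            Arrays.asList("C", "D", "E"),
            Arrays.asList("F"));
        System.out.println(zipWithDenseIndex(partitions));
        System.out.println(uniqueIds(partitions));
    }
}
```

The dense variant needs the per-partition counts before any label can be emitted, which is why the guide presents the unique-id variant as the pipelined, faster option when consecutive labels are not required.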
http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/internals/add_operator.md ---------------------------------------------------------------------- diff --git a/docs/internals/add_operator.md b/docs/internals/add_operator.md index 241304d..8dad0af 100644 --- a/docs/internals/add_operator.md +++ b/docs/internals/add_operator.md @@ -1,5 +1,9 @@ --- title: "How to add a new Operator" +# Top navigation +top-nav-group: internals +top-nav-pos: 8 +top-nav-title: "How-To: Add an Operator" --- <!-- Licensed to the Apache Software Foundation (ASF) under one http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/internals/general_arch.md ---------------------------------------------------------------------- diff --git a/docs/internals/general_arch.md b/docs/internals/general_arch.md index b49d9ef..dc50eb7 100644 --- a/docs/internals/general_arch.md +++ b/docs/internals/general_arch.md @@ -1,5 +1,9 @@ --- title: "General Architecture and Process Model" +# Top navigation +top-nav-group: internals +top-nav-pos: 3 +top-nav-title: Architecture and Process Model --- <!-- Licensed to the Apache Software Foundation (ASF) under one http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/internals/ide_setup.md ---------------------------------------------------------------------- diff --git a/docs/internals/ide_setup.md b/docs/internals/ide_setup.md index 1e0e77a..1b0b91a 100644 --- a/docs/internals/ide_setup.md +++ b/docs/internals/ide_setup.md @@ -1,5 +1,8 @@ --- -title: "IDE setup" +title: "IDE Setup" +# Top navigation +top-nav-group: internals +top-nav-pos: 1 --- <!-- Licensed to the Apache Software Foundation (ASF) under one http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/internals/job_scheduling.md ---------------------------------------------------------------------- diff --git a/docs/internals/job_scheduling.md b/docs/internals/job_scheduling.md index 7e24cdb..cce78d9 100644 --- a/docs/internals/job_scheduling.md +++ 
b/docs/internals/job_scheduling.md @@ -1,5 +1,8 @@ --- title: "Jobs and Scheduling" +# Top navigation +top-nav-group: internals +top-nav-pos: 7 --- <!-- Licensed to the Apache Software Foundation (ASF) under one http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/internals/logging.md ---------------------------------------------------------------------- diff --git a/docs/internals/logging.md b/docs/internals/logging.md index dee3d01..d2c0cba 100644 --- a/docs/internals/logging.md +++ b/docs/internals/logging.md @@ -1,5 +1,9 @@ --- title: "How to use logging" +# Top navigation +top-nav-group: internals +top-nav-pos: 2 +top-nav-title: Logging --- <!-- Licensed to the Apache Software Foundation (ASF) under one http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/internals/monitoring_rest_api.md ---------------------------------------------------------------------- diff --git a/docs/internals/monitoring_rest_api.md b/docs/internals/monitoring_rest_api.md index 643db6b..70952f5 100644 --- a/docs/internals/monitoring_rest_api.md +++ b/docs/internals/monitoring_rest_api.md @@ -1,5 +1,8 @@ --- title: "Monitoring REST API" +# Top navigation +top-nav-group: internals +top-nav-pos: 6 --- <!-- Licensed to the Apache Software Foundation (ASF) under one http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/internals/stream_checkpointing.md ---------------------------------------------------------------------- diff --git a/docs/internals/stream_checkpointing.md b/docs/internals/stream_checkpointing.md index dbfd341..af005b7 100644 --- a/docs/internals/stream_checkpointing.md +++ b/docs/internals/stream_checkpointing.md @@ -1,5 +1,9 @@ --- title: "Data Streaming Fault Tolerance" +# Top navigation +top-nav-group: internals +top-nav-pos: 4 +top-nav-title: Fault Tolerance for Data Streaming --- <!-- Licensed to the Apache Software Foundation (ASF) under one 
http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/internals/types_serialization.md ---------------------------------------------------------------------- diff --git a/docs/internals/types_serialization.md b/docs/internals/types_serialization.md index 8a93ccd..7ff21f2 100644 --- a/docs/internals/types_serialization.md +++ b/docs/internals/types_serialization.md @@ -1,5 +1,8 @@ --- title: "Type Extraction and Serialization" +# Top navigation +top-nav-group: internals +top-nav-pos: 5 --- <!-- Licensed to the Apache Software Foundation (ASF) under one http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/libs/gelly_guide.md ---------------------------------------------------------------------- diff --git a/docs/libs/gelly_guide.md b/docs/libs/gelly_guide.md index ff14506..a3ede7b 100644 --- a/docs/libs/gelly_guide.md +++ b/docs/libs/gelly_guide.md @@ -1,5 +1,14 @@ --- title: "Gelly: Flink Graph API" +# Top navigation +top-nav-group: libs +top-nav-pos: 1 +top-nav-title: "Graphs: Gelly" +# Sub navigation +sub-nav-group: batch +sub-nav-parent: libs +sub-nav-pos: 1 +sub-nav-title: Gelly --- <!-- Licensed to the Apache Software Foundation (ASF) under one @@ -20,8 +29,6 @@ specific language governing permissions and limitations under the License. --> -<a href="#top"></a> - Gelly is a Graph API for Flink. It contains a set of methods and utilities which aim to simplify the development of graph analysis applications in Flink. In Gelly, graphs can be transformed and modified using high-level functions similar to the ones provided by the batch processing API. Gelly provides methods to create, transform and modify graphs, as well as a library of graph algorithms. 
* This will be replaced by the TOC @@ -114,7 +121,7 @@ val weight = e.getValue // weight = 0.5 </div> </div> -[Back to top](#top) +{% top %} Graph Creation ----------- @@ -219,11 +226,11 @@ val graph = Graph.fromTupleDataSet(vertexTuples, edgeTuples, env) {% endhighlight %} * from a CSV file of Edge data and an optional CSV file of Vertex data. -In this case, Gelly will convert each row from the Edge CSV file to an `Edge`. -The first field of the each row will be the source ID, the second field will be the target ID and the third field (if present) will be the edge value. +In this case, Gelly will convert each row from the Edge CSV file to an `Edge`. +The first field of the each row will be the source ID, the second field will be the target ID and the third field (if present) will be the edge value. If the edges have no associated value, set the edge value type parameter (3rd type argument) to `NullValue`. You can also specify that the vertices are initialized with a vertex value. -If you provide a path to a CSV file via `pathVertices`, each row of this file will be converted to a `Vertex`. +If you provide a path to a CSV file via `pathVertices`, each row of this file will be converted to a `Vertex`. The first field of each row will be the vertex ID and the second field will be the vertex value. If you provide a vertex value initializer `MapFunction` via the `vertexValueInitializer` parameter, then this function is used to generate the vertex values. The set of vertices will be created automatically from the edges input. 
@@ -244,7 +251,7 @@ val graph = Graph.fromCsvReader[String, Long, Double]( val simpleGraph = Graph.fromCsvReader[Long, NullValue, NullValue]( pathEdges = "path/to/edge/input", env = env) - + // create a Graph with Double Vertex values generated by a vertex value initializer and no Edge values val simpleGraph = Graph.fromCsvReader[Long, Double, NullValue]( pathEdges = "path/to/edge/input", @@ -279,11 +286,11 @@ If no vertex input is provided during Graph creation, Gelly will automatically p ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); // initialize the vertex value to be equal to the vertex ID -Graph<Long, Long, String> graph = Graph.fromCollection(edgeList, +Graph<Long, Long, String> graph = Graph.fromCollection(edgeList, new MapFunction<Long, Long>() { - public Long map(Long value) { - return value; - } + public Long map(Long value) { + return value; + } }, env); {% endhighlight %} </div> @@ -313,7 +320,7 @@ val graph = Graph.fromCollection(edgeList, </div> </div> -[Back to top](#top) +{% top %} Graph Properties ------------ @@ -333,10 +340,10 @@ DataSet<Edge<K, EV>> getEdges() DataSet<K> getVertexIds() // get the source-target pairs of the edge IDs as a DataSet -DataSet<Tuple2<K, K>> getEdgeIds() +DataSet<Tuple2<K, K>> getEdgeIds() // get a DataSet of <vertex ID, in-degree> pairs for all vertices -DataSet<Tuple2<K, Long>> inDegrees() +DataSet<Tuple2<K, Long>> inDegrees() // get a DataSet of <vertex ID, out-degree> pairs for all vertices DataSet<Tuple2<K, Long>> outDegrees() @@ -392,7 +399,7 @@ getTriplets: DataSet[Triplet[K, VV, EV]] </div> </div> -[Back to top](#top) +{% top %} Graph Transformations ----------------- @@ -510,14 +517,14 @@ val networkWithWeights = network.joinWithEdgesOnSource(vertexOutDegrees, (v1: Do * <strong>Difference</strong>: Gelly's `difference()` method performs a difference on the vertex and edge sets of the current graph and the specified graph. 
* <strong>Intersect</strong>: Gelly's `intersect()` method performs an intersect on the edge - sets of the current graph and the specified graph. The result is a new `Graph` that contains all + sets of the current graph and the specified graph. The result is a new `Graph` that contains all edges that exist in both input graphs. Two edges are considered equal, if they have the same source - identifier, target identifier and edge value. Vertices in the resulting graph have no - value. If vertex values are required, one can for example retrieve them from one of the input graphs using + identifier, target identifier and edge value. Vertices in the resulting graph have no + value. If vertex values are required, one can for example retrieve them from one of the input graphs using the `joinWithVertices()` method. - Depending on the parameter `distinct`, equal edges are either contained once in the resulting + Depending on the parameter `distinct`, equal edges are either contained once in the resulting `Graph` or as often as there are pairs of equal edges in the input graphs. - + <div class="codetabs" markdown="1"> <div data-lang="java" markdown="1"> {% highlight java %} @@ -562,7 +569,7 @@ val intersect2 = graph1.intersect(graph2, false) </div> </div> --[Back to top](#top) +-{% top %} Graph Mutations ----------- @@ -734,7 +741,7 @@ Graph<Long, Long, Double> graph = ... DataSet<Tuple2<Vertex<Long, Long>, Vertex<Long, Long>>> vertexPairs = graph.groupReduceOnNeighbors(new SelectLargeWeightNeighbors(), EdgeDirection.OUT); // user-defined function to select the neighbors which have edges with weight > 0.5 -static final class SelectLargeWeightNeighbors implements NeighborsFunctionWithVertexValue<Long, Long, Double, +static final class SelectLargeWeightNeighbors implements NeighborsFunctionWithVertexValue<Long, Long, Double, Tuple2<Vertex<Long, Long>, Vertex<Long, Long>>> { @Override @@ -759,7 +766,7 @@ val graph: Graph[Long, Long, Double] = ... 
val vertexPairs = graph.groupReduceOnNeighbors(new SelectLargeWeightNeighbors, EdgeDirection.OUT) // user-defined function to select the neighbors which have edges with weight > 0.5 -final class SelectLargeWeightNeighbors extends NeighborsFunctionWithVertexValue[Long, Long, Double, +final class SelectLargeWeightNeighbors extends NeighborsFunctionWithVertexValue[Long, Long, Double, (Vertex[Long, Long], Vertex[Long, Long])] { override def iterateNeighbors(vertex: Vertex[Long, Long], @@ -779,7 +786,7 @@ final class SelectLargeWeightNeighbors extends NeighborsFunctionWithVertexValue[ When the aggregation computation does not require access to the vertex value (for which the aggregation is performed), it is advised to use the more efficient `EdgesFunction` and `NeighborsFunction` for the user-defined functions. When access to the vertex value is required, one should use `EdgesFunctionWithVertexValue` and `NeighborsFunctionWithVertexValue` instead. -[Back to top](#top) +{% top %} Iterative Graph Processing ----------- @@ -809,7 +816,7 @@ Let us consider computing Single-Source-Shortest-Paths with vertex-centric itera <div class="codetabs" markdown="1"> <div data-lang="java" markdown="1"> -{% highlight java %} +{% highlight java %} // read the input graph Graph<Long, Double, Double> graph = ... @@ -858,7 +865,7 @@ public static final class VertexDistanceUpdater extends VertexUpdateFunction<Lon </div> <div data-lang="scala" markdown="1"> -{% highlight scala %} +{% highlight scala %} // read the input graph val graph: Graph[Long, Double, Double] = ... @@ -906,23 +913,23 @@ final class VertexDistanceUpdater extends VertexUpdateFunction[Long, Double, Dou </div> </div> -[Back to top](#top) +{% top %} ### Configuring a Vertex-Centric Iteration A vertex-centric iteration can be configured using a `VertexCentricConfiguration` object. Currently, the following parameters can be specified: -* <strong>Name</strong>: The name for the vertex-centric iteration. 
The name is displayed in logs and messages +* <strong>Name</strong>: The name for the vertex-centric iteration. The name is displayed in logs and messages and can be specified using the `setName()` method. -* <strong>Parallelism</strong>: The parallelism for the iteration. It can be set using the `setParallelism()` method. +* <strong>Parallelism</strong>: The parallelism for the iteration. It can be set using the `setParallelism()` method. * <strong>Solution set in unmanaged memory</strong>: Defines whether the solution set is kept in managed memory (Flink's internal way of keeping objects in serialized form) or as a simple object map. By default, the solution set runs in managed memory. This property can be set using the `setSolutionSetUnmanagedMemory()` method. * <strong>Aggregators</strong>: Iteration aggregators can be registered using the `registerAggregator()` method. An iteration aggregator combines all aggregates globally once per superstep and makes them available in the next superstep. Registered aggregators can be accessed inside the user-defined `VertexUpdateFunction` and `MessagingFunction`. -* <strong>Broadcast Variables</strong>: DataSets can be added as [Broadcast Variables]({{site.baseurl}}/apis/programming_guide.html#broadcast-variables) to the `VertexUpdateFunction` and `MessagingFunction`, using the `addBroadcastSetForUpdateFunction()` and `addBroadcastSetForMessagingFunction()` methods, respectively. +* <strong>Broadcast Variables</strong>: DataSets can be added as [Broadcast Variables]({{site.baseurl}}/apis/batch/index.html#broadcast-variables) to the `VertexUpdateFunction` and `MessagingFunction`, using the `addBroadcastSetForUpdateFunction()` and `addBroadcastSetForMessagingFunction()` methods, respectively. * <strong>Number of Vertices</strong>: Accessing the total number of vertices within the iteration. This property can be set using the `setOptNumVertices()` method. 
The number of vertices can then be accessed in the vertex update function and in the messaging function using the `getNumberOfVertices()` method. If the option is not set in the configuration, this method will return -1. @@ -952,7 +959,7 @@ parameters.setParallelism(16); parameters.registerAggregator("sumAggregator", new LongSumAggregator()); // run the vertex-centric iteration, also passing the configuration parameters -Graph<Long, Double, Double> result = +Graph<Long, Double, Double> result = graph.runVertexCentricIteration( new VertexUpdater(), new Messenger(), maxIterations, parameters); @@ -962,14 +969,14 @@ public static final class VertexUpdater extends VertexUpdateFunction { LongSumAggregator aggregator = new LongSumAggregator(); public void preSuperstep() { - + // retrieve the Aggregator aggregator = getIterationAggregator("sumAggregator"); } public void updateVertex(Vertex<Long, Long> vertex, MessageIterator inMessages) { - + //do some computation Long partialValue = ... @@ -1011,14 +1018,14 @@ final class VertexUpdater extends VertexUpdateFunction { var aggregator = new LongSumAggregator override def preSuperstep { - + // retrieve the Aggregator aggregator = getIterationAggregator("sumAggregator") } override def updateVertex(vertex: Vertex[Long, Long], inMessages: MessageIterator[Long]) { - + //do some computation val partialValue = ... @@ -1162,7 +1169,7 @@ final class Messenger {...} </div> </div> -[Back to top](#top) +{% top %} ### Gather-Sum-Apply Iterations Like in the vertex-centric model, Gather-Sum-Apply also proceeds in synchronized iterative steps, called supersteps. Each superstep consists of the following three phases: @@ -1289,7 +1296,7 @@ Note that `gather` takes a `Neighbor` type as an argument. 
This is a convenience For more examples of how to implement algorithms with the Gather-Sum-Apply model, check the {% gh_link /flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/library/GSAPageRank.java "GSAPageRank" %} and {% gh_link /flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/library/GSAConnectedComponents.java "GSAConnectedComponents" %} library methods of Gelly. -[Back to top](#top) +{% top %} ### Configuring a Gather-Sum-Apply Iteration A GSA iteration can be configured using a `GSAConfiguration` object. @@ -1303,7 +1310,7 @@ Currently, the following parameters can be specified: * <strong>Aggregators</strong>: Iteration aggregators can be registered using the `registerAggregator()` method. An iteration aggregator combines all aggregates globally once per superstep and makes them available in the next superstep. Registered aggregators can be accessed inside the user-defined `GatherFunction`, `SumFunction` and `ApplyFunction`. -* <strong>Broadcast Variables</strong>: DataSets can be added as [Broadcast Variables]({{site.baseurl}}/apis/programming_guide.html#broadcast-variables) to the `GatherFunction`, `SumFunction` and `ApplyFunction`, using the methods `addBroadcastSetForGatherFunction()`, `addBroadcastSetForSumFunction()` and `addBroadcastSetForApplyFunction` methods, respectively. +* <strong>Broadcast Variables</strong>: DataSets can be added as [Broadcast Variables]({{site.baseurl}}/apis/index.html#broadcast-variables) to the `GatherFunction`, `SumFunction` and `ApplyFunction`, using the methods `addBroadcastSetForGatherFunction()`, `addBroadcastSetForSumFunction()` and `addBroadcastSetForApplyFunction` methods, respectively. * <strong>Number of Vertices</strong>: Accessing the total number of vertices within the iteration. This property can be set using the `setOptNumVertices()` method. The number of vertices can then be accessed in the gather, sum and/or apply functions by using the `getNumberOfVertices()` method. 
If the option is not set in the configuration, this method will return -1. @@ -1433,7 +1440,7 @@ val result = graph.runGatherSumApplyIteration(new Gather, new Sum, new Apply, ma {% endhighlight %} </div> </div> -[Back to top](#top) +{% top %} ### Vertex-centric and GSA Comparison As seen in the examples above, Gather-Sum-Apply iterations are quite similar to vertex-centric iterations. In fact, any algorithm which can be expressed as a GSA iteration can also be written in the vertex-centric model. @@ -1466,7 +1473,7 @@ List<Edge<Long, Long>> edges = ... Graph<Long, Long, Long> graph = Graph.fromCollection(vertices, edges, env); // will return false: 6 is an invalid ID -graph.validate(new InvalidVertexIdsValidator<Long, Long, Long>()); +graph.validate(new InvalidVertexIdsValidator<Long, Long, Long>()); {% endhighlight %} </div> @@ -1490,7 +1497,7 @@ graph.validate(new InvalidVertexIdsValidator[Long, Long, Long]) </div> </div> -[Back to top](#top) +{% top %} Library Methods ----------- @@ -1552,7 +1559,7 @@ This library method is an implementation of the community detection algorithm de The algorithm is implemented using [vertex-centric iterations](#vertex-centric-iterations). Initially, each vertex is assigned a `Tuple2` containing its initial value along with a score equal to 1.0. In each iteration, vertices send their labels and scores to their neighbors. Upon receiving messages from its neighbors, -a vertex chooses the label with the highest score and subsequently re-scores it using the edge values, +a vertex chooses the label with the highest score and subsequently re-scores it using the edge values, a user-defined hop attenuation parameter, `delta`, and the superstep number. The algorithm converges when vertices no longer update their value or when the maximum number of iterations is reached. 
@@ -1690,25 +1697,25 @@ Each `Tuple3` corresponds to a triangle, with the fields containing the IDs of t ### Summarization #### Overview -The summarization algorithm computes a condensed version of the input graph by grouping vertices and edges based on +The summarization algorithm computes a condensed version of the input graph by grouping vertices and edges based on their values. In doing so, the algorithm helps to uncover insights about patterns and distributions in the graph. One possible use case is the visualization of communities where the whole graph is too large and needs to be summarized based on the community identifier stored at a vertex. #### Details -In the resulting graph, each vertex represents a group of vertices that share the same value. An edge, that connects a -vertex with itself, represents all edges with the same edge value that connect vertices from the same vertex group. An -edge between different vertices in the output graph represents all edges with the same edge value between members of +In the resulting graph, each vertex represents a group of vertices that share the same value. An edge, that connects a +vertex with itself, represents all edges with the same edge value that connect vertices from the same vertex group. An +edge between different vertices in the output graph represents all edges with the same edge value between members of different vertex groups in the input graph. The algorithm is implemented using Flink data operators. First, vertices are grouped by their value and a representative -is chosen from each group. For any edge, the source and target vertex identifiers are replaced with the corresponding +is chosen from each group. For any edge, the source and target vertex identifiers are replaced with the corresponding representative and grouped by source, target and edge value. Output vertices and edges are created from their corresponding groupings. 
#### Usage The algorithm takes a directed, vertex (and possibly edge) attributed graph as input and outputs a new graph where each -vertex represents a group of vertices and each edge represents a group of edges from the input graph. Furthermore, each +vertex represents a group of vertices and each edge represents a group of edges from the input graph. Furthermore, each vertex and edge in the output graph stores the common group value and the number of represented elements. -[Back to top](#top) +{% top %} http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/libs/index.md ---------------------------------------------------------------------- diff --git a/docs/libs/index.md b/docs/libs/index.md index cf5f846..b2df0c4 100644 --- a/docs/libs/index.md +++ b/docs/libs/index.md @@ -1,5 +1,9 @@ --- title: "Libraries" +sub-nav-group: batch +sub-nav-id: libs +sub-nav-pos: 6 +sub-nav-title: Libraries --- <!-- Licensed to the Apache Software Foundation (ASF) under one @@ -18,4 +22,8 @@ software distributed under the License is distributed on an KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
---> \ No newline at end of file +--> + +- Graph processing: [Gelly](gelly_guide.html) +- Machine Learning: [FlinkML](ml/index.html) +- Relational Queries: [Table](table.html) http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/libs/ml/als.md ---------------------------------------------------------------------- diff --git a/docs/libs/ml/als.md b/docs/libs/ml/als.md index bd45bb0..cf85399 100644 --- a/docs/libs/ml/als.md +++ b/docs/libs/ml/als.md @@ -1,7 +1,11 @@ --- mathjax: include -htmlTitle: FlinkML - Alternating Least Squares -title: <a href="../ml">FlinkML</a> - Alternating Least Squares +title: FlinkML - Alternating Least Squares + +# Sub navigation +sub-nav-group: batch +sub-nav-parent: flinkml +sub-nav-title: ALS --- <!-- Licensed to the Apache Software Foundation (ASF) under one http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/libs/ml/contribution_guide.md ---------------------------------------------------------------------- diff --git a/docs/libs/ml/contribution_guide.md b/docs/libs/ml/contribution_guide.md index d40290b..6376958 100644 --- a/docs/libs/ml/contribution_guide.md +++ b/docs/libs/ml/contribution_guide.md @@ -1,7 +1,11 @@ --- mathjax: include -htmlTitle: FlinkML - How to Contribute -title: <a href="../ml">FlinkML</a> - How to Contribute +title: FlinkML - How to Contribute + +# Sub navigation +sub-nav-group: batch +sub-nav-parent: flinkml +sub-nav-title: How To Contribute --- <!-- Licensed to the Apache Software Foundation (ASF) under one http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/libs/ml/distance_metrics.md ---------------------------------------------------------------------- diff --git a/docs/libs/ml/distance_metrics.md b/docs/libs/ml/distance_metrics.md index e6868c8..1a7364a 100644 --- a/docs/libs/ml/distance_metrics.md +++ b/docs/libs/ml/distance_metrics.md @@ -1,7 +1,11 @@ --- mathjax: include -htmlTitle: FlinkML - Distance Metrics -title: <a href="../ml">FlinkML</a> - Distance Metrics 
+title: FlinkML - Distance Metrics + +# Sub navigation +sub-nav-group: batch +sub-nav-parent: flinkml +sub-nav-title: Distance Metrics --- <!-- Licensed to the Apache Software Foundation (ASF) under one http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/libs/ml/index.md ---------------------------------------------------------------------- diff --git a/docs/libs/ml/index.md b/docs/libs/ml/index.md index dc35c34..973c4f3 100644 --- a/docs/libs/ml/index.md +++ b/docs/libs/ml/index.md @@ -1,5 +1,15 @@ --- title: "FlinkML - Machine Learning for Flink" +# Top navigation +top-nav-group: libs +top-nav-pos: 2 +top-nav-title: Machine Learning +# Sub navigation +sub-nav-group: batch +sub-nav-id: flinkml +sub-nav-pos: 2 +sub-nav-parent: libs +sub-nav-title: Machine Learning --- <!-- Licensed to the Apache Software Foundation (ASF) under one @@ -69,7 +79,7 @@ Next, you have to add the FlinkML dependency to the `pom.xml` of your project. </dependency> {% endhighlight %} -Note that FlinkML is currently not part of the binary distribution. +Note that FlinkML is currently not part of the binary distribution. See linking with it for cluster execution [here]({{site.baseurl}}/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution). Now you can start solving your analysis task. 
http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/libs/ml/min_max_scaler.md ---------------------------------------------------------------------- diff --git a/docs/libs/ml/min_max_scaler.md b/docs/libs/ml/min_max_scaler.md index 0c00dcd..302bf4d 100644 --- a/docs/libs/ml/min_max_scaler.md +++ b/docs/libs/ml/min_max_scaler.md @@ -1,7 +1,11 @@ --- mathjax: include -htmlTitle: FlinkML - MinMax Scaler title: <a href="../ml">FlinkML</a> - MinMax Scaler + +# Sub navigation +sub-nav-group: batch +sub-nav-parent: flinkml +sub-nav-title: MinMax Scaler --- <!-- Licensed to the Apache Software Foundation (ASF) under one http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/libs/ml/multiple_linear_regression.md ---------------------------------------------------------------------- diff --git a/docs/libs/ml/multiple_linear_regression.md b/docs/libs/ml/multiple_linear_regression.md index aaf1fbf..e0085ae 100644 --- a/docs/libs/ml/multiple_linear_regression.md +++ b/docs/libs/ml/multiple_linear_regression.md @@ -1,7 +1,11 @@ --- mathjax: include -htmlTitle: FlinkML - Multiple linear regression -title: <a href="../ml">FlinkML</a> - Multiple linear regression +title: FlinkML - Multiple linear regression + +# Sub navigation +sub-nav-group: batch +sub-nav-parent: flinkml +sub-nav-title: Multiple Linear Regression --- <!-- Licensed to the Apache Software Foundation (ASF) under one http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/libs/ml/optimization.md ---------------------------------------------------------------------- diff --git a/docs/libs/ml/optimization.md b/docs/libs/ml/optimization.md index 110383d..9bcebaa 100644 --- a/docs/libs/ml/optimization.md +++ b/docs/libs/ml/optimization.md @@ -1,7 +1,10 @@ --- mathjax: include -htmlTitle: FlinkML - Optimization -title: <a href="../ml">FlinkML</a> - Optimization +title: FlinkML - Optimization +# Sub navigation +sub-nav-group: batch +sub-nav-parent: flinkml +sub-nav-title: Optimization --- <!-- 
 Licensed to the Apache Software Foundation (ASF) under one

http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/libs/ml/pipelines.md
----------------------------------------------------------------------
diff --git a/docs/libs/ml/pipelines.md b/docs/libs/ml/pipelines.md
index 04df321..429156d 100644
--- a/docs/libs/ml/pipelines.md
+++ b/docs/libs/ml/pipelines.md
@@ -1,7 +1,10 @@
 ---
 mathjax: include
-htmlTitle: FlinkML - Looking under the hood of piplines
-title: <a href="../ml">FlinkML</a> - Looking under the hood of pipelines
+title: FlinkML - Looking under the hood of pipelines
+# Sub navigation
+sub-nav-group: batch
+sub-nav-parent: flinkml
+sub-nav-title: Pipelines
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one

http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/libs/ml/polynomial_features.md
----------------------------------------------------------------------
diff --git a/docs/libs/ml/polynomial_features.md b/docs/libs/ml/polynomial_features.md
index 7226455..27fb1e9 100644
--- a/docs/libs/ml/polynomial_features.md
+++ b/docs/libs/ml/polynomial_features.md
@@ -1,7 +1,10 @@
 ---
 mathjax: include
-htmlTitle: FlinkML - Polynomial Features
-title: <a href="../ml">FlinkML</a> - Polynomial Features
+title: FlinkML - Polynomial Features
+# Sub navigation
+sub-nav-group: batch
+sub-nav-parent: flinkml
+sub-nav-title: Polynomial Features
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one

http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/libs/ml/quickstart.md
----------------------------------------------------------------------
diff --git a/docs/libs/ml/quickstart.md b/docs/libs/ml/quickstart.md
index f5d7451..7b3c710 100644
--- a/docs/libs/ml/quickstart.md
+++ b/docs/libs/ml/quickstart.md
@@ -1,7 +1,10 @@
 ---
 mathjax: include
-htmlTitle: FlinkML - Quickstart Guide
-title: <a href="../ml">FlinkML</a> - Quickstart Guide
+title: FlinkML - Quickstart Guide
+# Sub navigation
+sub-nav-group: batch
+sub-nav-parent: flinkml
+sub-nav-title: Quickstart Guide
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one

http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/libs/ml/standard_scaler.md
----------------------------------------------------------------------
diff --git a/docs/libs/ml/standard_scaler.md b/docs/libs/ml/standard_scaler.md
index dea7e1d..f6d7b62 100644
--- a/docs/libs/ml/standard_scaler.md
+++ b/docs/libs/ml/standard_scaler.md
@@ -1,7 +1,10 @@
 ---
 mathjax: include
-htmlTitle: FlinkML - Standard Scaler
-title: <a href="../ml">FlinkML</a> - Standard Scaler
+title: FlinkML - Standard Scaler
+# Sub navigation
+sub-nav-group: batch
+sub-nav-parent: flinkml
+sub-nav-title: Standard Scaler
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one

http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/libs/ml/svm.md
----------------------------------------------------------------------
diff --git a/docs/libs/ml/svm.md b/docs/libs/ml/svm.md
index c344979..b149d31 100644
--- a/docs/libs/ml/svm.md
+++ b/docs/libs/ml/svm.md
@@ -1,7 +1,10 @@
 ---
 mathjax: include
-htmlTitle: FlinkML - SVM using CoCoA
-title: <a href="../ml">FlinkML</a> - SVM using CoCoA
+title: FlinkML - SVM using CoCoA
+# Sub navigation
+sub-nav-group: batch
+sub-nav-parent: flinkml
+sub-nav-title: SVM (CoCoA)
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one

http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/libs/table.md
----------------------------------------------------------------------
diff --git a/docs/libs/table.md b/docs/libs/table.md
index 1aedce3..74ffd3c 100644
--- a/docs/libs/table.md
+++ b/docs/libs/table.md
@@ -1,6 +1,15 @@
 ---
 title: "Table API - Relational Queries"
 is_beta: true
+# Top navigation
+top-nav-group: libs
+top-nav-pos: 3
+top-nav-title: "Relational: Table"
+# Sub navigation
+sub-nav-group: batch
+sub-nav-parent: libs
+sub-nav-pos: 3
+sub-nav-title: Table
 ---
 <!--
 Licensed to the Apache
Software Foundation (ASF) under one @@ -60,7 +69,7 @@ val result = expr.groupBy('word).select('word, 'count.sum as 'count).toDataSet[W The expression DSL uses Scala symbols to refer to field names and we use code generation to transform expressions to efficient runtime code. Please note that the conversion to and from Tables only works when using Scala case classes or Flink POJOs. Please check out -the [programming guide]({{ site.baseurl }}/apis/programming_guide.html) to learn the requirements for a class to be +the [programming guide]({{ site.baseurl }}/apis/index.html) to learn the requirements for a class to be considered a POJO. This is another example that shows how you @@ -387,4 +396,3 @@ Here, `literal` is a valid Java literal and `field reference` specifies a column column names follow Java identifier syntax. Only the types `LONG` and `STRING` can be casted to `DATE` and vice versa. A `LONG` casted to `DATE` must be a milliseconds timestamp. A `STRING` casted to `DATE` must have the format "`yyyy-MM-dd HH:mm:ss.SSS`", "`yyyy-MM-dd`", "`HH:mm:ss`", or a milliseconds timestamp. By default, all timestamps refer to the UTC timezone beginning from January 1, 1970, 00:00:00 in milliseconds. - http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/page/css/flink.css ---------------------------------------------------------------------- diff --git a/docs/page/css/flink.css b/docs/page/css/flink.css index 3b09e54..bcc398a 100644 --- a/docs/page/css/flink.css +++ b/docs/page/css/flink.css @@ -23,6 +23,7 @@ under the License. /* Padding at top because of the fixed navbar. */ body { padding-top: 70px; + padding-bottom: 50px; } /* Our logo. 
*/ @@ -38,7 +39,10 @@ body { color: black; font-weight: bold; } -.navbar-default .navbar-nav > li > a:hover { +.navbar-default .navbar-nav > li > a:hover, +.dropdown-menu > .active > a, +.dropdown-menu > .active > a:hover { + color: #000000; background: #E7E7E7; } @@ -52,12 +56,14 @@ body { } /*============================================================================= - Navbar at the side of the page + Per page TOC =============================================================================*/ /* Move the side nav a little bit down to align with the main heading */ #markdown-toc { font-size: 90%; + padding-top: 16px; + padding-bottom: 16px; } /* Custom list styling */ @@ -71,17 +77,24 @@ body { /* All element */ #markdown-toc li > a { + color: #000; display: block; padding: 5px 10px; border: 1px solid #E5E5E5; - margin:-1px; + margin: -1px; } #markdown-toc li > a:hover, #markdown-toc li > a:focus { - text-decoration: none; + text-decoration: underline; background-color: #eee; } +@media (min-width: 768px) { + #markdown-toc > li { + width: 400px; + } +} + /* 1st-level elements */ #markdown-toc > li > a { font-weight: bold; @@ -102,30 +115,113 @@ body { } /*============================================================================= + Sub navigation (left side) +=============================================================================*/ + +/* Custom list styling */ +#sub-nav, #sub-nav ul { + list-style: none; + display: block; + position: relative; + padding-left: 0; + margin-bottom: 0; +} + +/* All elements */ +#sub-nav li > a { + display: block; + padding: 5px 10px; + border: 1px solid #e5e5e5; + margin: -1px; + color: #000000; +} + +#sub-nav li > a:hover, +#sub-nav li > a:focus { + text-decoration: underline; + background-color: #e7e7e7; +} + +/* 1st-level elements */ +#sub-nav > li > a { + background: #f8f8f8; + color: black; + font-weight: bold; +} + +/* 2nd-level element */ +#sub-nav > li li > a { + padding-left: 20px; /* A little more indentation*/ + 
background: #fff; +} + +/* >= 3rd-level element */ +#sub-nav > li li li { + display: none; /* hide */ +} + +#sub-nav a.active { + background: #e7e7e7; +} + +#sub-nav a.active:before { + content: '» '; +} + +/*============================================================================= Text =============================================================================*/ -h2, h3 { +h2 { + border-bottom: 1px solid #e5e5e5; +} + +h2, h3, h4, h5, h6, h7 { padding-top: 1em; - padding-bottom: 5px; - border-bottom: 1px solid #E5E5E5; } code { - background: #f5f5f5; - padding: 0; - color: #333333; + color: #000000; + background: #ffffff; + padding: 1px; font-family: "Menlo", "Lucida Console", monospace; } pre { - font-size: 85%; + background: #f7f7f7; + border: none; + font-size: 12px; + font-family: "Menlo", "Lucida Console", monospace; } img.center { display: block; margin-left: auto; - margin-right: auto; + margin-right: auto; +} + +a.top { + color: black; + text-decoration: none +} + +.text { + font-size: 14px; } +.beta { + font-weight: normal; + color: #333; +} + +table { + width: 100%; +} +th { + text-align: center; +} + +td { + padding: 5px; +} http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/quickstart/java_api_quickstart.md ---------------------------------------------------------------------- diff --git a/docs/quickstart/java_api_quickstart.md b/docs/quickstart/java_api_quickstart.md index 4d94396..ab1614d 100644 --- a/docs/quickstart/java_api_quickstart.md +++ b/docs/quickstart/java_api_quickstart.md @@ -1,5 +1,9 @@ --- title: "Quickstart: Java API" +# Top navigation +top-nav-group: quickstart +top-nav-pos: 3 +top-nav-title: Java API --- <!-- Licensed to the Apache Software Foundation (ASF) under one http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/quickstart/run_example_quickstart.md ---------------------------------------------------------------------- diff --git a/docs/quickstart/run_example_quickstart.md 
b/docs/quickstart/run_example_quickstart.md
index 2d399a8..3fd2801 100644
--- a/docs/quickstart/run_example_quickstart.md
+++ b/docs/quickstart/run_example_quickstart.md
@@ -1,5 +1,9 @@
 ---
 title: "Quick Start: Run K-Means Example"
+# Top navigation
+top-nav-group: quickstart
+top-nav-pos: 2
+top-nav-title: Run Example
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one

http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/quickstart/scala_api_quickstart.md
----------------------------------------------------------------------
diff --git a/docs/quickstart/scala_api_quickstart.md b/docs/quickstart/scala_api_quickstart.md
index 6155e41..28006c6 100644
--- a/docs/quickstart/scala_api_quickstart.md
+++ b/docs/quickstart/scala_api_quickstart.md
@@ -1,5 +1,9 @@
 ---
 title: "Quickstart: Scala API"
+# Top navigation
+top-nav-group: quickstart
+top-nav-pos: 4
+top-nav-title: Scala API
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one

http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/quickstart/setup_quickstart.md
----------------------------------------------------------------------
diff --git a/docs/quickstart/setup_quickstart.md b/docs/quickstart/setup_quickstart.md
index 6fcb729..a9bc218 100644
--- a/docs/quickstart/setup_quickstart.md
+++ b/docs/quickstart/setup_quickstart.md
@@ -1,5 +1,9 @@
 ---
 title: "Quickstart: Setup"
+# Top navigation
+top-nav-group: quickstart
+top-nav-pos: 1
+top-nav-title: Setup
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one

http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/setup/building.md
----------------------------------------------------------------------
diff --git a/docs/setup/building.md b/docs/setup/building.md
index 01600f9..8c1549e 100644
--- a/docs/setup/building.md
+++ b/docs/setup/building.md
@@ -1,5 +1,8 @@
 ---
-title: "Build Flink"
+title: Building Flink
+top-nav-group: setup
+top-nav-pos: 1
+top-nav-title: Build Flink
 ---
 <!--
 Licensed
to the Apache Software Foundation (ASF) under one @@ -20,8 +23,16 @@ specific language governing permissions and limitations under the License. --> -In order to build Flink, you need the source code. Either download the source of a release or clone the git repository. In addition to that, you need Maven 3 and a JDK (Java Development Kit). -Flink requires at least Java 7 to build. We recommend using Java 8. +This page covers how to build Flink {{ site.version }} from sources. + +* This will be replaced by the TOC +{:toc} + +## Build Flink + +In order to build Flink you need the source code. Either [download the source of a release]({{ site.download_url }}) or [clone the git repository]({{ site.github_url }}). + +In addition you need **Maven 3** and a **JDK** (Java Development Kit). Flink requires **at least Java 7** to build. We recommend using Java 8. To clone from git, enter: @@ -32,87 +43,85 @@ git clone {{ site.github_url }} The simplest way of building Flink is by running: ~~~bash -cd flink mvn clean install -DskipTests ~~~ -This instructs Maven (`mvn`) to first remove all existing builds (`clean`) and then create a new Flink binary (`install`). The `-DskipTests` command prevents Maven from executing the unit tests. +This instructs [Maven](http://maven.apache.org) (`mvn`) to first remove all existing builds (`clean`) and then create a new Flink binary (`install`). The `-DskipTests` command prevents Maven from executing the tests. -[Read more](http://maven.apache.org/) about Apache Maven. +The default build includes the YARN Client for Hadoop 2. +{% top %} +## Hadoop Versions -## Build Flink for a specific Hadoop Version +{% info %} Most users do not need to do this manually. The [download page]({{ site.download_url }}) contains binary packages for common Hadoop versions. -This section covers building Flink for a specific Hadoop version. Most users do not need to do this manually. The download page of Flink contains binary packages for common setups. 
+Flink has dependencies to HDFS and YARN which are both dependencies from [Apache Hadoop](http://hadoop.apache.org). There exist many different versions of Hadoop (from both the upstream project and the different Hadoop distributions). If you are using a wrong combination of versions, exceptions can occur. -The problem is that Flink uses HDFS and YARN which are both dependencies from Apache Hadoop. There exist many different versions of Hadoop (from both the upstream project and the different Hadoop distributions). If a user is using a wrong combination of versions, exceptions like this one occur: +There are two main versions of Hadoop that we need to differentiate: +- **Hadoop 1**, with all versions starting with zero or one, like *0.20*, *0.23* or *1.2.1*. +- **Hadoop 2**, with all versions starting with 2, like *2.6.0*. -~~~bash -ERROR: Job execution failed. - org.apache.flink.runtime.client.JobExecutionException: Cannot initialize task 'TextInputFormat(/my/path)': - java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: - Protocol message contained an invalid tag (zero).; Host Details : -~~~ +The main differentiation between Hadoop 1 and Hadoop 2 is the availability of [Hadoop YARN](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html), Hadoop's cluster resource manager. -There are two main versions of Hadoop that we need to differentiate: -- Hadoop 1, with all versions starting with zero or one, like 0.20, 0.23 or 1.2.1. -- Hadoop 2, with all versions starting with 2, like 2.2.0. -The main differentiation between Hadoop 1 and Hadoop 2 is the availability of Hadoop YARN (Hadoops cluster resource manager). +**By default, Flink is using the Hadoop 2 dependencies**. -By default, Flink is using the Hadoop 2 dependencies. 
+### Hadoop 1 -**To build Flink for Hadoop 1**, issue the following command: +To build Flink for Hadoop 1, issue the following command: ~~~bash mvn clean install -DskipTests -Dhadoop.profile=1 ~~~ -The `-Dhadoop.profile=1` flag instructs Maven to build Flink for Hadoop 1. Note that the features included in Flink change when using a different Hadoop profile. In particular the support for YARN and the build-in HBase support are not available in Hadoop 1 builds. +The `-Dhadoop.profile=1` flag instructs Maven to build Flink for Hadoop 1. Note that the features included in Flink change when using a different Hadoop profile. In particular, there is no support for YARN and HBase in Hadoop 1 builds. +### Hadoop 2 -You can also **specify a specific Hadoop version to build against**: +You can also specify a specific Hadoop version to build against: ~~~bash mvn clean install -DskipTests -Dhadoop.version=2.4.1 ~~~ +#### Before Hadoop 2.2.0 -**To build Flink against a vendor specific Hadoop version**, issue the following command: +Maven will automatically build Flink with its YARN client. The 2.2.0 Hadoop release is *not* supported by Flink's YARN client. Therefore, you need to exclude the YARN client with the following string: `-P!include-yarn`. + +So if you are building Flink for Hadoop `2.0.0-alpha`, use the following command: ~~~bash -mvn clean install -DskipTests -Pvendor-repos -Dhadoop.version=2.2.0-cdh5.0.0-beta-2 +mvn clean install -P!include-yarn -Dhadoop.version=2.0.0-alpha ~~~ -The `-Pvendor-repos` activates a Maven [build profile](http://maven.apache.org/guides/introduction/introduction-to-profiles.html) that includes the repositories of popular Hadoop vendors such as Cloudera, Hortonworks, or MapR. - -**Build Flink for `hadoop2` versions before 2.2.0** +### Vendor-specific Versions -Maven will automatically build Flink with its YARN client. But there were some changes in Hadoop versions before the 2.2.0 Hadoop release that are not supported by Flink's YARN client. 
Therefore, you can disable building the YARN client with the following string: `-P!include-yarn`. - -So if you are building Flink for Hadoop `2.0.0-alpha`, use the following command: +To build Flink against a vendor specific Hadoop version, issue the following command: ~~~bash --P!include-yarn -Dhadoop.version=2.0.0-alpha +mvn clean install -DskipTests -Pvendor-repos -Dhadoop.version=2.2.0-cdh5.0.0-beta-2 ~~~ -## Build Flink for a specific Scala Version +The `-Pvendor-repos` activates a Maven [build profile](http://maven.apache.org/guides/introduction/introduction-to-profiles.html) that includes the repositories of popular Hadoop vendors such as Cloudera, Hortonworks, or MapR. + +{% top %} -**Note:** Users that purely use the Java APIs and libraries can ignore this section. +## Scala Versions -Flink has APIs, libraries, and runtime modules written in [Scala](http://scala-lang.org). Users of the Scala API and libraries may have to match the Scala version of Flink with the Scala version -of their projects (because Scala is not strictly backwards compatible). +{% info %} Users that purely use the Java APIs and libraries can *ignore* this section. -By default, Flink is built with the Scala *2.10*. To build Flink with Scala *2.11*, you need to change the default Scala *binary version* with a build script: +Flink has APIs, libraries, and runtime modules written in [Scala](http://scala-lang.org). Users of the Scala API and libraries may have to match the Scala version of Flink with the Scala version of their projects (because Scala is not strictly backwards compatible). + +**By default, Flink is built with the Scala 2.10**. 
To build Flink with Scala *2.11*, you can change the default Scala *binary version* with the following script: ~~~bash # Switch Scala binary version between 2.10 and 2.11 tools/change-scala-version.sh 2.11 -# Build and install locally +# Build with Scala version 2.11 mvn clean install -DskipTests ~~~ -To build against custom Scala versions, you need to switch to the appropriate binary version and supply the *language version* as additional build property. For example, to buid against Scala 2.11.4, you have to execute: +To build against custom Scala versions, you need to switch to the appropriate binary version and supply the *language version* as an additional build property. For example, to build against Scala 2.11.4, you have to execute: ~~~bash # Switch Scala binary version to 2.11 @@ -125,12 +134,11 @@ Flink is developed against Scala *2.10* and tested additionally against Scala *2 Newer versions may be compatible, depending on breaking changes in the language features used by Flink, and the availability of Flink's dependencies in those Scala versions. The dependencies written in Scala include for example *Kafka*, *Akka*, *Scalatest*, and *scopt*. +{% top %} -## Building in encrypted filesystems +## Encrypted File Systems -If your home directory is encrypted you might encounter a `java.io.IOException: File -name too long` exception. Some encrypted file systems, like encfs used by Ubuntu, do not allow -long filenames, which is the cause of this error. +If your home directory is encrypted you might encounter a `java.io.IOException: File name too long` exception. Some encrypted file systems, like encfs used by Ubuntu, do not allow long filenames, which is the cause of this error. The workaround is to add: @@ -141,18 +149,16 @@ The workaround is to add: </args> ~~~ -in the compiler configuration of the `pom.xml` file of the module causing the error. 
For example, -if the error appears in the `flink-yarn` module, the above code should -be added under the `<configuration>` tag of `scala-maven-plugin`. See -[this issue](https://issues.apache.org/jira/browse/FLINK-2003) for more information. +in the compiler configuration of the `pom.xml` file of the module causing the error. For example, if the error appears in the `flink-yarn` module, the above code should be added under the `<configuration>` tag of `scala-maven-plugin`. See [this issue](https://issues.apache.org/jira/browse/FLINK-2003) for more information. + +{% top %} -## Background +## Internals -The builds with Maven are controlled by [properties](http://maven.apache.org/pom.html#Properties) and <a href="http://maven.apache.org/guides/introduction/introduction-to-profiles.html">build profiles</a>. -There are two profiles, one for hadoop1 and one for hadoop2. When the hadoop2 profile is enabled (default), the system will also build the YARN client. +The builds with Maven are controlled by [properties](http://maven.apache.org/pom.html#Properties) and [build profiles](http://maven.apache.org/guides/introduction/introduction-to-profiles.html). There are two profiles, one for `hadoop1` and one for `hadoop2`. When the `hadoop2` profile is enabled (default), the system will also build the YARN client. -To enable the hadoop1 profile, set `-Dhadoop.profile=1` when building. -Depending on the profile, there are two Hadoop versions, set via properties. For "hadoop1", we use 1.2.1 by default, for "hadoop2" it is 2.2.0. +To enable the `hadoop1` profile, set `-Dhadoop.profile=1` when building. Depending on the profile, there are two Hadoop versions, set via properties. For `hadoop1`, we use 1.2.1 by default, for `hadoop2` it is 2.3.0. You can change these versions with the `hadoop-two.version` (or `hadoop-one.version`) property. For example `-Dhadoop-two.version=2.4.0`. 
+{% top %} http://git-wip-us.apache.org/repos/asf/flink/blob/ad267a4b/docs/setup/cluster_setup.md ---------------------------------------------------------------------- diff --git a/docs/setup/cluster_setup.md b/docs/setup/cluster_setup.md index 3ee6630..8ff96d7 100644 --- a/docs/setup/cluster_setup.md +++ b/docs/setup/cluster_setup.md @@ -1,5 +1,8 @@ --- title: "Cluster Setup" +top-nav-group: deployment +top-nav-title: Cluster (Standalone) +top-nav-pos: 2 --- <!-- Licensed to the Apache Software Foundation (ASF) under one @@ -20,177 +23,53 @@ specific language governing permissions and limitations under the License. --> -This documentation is intended to provide instructions on how to run -Flink in a fully distributed fashion on a static (but possibly -heterogeneous) cluster. - -This involves two steps. First, installing and configuring Flink and -second installing and configuring the [Hadoop Distributed -Filesystem](http://hadoop.apache.org/) (HDFS). +This page provides instructions on how to run Flink in a *fully distributed fashion* on a *static* (but possibly heterogeneous) cluster. * This will be replaced by the TOC {:toc} -## Preparing the Cluster +## Requirements ### Software Requirements -Flink runs on all *UNIX-like environments*, e.g. **Linux**, **Mac OS X**, -and **Cygwin** (for Windows) and expects the cluster to consist of **one master -node** and **one or more worker nodes**. Before you start to setup the system, -make sure you have the following software installed **on each node**: +Flink runs on all *UNIX-like environments*, e.g. **Linux**, **Mac OS X**, and **Cygwin** (for Windows) and expects the cluster to consist of **one master node** and **one or more worker nodes**. 
Before you start to setup the system, make sure you have the following software installed **on each node**: - **Java 1.7.x** or higher, - **ssh** (sshd must be running to use the Flink scripts that manage remote components) -If your cluster does not fulfill these software requirements you will need to -install/upgrade it. - -For example, on Ubuntu Linux, type in the following commands to install Java and -ssh: - -~~~bash -sudo apt-get install ssh -sudo apt-get install openjdk-7-jre -~~~ - -You can check the correct installation of Java by issuing the following command: - -~~~bash -java -version -~~~ - -The command should output something comparable to the following on every node of -your cluster (depending on your Java version, there may be small differences): - -~~~bash -java version "1.7.0_55" -Java(TM) SE Runtime Environment (build 1.7.0_55-b13) -Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode) -~~~ - -To make sure the ssh daemon is running properly, you can use the command - -~~~bash -ps aux | grep sshd -~~~ - -Something comparable to the following line should appear in the output -of the command on every host of your cluster: - -~~~bash -root 894 0.0 0.0 49260 320 ? Ss Jan09 0:13 /usr/sbin/sshd -~~~ - -### Configuring Remote Access with ssh - -In order to start/stop the remote processes, the master node requires access via -ssh to the worker nodes. It is most convenient to use ssh's public key -authentication for this. To setup public key authentication, log on to the -master as the user who will later execute all the Flink components. **The -same user (i.e. a user with the same user name) must also exist on all worker -nodes**. For the remainder of this instruction we will refer to this user as -*flink*. Using the super user *root* is highly discouraged for security -reasons. - -Once you logged in to the master node as the desired user, you must generate a -new public/private key pair. 
The following command will create a new -public/private key pair into the *.ssh* directory inside the home directory of -the user *flink*. See the ssh-keygen man page for more details. Note that -the private key is not protected by a passphrase. - -~~~bash -ssh-keygen -b 2048 -P '' -f ~/.ssh/id_rsa -~~~ - -Next, copy/append the content of the file *.ssh/id_rsa.pub* to your -authorized_keys file. The content of the authorized_keys file defines which -public keys are considered trustworthy during the public key authentication -process. On most systems the appropriate command is - -~~~bash -cat .ssh/id_rsa.pub >> .ssh/authorized_keys -~~~ - -On some Linux systems, the authorized keys file may also be expected by the ssh -daemon under *.ssh/authorized_keys2*. In either case, you should make sure the -file only contains those public keys which you consider trustworthy for each -node of cluster. +If your cluster does not fulfill these software requirements you will need to install/upgrade it. -Finally, the authorized keys file must be copied to every worker node of your -cluster. You can do this by repeatedly typing in - -~~~bash -scp .ssh/authorized_keys <worker>:~/.ssh/ -~~~ - -and replacing *\<worker\>* with the host name of the respective worker node. -After having finished the copy process, you should be able to log on to each -worker node from your master node via ssh without a password. - -### Setting JAVA_HOME on each Node - -Flink requires the `JAVA_HOME` environment variable to be set on the -master and all worker nodes and point to the directory of your Java -installation. - -You can set this variable in `conf/flink-conf.yaml` via the -`env.java.home` key. - -Alternatively, add the following line to your shell profile. 
If you use the -*bash* shell (probably the most common shell), the shell profile is located in -*\~/.bashrc*: - -~~~bash -export JAVA_HOME=/path/to/java_home/ -~~~ +{% top %} -If your ssh daemon supports user environments, you can also add `JAVA_HOME` to -*.\~/.ssh/environment*. As super user *root* you can enable ssh user -environments with the following commands: +### `JAVA_HOME` Configuration -~~~bash -echo "PermitUserEnvironment yes" >> /etc/ssh/sshd_config -/etc/init.d/ssh restart +Flink requires the `JAVA_HOME` environment variable to be set on the master and all worker nodes and point to the directory of your Java installation. -# on some system you might need to replace the above line with -/etc/init.d/sshd restart -~~~ +You can set this variable in `conf/flink-conf.yaml` via the `env.java.home` key. +{% top %} ## Flink Setup -Go to the [downloads page]({{site.baseurl}}/downloads.html) and get the ready to run -package. Make sure to pick the Flink package **matching your Hadoop -version**. +Go to the [downloads page]({{ site.download_url }}) and get the ready to run package. Make sure to pick the Flink package **matching your Hadoop version**. If you don't plan to use Hadoop, pick any version. -After downloading the latest release, copy the archive to your master node and -extract it: +After downloading the latest release, copy the archive to your master node and extract it: ~~~bash tar xzf flink-*.tgz cd flink-* ~~~ -### Configuring the Cluster +### Configuring Flink -After having extracted the system files, you need to configure Flink for -the cluster by editing *conf/flink-conf.yaml*. +After having extracted the system files, you need to configure Flink for the cluster by editing *conf/flink-conf.yaml*. -Set the `jobmanager.rpc.address` key to point to your master node. Furthermode -define the maximum amount of main memory the JVM is allowed to allocate on each -node by setting the `jobmanager.heap.mb` and `taskmanager.heap.mb` keys. 
+Set the `jobmanager.rpc.address` key to point to your master node. Furthermode define the maximum amount of main memory the JVM is allowed to allocate on each node by setting the `jobmanager.heap.mb` and `taskmanager.heap.mb` keys. -The value is given in MB. If some worker nodes have more main memory which you -want to allocate to the Flink system you can overwrite the default value -by setting an environment variable `FLINK_TM_HEAP` on the respective -node. +The value is given in MB. If some worker nodes have more main memory which you want to allocate to the Flink system you can overwrite the default value by setting an environment variable `FLINK_TM_HEAP` on the respective node. -Finally you must provide a list of all nodes in your cluster which shall be used -as worker nodes. Therefore, similar to the HDFS configuration, edit the file -*conf/slaves* and enter the IP/host name of each worker node. Each worker node -will later run a TaskManager. +Finally you must provide a list of all nodes in your cluster which shall be used as worker nodes. Therefore, similar to the HDFS configuration, edit the file *conf/slaves* and enter the IP/host name of each worker node. Each worker node will later run a TaskManager. Each entry must be separated by a new line, as in the following example: @@ -203,12 +82,9 @@ Each entry must be separated by a new line, as in the following example: 192.168.0.150 ~~~ -The Flink directory must be available on every worker under the same -path. Similarly as for HDFS, you can use a shared NSF directory, or copy the -entire Flink directory to every worker node. +The Flink directory must be available on every worker under the same path. You can use a shared NSF directory, or copy the entire Flink directory to every worker node. -Please see the [configuration page](config.html) for details and additional -configuration options. +Please see the [configuration page](config.html) for details and additional configuration options. 
 In particular,
@@ -219,14 +95,11 @@ In particular,
 are very important configuration values.
+{% top %}
 ### Starting Flink
-The following script starts a JobManager on the local node and connects via
-SSH to all worker nodes listed in the *slaves* file to start the
-TaskManager on each node. Now your Flink system is up and
-running. The JobManager running on the local node will now accept jobs
-at the configured RPC port.
+The following script starts a JobManager on the local node and connects via SSH to all worker nodes listed in the *slaves* file to start the TaskManager on each node. Now your Flink system is up and running. The JobManager running on the local node will now accept jobs at the configured RPC port.
 Assuming that you are on the master node and inside the Flink directory:
@@ -236,136 +109,24 @@ bin/start-cluster.sh
 To stop Flink, there is also a `stop-cluster.sh` script.
-### Optional: Adding JobManager/TaskManager instances to a cluster
+{% top %}
-You can add both TaskManager or JobManager instances to your running cluster with the `bin/taskmanager.sh` and `bin/jobmanager.sh` scripts.
+### Adding JobManager/TaskManager Instances to a Cluster
-#### Adding a TaskManager
-<pre>
-bin/taskmanager.sh start|stop|stop-all
-</pre>
+You can add both JobManager and TaskManager instances to your running cluster with the `bin/taskmanager.sh` and `bin/jobmanager.sh` scripts.
 #### Adding a JobManager
-<pre>
-bin/jobmanager.sh (start (local|cluster))|stop|stop-all
-</pre>
-
-Make sure to call these scripts on the hosts, on which you want to start/stop the respective instance.
-
-
-## Optional: Hadoop Distributed Filesystem (HDFS) Setup
-
-**NOTE** Flink does not require HDFS to run; HDFS is simply a typical choice of a distributed data
-store to read data from (in parallel) and write results to.
-If HDFS is already available on the cluster, or Flink is used purely with different storage
-techniques (e.g., Apache Kafka, JDBC, Rabbit MQ, or other storage or message queues), this
-setup step is not needed.
-
-
-The following instructions are a general overview of usual required settings. Please consult one of the
-many installation guides available online for more detailed instructions.
-
-__Note that the following instructions are based on Hadoop 1.2 and might differ
-for Hadoop 2.__
-
-### Downloading, Installing, and Configuring HDFS
-
-Similar to the Flink system HDFS runs in a distributed fashion. HDFS
-consists of a **NameNode** which manages the distributed file system's meta
-data. The actual data is stored by one or more **DataNodes**. For the remainder
-of this instruction we assume the HDFS's NameNode component runs on the master
-node while all the worker nodes run an HDFS DataNode.
-
-To start, log on to your master node and download Hadoop (which includes HDFS)
-from the Apache [Hadoop Releases](http://hadoop.apache.org/releases.html) page.
-
-Next, extract the Hadoop archive.
-
-After having extracted the Hadoop archive, change into the Hadoop directory and
-edit the Hadoop environment configuration file:
-
-~~~bash
-cd hadoop-*
-vi conf/hadoop-env.sh
-~~~
-
-Uncomment and modify the following line in the file according to the path of
-your Java installation.
-
-~~~
-export JAVA_HOME=/path/to/java_home/
-~~~
-
-Save the changes and open the HDFS configuration file *conf/hdfs-site.xml*. HDFS
-offers multiple configuration parameters which affect the behavior of the
-distributed file system in various ways. The following excerpt shows a minimal
-configuration which is required to make HDFS work. More information on how to
-configure HDFS can be found in the [HDFS User
-Guide](http://hadoop.apache.org/docs/r1.2.1/hdfs_user_guide.html) guide.
-
-~~~xml
-<configuration>
-  <property>
-    <name>fs.default.name</name>
-    <value>hdfs://MASTER:50040/</value>
-  </property>
-  <property>
-    <name>dfs.data.dir</name>
-    <value>DATAPATH</value>
-  </property>
-</configuration>
-~~~
-
-Replace *MASTER* with the IP/host name of your master node which runs the
-*NameNode*. *DATAPATH* must be replaced with path to the directory in which the
-actual HDFS data shall be stored on each worker node. Make sure that the
-*flink* user has sufficient permissions to read and write in that
-directory.
-
-After having saved the HDFS configuration file, open the file *conf/slaves* and
-enter the IP/host name of those worker nodes which shall act as *DataNode*s.
-Each entry must be separated by a line break.
-
-~~~
-<worker 1>
-<worker 2>
-.
-.
-.
-<worker n>
-~~~
-
-Initialize the HDFS by typing in the following command. Note that the
-command will **delete all data** which has been previously stored in the
-HDFS. However, since we have just installed a fresh HDFS, it should be
-safe to answer the confirmation with *yes*.
 ~~~bash
-bin/hadoop namenode -format
+bin/jobmanager.sh (start cluster)|stop|stop-all
 ~~~
-Finally, we need to make sure that the Hadoop directory is available to
-all worker nodes which are intended to act as DataNodes and that all nodes
-**find the directory under the same path**. We recommend to use a shared network
-directory (e.g. an NFS share) for that. Alternatively, one can copy the
-directory to all nodes (with the disadvantage that all configuration and
-code updates need to be synced to all nodes).
-
-### Starting HDFS
-
-To start the HDFS log on to the master and type in the following
-commands
+#### Adding a TaskManager
 ~~~bash
-cd hadoop-*
-bin/start-dfs.sh
+bin/taskmanager.sh start|stop|stop-all
 ~~~
-If your HDFS setup is correct, you should be able to open the HDFS
-status website at *http://MASTER:50070*. In a matter of a seconds,
-all DataNodes should appear as live nodes.
For troubleshooting we would
-like to point you to the [Hadoop Quick
-Start](http://wiki.apache.org/hadoop/QuickStart)
-guide.
-
+Make sure to call these scripts on the hosts, on which you want to start/stop the respective instance.
+{% top %}