[
https://issues.apache.org/jira/browse/HAMA-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15501973#comment-15501973
]
Edward J. Yoon commented on HAMA-983:
-------------------------------------
https://cloud.google.com/dataflow/examples/wordcount-example
This page is well-described about beam concept. The flow is like below:
{code}
Creating the Pipeline
Applying transforms to the Pipeline
Reading input (in this example: reading text files)
Applying ParDo transforms
Applying SDK-provided transforms (in this example: Count)
Writing output (in this example: writing to Google Cloud Storage)
Running the Pipeline
{code}
Once we created Hama pipeline we should able to run the program like below:
{code}
public static void main(String[] args) {
// Create a pipeline parameterized by commandline flags.
Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(arg));
p.apply(TextIO.Read.from("gs://...")) // Read input.
.apply(new CountWords()) // Do some processing.
.apply(TextIO.Write.to("gs://...")); // Write output.
// Run the pipeline.
p.run();
}
{code}
For I/O operations, you can refer this
https://github.com/apache/incubator-beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/io/hadoop/HadoopIO.java
(instead of org.apache.hadoop.mapreduce.lib.input.FileInputFormat you should
use
https://github.com/apache/hama/blob/master/core/src/main/java/org/apache/hama/bsp/FileInputFormat.java)
{quote}BSP for dataflow could be similar to SuperstepBSP{quote}
I think so. GroupByKey seems a built-in processor that groups records by key.
We should implement it using a superstep.
> Hama runner for DataFlow
> ------------------------
>
> Key: HAMA-983
> URL: https://issues.apache.org/jira/browse/HAMA-983
> Project: Hama
> Issue Type: Bug
> Reporter: Edward J. Yoon
> Labels: gsoc2016
>
> As you already know, Apache Beam provides unified programming model for both
> batch and streaming inputs.
> The APIs are generally associated with data filtering and transforming. So
> we'll need to implement some data processing runner like
> https://github.com/dapurv5/MapReduce-BSP-Adapter/blob/master/src/main/java/org/apache/hama/mapreduce/examples/WordCount.java
> Also, implementing similarity join can be funny. According to
> http://www.ruizhang.info/publications/TPDS2015-Heads_Join.pdf, Apache Hama is
> clearly winner among Apache Hadoop and Apache Spark.
> Since it consists of transformation, aggregation, and partition computations,
> I think it's possible to implement using Apache Beam APIs.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)