Season of Docs 2020 Proposal for Apache Beam (Sruthi Sree Kumar)

Season of Docs Thu, 09 Jul 2020 19:48:32 -0700

Below is a project proposal from a technical writer (bcc'd) who wants to
work with your organization on a Season of Docs project. Please assess the
proposal and ensure that you have a mentor to work with the technical
writer.

If you want to accept the proposal, please submit the technical writing
project to the Season of Docs program administrators. The project selection
form is at this link: <https://bit.ly/gsod-tw-projectselection>. The form
is also available in the guide for organization administrators
<https://developers.google.com/season-of-docs/docs/admin-guide#tech-writer-application-phase>.

The deadline for project selections is July 31, 2020 at 20:00 UTC. For
other program deadlines, please see the full timeline
<https://developers.google.com/season-of-docs/docs/timeline> on the Season
of Docs website.

If you have any questions about the program, please email the Season of
Docs team at [email protected].

Best,
The Google Season of Docs team

Title: Update of the runner comparison page / capability matrix
Project length: Standard length (3 months)
Writer information
*Name:* Sruthi Sree Kumar
*Email:* [email protected]
*Résumé/CV:*
https://drive.google.com/file/d/12RtM7Obz2Fog-AcIJAX1kLCKPPytY2Hq/view?usp=sharing
*Sample:* https://medium.com/big-data-processing
*Additional information:* I, Sruthi Sree Kumar, am a dual-degree master's
student in Cloud Computing and Services. I am currently writing my master's
thesis on the Apache Flink state management API with the Continuous Deep
Analytics research group at Research Institute of Sweden (RISE). Before my
master's, I worked for 4 years as a backend developer. I would like to
participate in Season of Docs because I have found projects that relate to
my current work, areas of interest, and future career path. I am an active
user of open source projects such as Apache Beam and Apache Flink. I also
started a technical blog earlier this year that focuses on algorithms and
concepts in distributed systems and distributed processing systems.
Project Description
Apache Beam is a unified platform for defining both batch and stream
processing pipelines. Apache Beam lets you define a model to represent and
transform datasets independently of any specific data processing platform.
Once defined, a pipeline can run on any of the supported run-time
frameworks (runners), which include Apache Apex, Apache Flink, Apache
Spark, and Google Cloud Dataflow. Apache Beam also comes with different
SDKs that let you write your pipeline in programming languages such as
Java, Python, and Go.

I am submitting my application for GSoD on “Update of the runner
comparison page/capability matrix”. Because Apache Beam supports multiple
runners and SDKs, a new user can find it confusing to choose between them.
The current documentation gives only a very brief overview of each runner.
My idea is to add more comprehensive details to each runner's
documentation page. I would also like to update the description of the
word count example project with a detailed explanation. For this, I plan
to run every word count example locally on my machine, find out whether
any steps are missing, and add more explanation of the process. I have
also noticed that the runner documentation does not follow a consistent
pattern (a few pages have an overview section, while others start with how
to use the runner, the prerequisites, or some other title). I will update
all of them to follow a single, simple pattern.

I plan to add a new page that describes each runner and gives a
descriptive narration of each of them [BEAM-3220]. From this page, users
can navigate to the detailed description page of each runner and to the
capability matrix. I also plan to add a descriptive comparison of the
runners here. I am currently using Beam NEXMark to benchmark Flink runners
for my master's thesis. As I am familiar with NEXMark benchmarking, I
would like to include the benchmarking results of each runner in both
batch and streaming mode here (BEAM-2944). I would also update the NEXMark
documentation if I find that any parameters or configuration options are
missing or have been removed. When I first used the Flink runner, I was
stuck initially because one of the parameters was missing from the
documentation [
https://lists.apache.org/thread.html/re71e8298e0c13180a4ab0ac6a65e808e1d82ce85e955778cf1089553%40%3Cuser.beam.apache.org%3E].
Now that I am more familiar with the NEXMark code base, it will be easier
for me to benchmark the runners and add the metrics. On the same page, I
would like to include a brief summary of the production readiness of each
runner.

In the current documentation, the support for classic/portable runners is
described on each runner's description page. I think it would be better to
bring this information together in one place, either in the capability
matrix or on the newly added description page. Also, portability support
is currently tracked in a separate Google Sheet, which I would like to
merge into the capability matrix
(https://docs.google.com/spreadsheets/d/1KDa_FGn1ShjomGd-UUDOhuh2q73de2tPz6BqHpzqvNI/edit#gid=0).
As part of this task, I plan to include all the major/minor corrections
that are mentioned in BEAM-2888.

I consider GSoD an opportunity to step into open source contribution. I
will continue to contribute to open source projects, especially Beam, and
would like to remain an active community member. As Apache Beam has an
active community that continuously develops new features, I think there is
always scope to improve the documentation and keep it up to date. I would
also like to contribute to the development work. With sound knowledge of
Beam, I can also help the user community, just as the community helped me
when I started with Beam.

I believe that I am the right person for this project because:

1. I am a distributed systems enthusiast who is trying to understand the
internals of data processing systems.
2. I have experience in working with Apache Beam and Apache Flink as a user.
3. I already understand the Apache Beam and Apache Flink code bases as a
developer.
4. I have done a project to compare different Beam runners.
5. I have experience in writing technical blogs to explain concepts of big
data processing and distributed systems.
6. Currently, I am working on my master's thesis to improve the performance
of the Apache Flink state backend, for which I am using the Apache Beam
NEXMark implementation for benchmarking. I have also contributed to
updating the Apache Beam documentation.
7. In my 4 years of work experience as a software developer, I have
written multiple technical design documents, product documentation, and
README files (which I do not have access to right now).
8. I write documentation in such a way that anyone without previous
knowledge will understand it at first glance.
