[1/2] spark git commit: [MINOR][DOC] Fix some typos and grammar issues

gurwls223 Thu, 05 Apr 2018 22:38:38 -0700

Repository: spark
Updated Branches:
  refs/heads/master 249007e37 -> 6ade5cbb4



http://git-wip-us.apache.org/repos/asf/spark/blob/6ade5cbb/docs/structured-streaming-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md
index 9a83f15..602a4c7 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -8,7 +8,7 @@ title: Structured Streaming Programming Guide
 {:toc}
 
 # Overview
-Structured Streaming is a scalable and fault-tolerant stream processing engine 
built on the Spark SQL engine. You can express your streaming computation the 
same way you would express a batch computation on static data. The Spark SQL 
engine will take care of running it incrementally and continuously and updating 
the final result as streaming data continues to arrive. You can use the 
[Dataset/DataFrame API](sql-programming-guide.html) in Scala, Java, Python or R 
to express streaming aggregations, event-time windows, stream-to-batch joins, 
etc. The computation is executed on the same optimized Spark SQL engine. 
Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees 
through checkpointing and Write Ahead Logs. In short, *Structured Streaming 
provides fast, scalable, fault-tolerant, end-to-end exactly-once stream 
processing without the user having to reason about streaming.*
+Structured Streaming is a scalable and fault-tolerant stream processing engine 
built on the Spark SQL engine. You can express your streaming computation the 
same way you would express a batch computation on static data. The Spark SQL 
engine will take care of running it incrementally and continuously and updating 
the final result as streaming data continues to arrive. You can use the 
[Dataset/DataFrame API](sql-programming-guide.html) in Scala, Java, Python or R 
to express streaming aggregations, event-time windows, stream-to-batch joins, 
etc. The computation is executed on the same optimized Spark SQL engine. 
Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees 
through checkpointing and Write-Ahead Logs. In short, *Structured Streaming 
provides fast, scalable, fault-tolerant, end-to-end exactly-once stream 
processing without the user having to reason about streaming.*
 
 Internally, by default, Structured Streaming queries are processed using a 
*micro-batch processing* engine, which processes data streams as a series of 
small batch jobs thereby achieving end-to-end latencies as low as 100 
milliseconds and exactly-once fault-tolerance guarantees. However, since Spark 
2.3, we have introduced a new low-latency processing mode called **Continuous 
Processing**, which can achieve end-to-end latencies as low as 1 millisecond 
with at-least-once guarantees. Without changing the Dataset/DataFrame 
operations in your queries, you will be able to choose the mode based on your 
application requirements. 
 
@@ -479,7 +479,7 @@ detail in the [Window Operations](#window-operations-on-event-time) section.
 
 ## Fault Tolerance Semantics
 Delivering end-to-end exactly-once semantics was one of key goals behind the 
design of Structured Streaming. To achieve that, we have designed the 
Structured Streaming sources, the sinks and the execution engine to reliably 
track the exact progress of the processing so that it can handle any kind of 
failure by restarting and/or reprocessing. Every streaming source is assumed to 
have offsets (similar to Kafka offsets, or Kinesis sequence numbers)
-to track the read position in the stream. The engine uses checkpointing and 
write ahead logs to record the offset range of the data being processed in each 
trigger. The streaming sinks are designed to be idempotent for handling 
reprocessing. Together, using replayable sources and idempotent sinks, 
Structured Streaming can ensure **end-to-end exactly-once semantics** under any 
failure.
+to track the read position in the stream. The engine uses checkpointing and 
write-ahead logs to record the offset range of the data being processed in each 
trigger. The streaming sinks are designed to be idempotent for handling 
reprocessing. Together, using replayable sources and idempotent sinks, 
Structured Streaming can ensure **end-to-end exactly-once semantics** under any 
failure.
 
 # API using Datasets and DataFrames
 Since Spark 2.0, DataFrames and Datasets can represent static, bounded data, 
as well as streaming, unbounded data. Similar to static Datasets/DataFrames, 
you can use the common entry point `SparkSession`
@@ -690,7 +690,7 @@ These examples generate streaming DataFrames that are untyped, meaning that the
 
 By default, Structured Streaming from file based sources requires you to 
specify the schema, rather than rely on Spark to infer it automatically. This 
restriction ensures a consistent schema will be used for the streaming query, 
even in the case of failures. For ad-hoc use cases, you can reenable schema 
inference by setting `spark.sql.streaming.schemaInference` to `true`.
 
-Partition discovery does occur when subdirectories that are named 
`/key=value/` are present and listing will automatically recurse into these 
directories. If these columns appear in the user provided schema, they will be 
filled in by Spark based on the path of the file being read. The directories 
that make up the partitioning scheme must be present when the query starts and 
must remain static. For example, it is okay to add `/data/year=2016/` when 
`/data/year=2015/` was present, but it is invalid to change the partitioning 
column (i.e. by creating the directory `/data/date=2016-04-17/`).
+Partition discovery does occur when subdirectories that are named 
`/key=value/` are present and listing will automatically recurse into these 
directories. If these columns appear in the user-provided schema, they will be 
filled in by Spark based on the path of the file being read. The directories 
that make up the partitioning scheme must be present when the query starts and 
must remain static. For example, it is okay to add `/data/year=2016/` when 
`/data/year=2015/` was present, but it is invalid to change the partitioning 
column (i.e. by creating the directory `/data/date=2016-04-17/`).
 
 ## Operations on streaming DataFrames/Datasets
 You can apply all kinds of operations on streaming DataFrames/Datasets – 
ranging from untyped, SQL-like operations (e.g. `select`, `where`, `groupBy`), 
to typed RDD-like operations (e.g. `map`, `filter`, `flatMap`). See the [SQL 
programming guide](sql-programming-guide.html) for more details. Let’s take a 
look at a few example operations that you can use.
@@ -2661,7 +2661,7 @@ sql("SET spark.sql.streaming.metricsEnabled=true")
 All queries started in the SparkSession after this configuration has been 
enabled will report metrics through Dropwizard to whatever 
[sinks](monitoring.html#metrics) have been configured (e.g. Ganglia, Graphite, 
JMX, etc.).
 
 ## Recovering from Failures with Checkpointing 
-In case of a failure or intentional shutdown, you can recover the previous 
progress and state of a previous query, and continue where it left off. This is 
done using checkpointing and write ahead logs. You can configure a query with a 
checkpoint location, and the query will save all the progress information (i.e. 
range of offsets processed in each trigger) and the running aggregates (e.g. 
word counts in the [quick example](#quick-example)) to the checkpoint location. 
This checkpoint location has to be a path in an HDFS compatible file system, 
and can be set as an option in the DataStreamWriter when [starting a 
query](#starting-streaming-queries).
+In case of a failure or intentional shutdown, you can recover the previous 
progress and state of a previous query, and continue where it left off. This is 
done using checkpointing and write-ahead logs. You can configure a query with a 
checkpoint location, and the query will save all the progress information (i.e. 
range of offsets processed in each trigger) and the running aggregates (e.g. 
word counts in the [quick example](#quick-example)) to the checkpoint location. 
This checkpoint location has to be a path in an HDFS compatible file system, 
and can be set as an option in the DataStreamWriter when [starting a 
query](#starting-streaming-queries).
 
 <div class="codetabs">
 <div data-lang="scala"  markdown="1">

http://git-wip-us.apache.org/repos/asf/spark/blob/6ade5cbb/docs/submitting-applications.md
----------------------------------------------------------------------
diff --git a/docs/submitting-applications.md b/docs/submitting-applications.md
index a3643bf..77aa083 100644
--- a/docs/submitting-applications.md
+++ b/docs/submitting-applications.md
@@ -177,7 +177,7 @@ The master URL passed to Spark can be in one of the following formats:
 # Loading Configuration from a File
 
 The `spark-submit` script can load default [Spark configuration 
values](configuration.html) from a
-properties file and pass them on to your application. By default it will read 
options
+properties file and pass them on to your application. By default, it will read 
options
 from `conf/spark-defaults.conf` in the Spark directory. For more detail, see 
the section on
 [loading default 
configurations](configuration.html#loading-default-configurations).
 

http://git-wip-us.apache.org/repos/asf/spark/blob/6ade5cbb/docs/tuning.md
----------------------------------------------------------------------
diff --git a/docs/tuning.md b/docs/tuning.md
index fc27713..912c398 100644
--- a/docs/tuning.md
+++ b/docs/tuning.md
@@ -196,7 +196,7 @@ To further tune garbage collection, we first need to understand some basic infor
 
 * A simplified description of the garbage collection procedure: When Eden is 
full, a minor GC is run on Eden and objects
   that are alive from Eden and Survivor1 are copied to Survivor2. The Survivor 
regions are swapped. If an object is old
-  enough or Survivor2 is full, it is moved to Old. Finally when Old is close 
to full, a full GC is invoked.
+  enough or Survivor2 is full, it is moved to Old. Finally, when Old is close 
to full, a full GC is invoked.
 
 The goal of GC tuning in Spark is to ensure that only long-lived RDDs are 
stored in the Old generation and that
 the Young generation is sufficiently sized to store short-lived objects. This 
will help avoid full GCs to collect

http://git-wip-us.apache.org/repos/asf/spark/blob/6ade5cbb/python/README.md
----------------------------------------------------------------------
diff --git a/python/README.md b/python/README.md
index 3f17fdb..2e0112d 100644
--- a/python/README.md
+++ b/python/README.md
@@ -22,7 +22,7 @@ This packaging is currently experimental and may change in future versions (alth
 Using PySpark requires the Spark JARs, and if you are building this from 
source please see the builder instructions at
 ["Building Spark"](http://spark.apache.org/docs/latest/building-spark.html).
 
-The Python packaging for Spark is not intended to replace all of the other use 
cases. This Python packaged version of Spark is suitable for interacting with 
an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not 
contain the tools required to setup your own standalone Spark cluster. You can 
download the full version of Spark from the [Apache Spark downloads 
page](http://spark.apache.org/downloads.html).
+The Python packaging for Spark is not intended to replace all of the other use 
cases. This Python packaged version of Spark is suitable for interacting with 
an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not 
contain the tools required to set up your own standalone Spark cluster. You can 
download the full version of Spark from the [Apache Spark downloads 
page](http://spark.apache.org/downloads.html).
 
 
 **NOTE:** If you are using this with a Spark standalone cluster you must 
ensure that the version (including minor version) matches or you may experience 
odd errors.

http://git-wip-us.apache.org/repos/asf/spark/blob/6ade5cbb/sql/README.md
----------------------------------------------------------------------
diff --git a/sql/README.md b/sql/README.md
index fe1d352..70cc7c6 100644
--- a/sql/README.md
+++ b/sql/README.md
@@ -6,7 +6,7 @@ This module provides support for executing relational queries expressed in eithe
 Spark SQL is broken up into four subprojects:
  - Catalyst (sql/catalyst) - An implementation-agnostic framework for 
manipulating trees of relational operators and expressions.
  - Execution (sql/core) - A query planner / execution engine for translating 
Catalyst's logical query plans into Spark RDDs.  This component also includes a 
new public interface, SQLContext, that allows users to execute SQL or LINQ 
statements against existing RDDs and Parquet files.
- - Hive Support (sql/hive) - Includes an extension of SQLContext called 
HiveContext that allows users to write queries using a subset of HiveQL and 
access data from a Hive Metastore using Hive SerDes.  There are also wrappers 
that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs.
+ - Hive Support (sql/hive) - Includes an extension of SQLContext called 
HiveContext that allows users to write queries using a subset of HiveQL and 
access data from a Hive Metastore using Hive SerDes. There are also wrappers 
that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
  - HiveServer and CLI support (sql/hive-thriftserver) - Includes support for 
the SQL CLI (bin/spark-sql) and a HiveServer2 (for JDBC/ODBC) compatible server.
 
 Running `sql/create-docs.sh` generates SQL documentation for built-in 
functions under `sql/site`.

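Note on the partition discovery paragraph touched in
docs/structured-streaming-programming-guide.md: the rule it describes (new
`/key=value/` directories are fine, a changed partitioning column is not) can
be illustrated with a small standalone sketch. This is not Spark's
implementation; the helper names `partition_values` and
`same_partition_columns` are hypothetical and exist only for illustration.

```python
from typing import Dict


def partition_values(path: str) -> Dict[str, str]:
    """Collect the key=value directory segments found in a path.

    Hypothetical helper: it only mimics the /key=value/ naming
    convention described in the guide, not Spark's actual parser.
    """
    values: Dict[str, str] = {}
    for segment in path.strip("/").split("/"):
        key, sep, value = segment.partition("=")
        # Keep only segments that really look like key=value.
        if sep and key and value:
            values[key] = value
    return values


def same_partition_columns(existing: str, new: str) -> bool:
    """True when both paths use the same partition column names, in order."""
    return list(partition_values(existing)) == list(partition_values(new))


if __name__ == "__main__":
    print(partition_values("/data/year=2016/part-0000.json"))
    print(same_partition_columns("/data/year=2015/", "/data/year=2016/"))
    print(same_partition_columns("/data/year=2015/", "/data/date=2016-04-17/"))
```

Under these assumptions, adding `/data/year=2016/` next to `/data/year=2015/`
keeps the column set unchanged, while `/data/date=2016-04-17/` introduces a
different partitioning column, which is the case the guide calls invalid.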

