GitHub user souravaswal opened a pull request:
https://github.com/apache/spark/pull/19541
ABCD
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise,
remove this)
Please review http://spark.apache.org/contributing.html before opening a
pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19541.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19541
commit 9e451bcf36151bf401f72dcd66001b9ceb079738
Author: Dongjoon Hyun
Date: 2017-09-05T21:35:09Z
[MINOR][DOC] Update `Partition Discovery` section to enumerate all
available file sources
## What changes were proposed in this pull request?
All built-in data sources support `Partition Discovery`. We should update
the document to state this clearly so users can benefit from it.
**AFTER**
Screenshot: https://user-images.githubusercontent.com/9700541/30083628-14278908-9244-11e7-98dc-9ad45fe233a9.png
## How was this patch tested?
```
SKIP_API=1 jekyll serve --watch
```
Author: Dongjoon Hyun
Closes #19139 from dongjoon-hyun/partitiondiscovery.
commit 6a2325448000ba431ba3b982d181c017559abfe3
Author: jerryshao
Date: 2017-09-06T01:39:39Z
[SPARK-18061][THRIFTSERVER] Add spnego auth support for ThriftServer
thrift/http protocol
Spark ThriftServer doesn't support SPNEGO auth for the thrift/http protocol,
which is mainly needed in the Knox + ThriftServer scenario. HiveServer2's
CLIService already has code to support it, so this change copies that code
into Spark ThriftServer to add the same support.
Related Hive JIRA HIVE-6697.
Manual verification.
Author: jerryshao
Closes #18628 from jerryshao/SPARK-21407.
Change-Id: I61ef0c09f6972bba982475084a6b0ae3a74e385e
commit 445f1790ade1c53cf7eee1f282395648e4d0992c
Author: jerryshao
Date: 2017-09-06T04:28:54Z
[SPARK-9104][CORE] Expose Netty memory metrics in Spark
## What changes were proposed in this pull request?
This PR exposes Netty memory usage for Spark's `TransportClientFactory` and
`TransportServer`, including per-arena details for both direct and heap
arenas, as well as aggregated metrics. The purpose of adding these metrics is
to better understand Netty's memory usage in Spark shuffle, RPC, and other
network communications, and to guide better configuration of executor memory
sizes.
This PR doesn't expose these metrics to any sink; to leverage this feature,
you still need to connect them to the MetricsSystem or collect them back to
the driver for display.
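The aggregation idea can be illustrated with a small sketch: sum each counter
across arenas to produce one pool-level view, as a metrics sink would report
it. This is a minimal illustration of the aggregation step only; the field
names below are hypothetical and are not Netty's actual allocator API.

```python
def aggregate_arena_metrics(arenas):
    """Sum per-arena counters (e.g. active allocations, bytes used)
    into one aggregated view across all arenas of a pool."""
    totals = {}
    for arena in arenas:
        for name, value in arena.items():
            totals[name] = totals.get(name, 0) + value
    return totals

# Illustrative per-arena snapshots (field names are made up for the sketch).
direct_arenas = [
    {"numActiveAllocations": 12, "usedBytes": 4096},
    {"numActiveAllocations": 3,  "usedBytes": 1024},
]
print(aggregate_arena_metrics(direct_arenas))
# {'numActiveAllocations': 15, 'usedBytes': 5120}
```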
## How was this patch tested?
Added a unit test to verify it; also manually verified in a real cluster.
Author: jerryshao
Closes #18935 from jerryshao/SPARK-9104.
commit 4ee7dfe41b27abbd4c32074ecc8f268f6193c3f4
Author: Riccardo Corbella
Date: 2017-09-06T07:22:57Z
[SPARK-21924][DOCS] Update structured streaming programming guide doc
## What changes were proposed in this pull request?
Update the line "For example, the data (12:09, cat) is out of order and
late, and it falls in windows 12:05 - 12:15 and 12:10 - 12:20." to read "For
example, the data (12:09, cat) is out of order and late, and it falls in
windows 12:00 - 12:10 and 12:05 - 12:15." in the Structured Streaming
programming guide.
Author: Riccardo Corbella
Closes #19137 from riccardocorbella/bugfix.
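The corrected window assignment can be checked with a short sketch of
sliding-window semantics (10-minute windows sliding every 5 minutes, as in the
guide's example). This is an illustrative standalone function, not Spark's
implementation:

```python
from datetime import datetime, timedelta

def assigned_windows(event_time, window=timedelta(minutes=10),
                     slide=timedelta(minutes=5)):
    """Return all (start, end) sliding windows that contain event_time."""
    epoch = datetime(1970, 1, 1)
    offset = (event_time - epoch) % slide
    start = event_time - offset              # latest window start <= event_time
    windows = []
    while start > event_time - window:       # every start within one window length
        windows.append((start, start + window))
        start -= slide
    return sorted(windows)

# (12:09, cat) falls in 12:00 - 12:10 and 12:05 - 12:15, as the fixed text says.
for s, e in assigned_windows(datetime(2017, 1, 1, 12, 9)):
    print(s.time(), "-", e.time())
```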
commit 16c4c03c71394ab30c8edaf4418973e1a2c5ebfe
Author: Bryan Cutler
Date: 2017-09-06T12:12:27Z
[SPARK-19357][ML] Adding parallel model evaluation in ML tuning
## What changes were proposed in this pull request?
Modified `CrossValidator` and `TrainValidationSplit` to be able to evaluate
models in parallel for a given parameter grid. The level of parallelism is
controlled by a parameter `numParallelEval` used to schedule a number of models
to be trained/evaluated so that the jobs can run concurrently. This is a
naive approach that does not check the cluster for available resources, so
the user must take care to tune the parameter appropriately. The default
value is `1`, which trains/evaluates serially.
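The naive scheduling approach described above can be sketched with a plain
thread pool: one fit/evaluate task per parameter setting, with the degree of
parallelism bounded by a `num_parallel_eval` knob (named after the PR's
`numParallelEval` parameter). This is an illustration of the idea only, not
Spark's `CrossValidator` API:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_grid(train_and_score, param_grid, num_parallel_eval=1):
    """Schedule one train/evaluate task per parameter setting;
    num_parallel_eval bounds how many run concurrently (1 = serial)."""
    with ThreadPoolExecutor(max_workers=num_parallel_eval) as pool:
        scores = list(pool.map(train_and_score, param_grid))
    best = max(range(len(scores)), key=scores.__getitem__)
    return param_grid[best], scores

# Toy scorer standing in for a real fit + evaluation (higher regParam scores worse).
grid = [{"regParam": r} for r in (0.01, 0.1, 1.0)]
best_params, scores = evaluate_grid(lambda p: 1.0 - p["regParam"], grid,
                                    num_parallel_eval=2)
print(best_params)  # {'regParam': 0.01}
```

Note that, as the description warns, nothing here checks whether the cluster
actually has capacity for `num_parallel_eval` concurrent jobs.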
## How was this