[GitHub] spark pull request #15578: Branch 2.0

wankunde Thu, 20 Oct 2016 20:27:03 -0700

GitHub user wankunde opened a pull request:

    https://github.com/apache/spark/pull/15578


    Branch 2.0

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wankunde/spark branch-2.0

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15578.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15578
    
----
commit 72d9fba26c19aae73116fd0d00b566967934c6fc
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-09-22T11:35:54Z

    [SPARK-17281][ML][MLLIB] Add treeAggregateDepth parameter for 
AFTSurvivalRegression
    
    ## What changes were proposed in this pull request?
    
    Add treeAggregateDepth parameter for AFTSurvivalRegression to keep 
consistent with LiR/LoR.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: WeichenXu <weichenxu...@outlook.com>
    
    Closes #14851 from 
WeichenXu123/add_treeAggregate_param_for_survival_regression.

commit 8a02410a92429bff50d6ce082f873cea9e9fa91e
Author: Wenchen Fan <wenc...@databricks.com>
Date:   2016-09-22T15:25:32Z

    [SQL][MINOR] correct the comment of SortBasedAggregationIterator.safeProj
    
    ## What changes were proposed in this pull request?
    
    This comment went stale a long time ago; this PR fixes it according to my 
understanding.
    
    ## How was this patch tested?
    
    N/A
    
    Author: Wenchen Fan <wenc...@databricks.com>
    
    Closes #15095 from cloud-fan/update-comment.

commit 17b72d31e0c59711eddeb525becb8085930eadcc
Author: Dhruve Ashar <das...@yahoo-inc.com>
Date:   2016-09-22T17:10:37Z

    [SPARK-17365][CORE] Remove/Kill multiple executors together to reduce RPC 
call time.
    
    ## What changes were proposed in this pull request?
    We now kill multiple executors together instead of iterating over 
expensive RPC calls that each kill a single executor.
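
A minimal sketch of the batching idea, in plain Python rather than Spark's Scala code. The function names and the `send_rpc` callback are hypothetical stand-ins that only count round trips; they are not Spark's API.

```python
from typing import Callable, List


def kill_one_by_one(executor_ids: List[str], send_rpc: Callable[[List[str]], None]) -> int:
    """Issue one RPC per executor and return the number of RPCs sent."""
    calls = 0
    for executor_id in executor_ids:
        send_rpc([executor_id])
        calls += 1
    return calls


def kill_together(executor_ids: List[str], send_rpc: Callable[[List[str]], None]) -> int:
    """Issue a single RPC carrying every executor id; return the number of RPCs sent."""
    if not executor_ids:
        return 0
    send_rpc(list(executor_ids))
    return 1


requests = []  # records the payload of every simulated RPC
idle = ["exec-1", "exec-2", "exec-3"]
print("one by one:", kill_one_by_one(idle, requests.append), "RPCs")
print("together:  ", kill_together(idle, requests.append), "RPC")
```

The cost of the batched variant stays at one round trip regardless of how many executors are released at once.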
    
    ## How was this patch tested?
    Ran a sample Spark job to observe executors being killed/removed with 
dynamic allocation enabled.
    
    Author: Dhruve Ashar <das...@yahoo-inc.com>
    Author: Dhruve Ashar <dhruveas...@gmail.com>
    
    Closes #15152 from dhruve/impr/SPARK-17365.

commit 9f24a17c59b1130d97efa7d313c06577f7344338
Author: Shivaram Venkataraman <shiva...@cs.berkeley.edu>
Date:   2016-09-22T18:52:42Z

    Skip building R vignettes if Spark is not built
    
    ## What changes were proposed in this pull request?
    
    When we build the docs separately we don't have the JAR files from the 
Spark build in the same tree. As the SparkR vignettes need to launch a 
SparkContext to be built, we skip building them if the JAR files don't exist.
    
    ## How was this patch tested?
    
    To test this we can run the following:
    ```
    build/mvn -DskipTests -Psparkr clean
    ./R/create-docs.sh
    ```
    You should see a line `Skipping R vignettes as Spark JARs not found` at the 
end
    
    Author: Shivaram Venkataraman <shiva...@cs.berkeley.edu>
    
    Closes #15200 from shivaram/sparkr-vignette-skip.

commit 85d609cf25c1da2df3cd4f5d5aeaf3cbcf0d674c
Author: Burak Yavuz <brk...@gmail.com>
Date:   2016-09-22T20:05:41Z

    [SPARK-17613] S3A base paths with no '/' at the end return empty DataFrames
    
    ## What changes were proposed in this pull request?
    
    Consider you have a bucket as `s3a://some-bucket`
    and under it you have files:
    ```
    s3a://some-bucket/file1.parquet
    s3a://some-bucket/file2.parquet
    ```
    Getting the parent path of `s3a://some-bucket/file1.parquet` yields
    `s3a://some-bucket/` and the ListingFileCatalog uses this as the key in the 
hash map.
    
    When catalog.allFiles is called, we use `s3a://some-bucket` (no slash at 
the end) to get the list of files, and we're left with an empty list!
    
    This PR fixes this by adding a `/` at the end of the `URI` iff the given 
`Path` doesn't have a parent, i.e. is the root. This is a no-op if the path 
already had a `/` at the end, and is handled through the path merging 
semantics of the Hadoop `Path`.
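
The key mismatch can be reproduced without Spark. The sketch below is plain Python, not the Hadoop `Path` code; `with_root_slash` and the `files_by_parent` dictionary are hypothetical names used only for illustration.

```python
from urllib.parse import urlsplit


def with_root_slash(uri: str) -> str:
    """Append '/' only when the URI names a bucket root (empty path component)."""
    return uri + "/" if urlsplit(uri).path == "" else uri


# Hypothetical catalog keyed by parent directory, as in the description above.
files_by_parent = {"s3a://some-bucket/": ["file1.parquet", "file2.parquet"]}

# Without normalization the lookup misses; with it, the key matches.
print(files_by_parent.get("s3a://some-bucket", []))
print(files_by_parent.get(with_root_slash("s3a://some-bucket"), []))
```

Paths that already end in `/`, or that point below the root, pass through unchanged.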
    
    ## How was this patch tested?
    
    Unit test in `FileCatalogSuite`.
    
    Author: Burak Yavuz <brk...@gmail.com>
    
    Closes #15169 from brkyvz/SPARK-17613.

commit 3cdae0ff2f45643df7bc198cb48623526c7eb1a6
Author: Shixiong Zhu <shixi...@databricks.com>
Date:   2016-09-22T21:26:45Z

    [SPARK-17638][STREAMING] Stop JVM StreamingContext when the Python process 
is dead
    
    ## What changes were proposed in this pull request?
    
    When the Python process is dead, the JVM StreamingContext is still running. 
Hence we will see a lot of Py4jException before the JVM process exits. It's 
better to stop the JVM StreamingContext to avoid those annoying logs.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #15201 from zsxwing/stop-jvm-ssc.

commit 0d634875026ccf1eaf984996e9460d7673561f80
Author: Herman van Hovell <hvanhov...@databricks.com>
Date:   2016-09-22T21:29:27Z

    [SPARK-17616][SQL] Support a single distinct aggregate combined with a 
non-partial aggregate
    
    ## What changes were proposed in this pull request?
    We currently cannot execute an aggregate that contains a single distinct 
aggregate function and one or more non-partially plannable aggregate 
functions, for example:
    ```sql
    select   grp,
             collect_list(col1),
             count(distinct col2)
    from     tbl_a
    group by 1
    ```
    This is a regression from Spark 1.6. It is caused by the fact that the 
single distinct aggregation code path assumes that all aggregates can be 
planned in two phases (are partially aggregatable). This PR works around this 
issue by triggering the `RewriteDistinctAggregates` in such cases (this is 
similar to the approach taken in 1.6).
    
    ## How was this patch tested?
    Created `RewriteDistinctAggregatesSuite` which checks if the aggregates 
with distinct aggregate functions get rewritten into two `Aggregates` and an 
`Expand`. Added a regression test to `DataFrameAggregateSuite`.
    
    Author: Herman van Hovell <hvanhov...@databricks.com>
    
    Closes #15187 from hvanhovell/SPARK-17616.

commit f4f6bd8c9884e3919509907307fda774f56b5ecc
Author: Gayathri Murali <gayathri.m.sof...@gmail.com>
Date:   2016-09-22T23:34:42Z

    [SPARK-16240][ML] ML persistence backward compatibility for LDA
    
    ## What changes were proposed in this pull request?
    
    Allow Spark 2.x to load instances of LDA, LocalLDAModel, and 
DistributedLDAModel saved from Spark 1.6.
    
    ## How was this patch tested?
    
    I tested this manually, saving the 3 types from 1.6 and loading them into 
master (2.x).  In the future, we can add generic tests for testing backwards 
compatibility across all ML models in SPARK-15573.
    
    Author: Joseph K. Bradley <jos...@databricks.com>
    
    Closes #15034 from jkbradley/lda-backwards.

commit a1661968310de35e710e3b6784f63a77c44453fc
Author: Burak Yavuz <brk...@gmail.com>
Date:   2016-09-22T23:50:22Z

    [SPARK-17569][SPARK-17569][TEST] Make the unit test added for work again
    
    ## What changes were proposed in this pull request?
    
    A 
[PR](https://github.com/apache/spark/commit/a6aade0042d9c065669f46d2dac40ec6ce361e63)
 was merged concurrently that made the unit test for PR #15122 not test 
anything anymore. This PR fixes the test.
    
    ## How was this patch tested?
    
    Changed line 
https://github.com/apache/spark/blob/0d634875026ccf1eaf984996e9460d7673561f80/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L137
    from `false` to `true` and made sure the unit test failed.
    
    Author: Burak Yavuz <brk...@gmail.com>
    
    Closes #15203 from brkyvz/fix-test.

commit 79159a1e87f19fb08a36857fc30b600ee7fdc52b
Author: Yucai Yu <yucai...@intel.com>
Date:   2016-09-23T00:22:56Z

    [SPARK-17635][SQL] Remove hardcode "agg_plan" in HashAggregateExec
    
    ## What changes were proposed in this pull request?
    
    "agg_plan" is hardcoded in HashAggregateExec, which is a potential issue, 
so this PR removes it.
    
    ## How was this patch tested?
    
    existing tests.
    
    Author: Yucai Yu <yucai...@intel.com>
    
    Closes #15199 from yucai/agg_plan.

commit a4aeb7677bc07d0b83f82de62dcffd7867d19d9b
Author: Marcelo Vanzin <van...@cloudera.com>
Date:   2016-09-23T04:35:25Z

    [SPARK-17639][BUILD] Add jce.jar to buildclasspath when building.
    
    This was missing, preventing code that uses javax.crypto to properly
    compile in Spark.
    
    Author: Marcelo Vanzin <van...@cloudera.com>
    
    Closes #15204 from vanzin/SPARK-17639.

commit 947b8c6e3acd671d501f0ed6c077aac8e51ccede
Author: Joseph K. Bradley <jos...@databricks.com>
Date:   2016-09-23T05:27:28Z

    [SPARK-16719][ML] Random Forests should communicate fewer trees on each 
iteration
    
    ## What changes were proposed in this pull request?
    
    RandomForest currently sends the entire forest to each worker on each 
iteration. This is because (a) the node queue is FIFO and (b) the closure 
references the entire array of trees (topNodes). (a) causes RFs to handle 
splits in many trees, especially early on in learning. (b) sends all trees 
explicitly.
    
    This PR:
    (a) Change the RF node queue to be FILO (a stack), so that RFs tend to 
focus on 1 or a few trees before focusing on others.
    (b) Change topNodes to pass only the trees required on that iteration.
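
The effect of (a) can be illustrated with a toy simulation. This is a sketch in plain Python under simplifying assumptions (every node splits into two children until `max_depth`, and a fixed number of nodes is processed per iteration); `trees_per_iteration` is a hypothetical helper, not the RandomForest implementation.

```python
from collections import deque


def trees_per_iteration(num_trees, max_depth, batch_size, lifo):
    """Simulate node selection; return the set of tree ids touched in each iteration."""
    # Each pending node is (tree_id, depth); start with one root per tree.
    pending = deque((tree, 0) for tree in range(num_trees))
    touched = []
    while pending:
        take = min(batch_size, len(pending))
        batch = [pending.pop() if lifo else pending.popleft() for _ in range(take)]
        touched.append({tree for tree, _ in batch})
        for tree, depth in batch:
            if depth < max_depth:
                # Splitting a node enqueues its two children.
                pending.extend([(tree, depth + 1)] * 2)
    return touched


fifo = trees_per_iteration(num_trees=4, max_depth=2, batch_size=2, lifo=False)
lifo = trees_per_iteration(num_trees=4, max_depth=2, batch_size=2, lifo=True)
print("FIFO, trees seen in first 3 iterations:", sorted(set().union(*fifo[:3])))
print("LIFO, trees seen in first 3 iterations:", sorted(set().union(*lifo[:3])))
```

With the stack, the simulation keeps working on the most recently split tree, so fewer distinct trees are needed in the early iterations.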
    
    ## How was this patch tested?
    
    Unit tests:
    * Existing tests for correctness of tree learning
    * Manually modifying code and running tests to verify that a small number 
of trees are communicated on each iteration
      * This last item is hard to test via unit tests given the current APIs.
    
    Author: Joseph K. Bradley <jos...@databricks.com>
    
    Closes #14359 from jkbradley/rfs-fewer-trees.

commit 62ccf27ab4b55e734646678ae78b7e812262d14b
Author: Shixiong Zhu <shixi...@databricks.com>
Date:   2016-09-23T06:35:08Z

    [SPARK-17640][SQL] Avoid using -1 as the default batchId for 
FileStreamSource.FileEntry
    
    ## What changes were proposed in this pull request?
    
    Avoid using -1 as the default batchId for FileStreamSource.FileEntry so 
that we can make sure no FileEntry(..., batchId = -1) is ever written into the 
log. This also keeps people from misusing it in the future (#15203 is an example).
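
The same design choice, sketched in Python: a record type whose batch id has no default, so every caller has to supply one. The field names mirror the description above but the class is illustrative, and the negative-id check is an extra assumption that the PR text does not mention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    path: str
    timestamp: int
    batch_id: int  # deliberately no default: callers must pass a real batch id

    def __post_init__(self):
        # Extra guard (an assumption, not part of the PR): reject sentinel values.
        if self.batch_id < 0:
            raise ValueError(f"invalid batch_id {self.batch_id} for {self.path}")


entry = FileEntry(path="/data/part-0.json", timestamp=1474612508000, batch_id=3)
print(entry)
```

Omitting `batch_id` now fails at construction time instead of silently writing a placeholder into the log.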
    
    ## How was this patch tested?
    
    Jenkins.
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #15206 from zsxwing/cleanup.

commit 5c5396cb4725ba5ceee26ed885e8b941d219757b
Author: hyukjinkwon <gurwls...@gmail.com>
Date:   2016-09-23T08:41:50Z

    [BUILD] Closes some stale PRs
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to close some stale PRs and ones suggested to be closed by 
committer(s)
    
    Closes #12415
    Closes #14765
    Closes #15118
    Closes #15184
    Closes #15183
    Closes #9440
    Closes #15023
    Closes #14643
    Closes #14827
    
    ## How was this patch tested?
    
    N/A
    
    Author: hyukjinkwon <gurwls...@gmail.com>
    
    Closes #15198 from HyukjinKwon/stale-prs.

commit 90d5754212425d55f992c939a2bc7d9ac6ef92b8
Author: Holden Karau <hol...@us.ibm.com>
Date:   2016-09-23T08:44:30Z

    [SPARK-16861][PYSPARK][CORE] Refactor PySpark accumulator API on top of 
Accumulator V2
    
    ## What changes were proposed in this pull request?
    
    Move the internals of the PySpark accumulator API off the old deprecated 
API and on top of the new accumulator API.
    
    ## How was this patch tested?
    
    The existing PySpark accumulator tests (both unit tests and doc tests at 
the start of accumulator.py).
    
    Author: Holden Karau <hol...@us.ibm.com>
    
    Closes #14467 from holdenk/SPARK-16861-refactor-pyspark-accumulator-api.

commit f89808b0fdbc04e1bdff1489a6ec4c84ddb2adc4
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-09-23T18:14:22Z

    [SPARK-17499][SPARKR][ML][MLLIB] make the default params in sparkR 
spark.mlp consistent with MultilayerPerceptronClassifier
    
    ## What changes were proposed in this pull request?
    
    Update the `MultilayerPerceptronClassifierWrapper.fit` parameter types:
    `layers: Array[Int]`
    `seed: String`
    
    Update several default params in SparkR `spark.mlp`:
    `tol` --> 1e-6
    `stepSize` --> 0.03
    `seed` --> NULL (when seed == NULL, the Scala-side wrapper regards it as a 
`null` value and the default seed is used)
    The R-side `seed` only supports 32-bit integers.
    
    Remove the `layers` default value, and move `layers` in front of the 
parameters that have default values.
    Add a `layers` parameter validation check.
    
    ## How was this patch tested?
    
    tests added.
    
    Author: WeichenXu <weichenxu...@outlook.com>
    
    Closes #15051 from WeichenXu123/update_py_mlp_default.

commit f62ddc5983a08d4d54c0a9a8210dd6cbec555671
Author: Jeff Zhang <zjf...@apache.org>
Date:   2016-09-23T18:37:43Z

    [SPARK-17210][SPARKR] sparkr.zip is not distributed to executors when 
running sparkr in RStudio
    
    ## What changes were proposed in this pull request?
    
    Spark adds sparkr.zip to the archives only in YARN mode 
(SparkSubmit.scala).
    ```
        if (args.isR && clusterManager == YARN) {
          val sparkRPackagePath = RUtils.localSparkRPackagePath
          if (sparkRPackagePath.isEmpty) {
            printErrorAndExit("SPARK_HOME does not exist for R application in 
YARN mode.")
          }
          val sparkRPackageFile = new File(sparkRPackagePath.get, 
SPARKR_PACKAGE_ARCHIVE)
          if (!sparkRPackageFile.exists()) {
            printErrorAndExit(s"$SPARKR_PACKAGE_ARCHIVE does not exist for R 
application in YARN mode.")
          }
          val sparkRPackageURI = 
Utils.resolveURI(sparkRPackageFile.getAbsolutePath).toString
    
          // Distribute the SparkR package.
          // Assigns a symbol link name "sparkr" to the shipped package.
          args.archives = mergeFileLists(args.archives, sparkRPackageURI + 
"#sparkr")
    
          // Distribute the R package archive containing all the built R 
packages.
          if (!RUtils.rPackages.isEmpty) {
            val rPackageFile =
              RPackageUtils.zipRLibraries(new File(RUtils.rPackages.get), 
R_PACKAGE_ARCHIVE)
            if (!rPackageFile.exists()) {
              printErrorAndExit("Failed to zip all the built R packages.")
            }
    
            val rPackageURI = 
Utils.resolveURI(rPackageFile.getAbsolutePath).toString
            // Assigns a symbol link name "rpkg" to the shipped package.
            args.archives = mergeFileLists(args.archives, rPackageURI + "#rpkg")
          }
        }
    ```
    So it is necessary to pass spark.master from the R process to the JVM. 
Otherwise sparkr.zip won't be distributed to the executors. Besides that, I also 
pass spark.yarn.keytab/spark.yarn.principal to the Spark side, because the JVM 
process needs them to access a secured cluster.
    
    ## How was this patch tested?
    
    Verified manually in RStudio using the following code.
    ```
    Sys.setenv(SPARK_HOME="/Users/jzhang/github/spark")
    .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
    library(SparkR)
    sparkR.session(master="yarn-client", sparkConfig = 
list(spark.executor.instances="1"))
    df <- as.DataFrame(mtcars)
    head(df)
    
    ```
    
    …
    
    Author: Jeff Zhang <zjf...@apache.org>
    
    Closes #14784 from zjffdu/SPARK-17210.

commit 988c71457354b0a443471f501cef544a85b1a76a
Author: Michael Armbrust <mich...@databricks.com>
Date:   2016-09-23T19:17:59Z

    [SPARK-17643] Remove comparable requirement from Offset
    
    For some sources, it is difficult to provide a global ordering based only 
on the data in the offset.  Since we don't use comparison for correctness, let's 
remove it.
    
    Author: Michael Armbrust <mich...@databricks.com>
    
    Closes #15207 from marmbrus/removeComparable.

commit 90a30f46349182b6fc9d4123090c4712fdb425be
Author: jisookim <jisookim0...@gmail.com>
Date:   2016-09-23T20:43:47Z

    [SPARK-12221] add cpu time to metrics
    
    Currently task metrics don't support executor CPU time, so there's no way 
to calculate how much CPU time a stage/task took from History Server metrics. 
This PR enables reporting CPU time.
    
    Author: jisookim <jisookim0...@gmail.com>
    
    Closes #10212 from jisookim0513/add-cpu-time-metric.

commit 7c382524a959a2bc9b3d2fca44f6f0b41aba4e3c
Author: Shivaram Venkataraman <shiva...@cs.berkeley.edu>
Date:   2016-09-23T21:35:18Z

    [SPARK-17651][SPARKR] Set R package version number along with mvn
    
    ## What changes were proposed in this pull request?
    
    This PR sets the R package version while tagging releases. Note that since 
R doesn't accept `-SNAPSHOT` in version number field, we remove that while 
setting the next version
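
The version rewrite amounts to dropping the suffix. A minimal sketch in Python; `r_package_version` is a hypothetical helper, and the actual release tooling may do this differently.

```python
def r_package_version(maven_version: str) -> str:
    """Drop a trailing '-SNAPSHOT', which the R version field does not accept."""
    suffix = "-SNAPSHOT"
    if maven_version.endswith(suffix):
        return maven_version[: -len(suffix)]
    return maven_version


print(r_package_version("2.1.0-SNAPSHOT"))
print(r_package_version("2.0.1"))
```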
    
    ## How was this patch tested?
    
    Tested manually by running locally
    
    Author: Shivaram Venkataraman <shiva...@cs.berkeley.edu>
    
    Closes #15223 from shivaram/sparkr-version-change.

commit f3fe55439e4c865c26502487a1bccf255da33f4a
Author: Sean Owen <so...@cloudera.com>
Date:   2016-09-24T07:06:41Z

    [SPARK-10835][ML] Word2Vec should accept non-null string array, in addition 
to existing null string array
    
    ## What changes were proposed in this pull request?
    
    To match Tokenizer and for compatibility with Word2Vec, output a nullable 
string array type in NGram
    
    ## How was this patch tested?
    
    Jenkins tests.
    
    Author: Sean Owen <so...@cloudera.com>
    
    Closes #15179 from srowen/SPARK-10835.

commit 248916f5589155c0c3e93c3874781f17b08d598d
Author: Sean Owen <so...@cloudera.com>
Date:   2016-09-24T07:15:55Z

    [SPARK-17057][ML] ProbabilisticClassifierModels' thresholds should have at 
most one 0
    
    ## What changes were proposed in this pull request?
    
    Match ProbabilisticClassifier.thresholds requirements to R randomForest 
cutoff, requiring all > 0
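
A sketch of the kind of validation involved, in plain Python. It follows the rule stated in the commit title (no negative values and at most one 0); `validate_thresholds` is a hypothetical name, not the Spark method.

```python
def validate_thresholds(thresholds):
    """Accept non-negative thresholds with at most one value equal to 0."""
    if any(t < 0 for t in thresholds):
        raise ValueError("thresholds must be non-negative")
    if sum(1 for t in thresholds if t == 0) > 1:
        raise ValueError("at most one threshold may be 0")
    return list(thresholds)


print(validate_thresholds([0.5, 0.3, 0.2]))
print(validate_thresholds([0.0, 0.7, 0.3]))
```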
    
    ## How was this patch tested?
    
    Jenkins tests plus new test cases
    
    Author: Sean Owen <so...@cloudera.com>
    
    Closes #15149 from srowen/SPARK-17057.

commit 7945daed12542587d51ece8f07e5c828b40db14a
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-09-24T08:03:11Z

    [MINOR][SPARKR] Add sparkr-vignettes.html to gitignore.
    
    ## What changes were proposed in this pull request?
    Add ```sparkr-vignettes.html``` to ```.gitignore```.
    
    ## How was this patch tested?
    No test needed.
    
    Author: Yanbo Liang <yblia...@gmail.com>
    
    Closes #15215 from yanboliang/ignore.

commit de333d121da4cb80d45819cbcf8b4246e48ec4d0
Author: xin wu <xi...@us.ibm.com>
Date:   2016-09-25T23:46:12Z

    [SPARK-17551][SQL] Add DataFrame API for null ordering
    
    ## What changes were proposed in this pull request?
    This pull request adds Scala/Java DataFrame API for null ordering (NULLS 
FIRST | LAST).
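
To make the NULLS FIRST | LAST semantics concrete, here is a small sketch in plain Python, with `None` standing in for SQL NULL. `sort_with_nulls` is a hypothetical helper and does not use the DataFrame API added by this PR.

```python
def sort_with_nulls(values, nulls_first=True, descending=False):
    """Sort values, placing None entries first or last independent of sort direction."""
    non_null = sorted((v for v in values if v is not None), reverse=descending)
    nulls = [v for v in values if v is None]
    return nulls + non_null if nulls_first else non_null + nulls


data = [3, None, 1, None, 2]
print(sort_with_nulls(data, nulls_first=True))
print(sort_with_nulls(data, nulls_first=False, descending=True))
```

The placement of nulls is chosen separately from ascending or descending order, which is what the four combinations of the ordering clause express.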
    
    Also did some minor clean up for related code (e.g. incorrect indentation), 
and renamed "orderby-nulls-ordering.sql" to be consistent with existing test 
files.
    
    ## How was this patch tested?
    Added a new test case in DataFrameSuite.
    
    Author: petermaxlee <petermax...@gmail.com>
    Author: Xin Wu <xi...@us.ibm.com>
    
    Closes #15123 from petermaxlee/SPARK-17551.

commit 59d87d24079bc633e63ce032f0a5ddd18a3b02cb
Author: Burak Yavuz <brk...@gmail.com>
Date:   2016-09-26T05:57:31Z

    [SPARK-17650] malformed url's throw exceptions before bricking Executors
    
    ## What changes were proposed in this pull request?
    
    When a malformed URL is sent to executors through `sc.addJar` or 
`sc.addFile`, the executors become unusable, because they constantly throw 
`MalformedURLException`s and can never acknowledge that the file or jar is just 
bad input.
    
    This PR tries to fix that problem by making sure malformed URLs can never be 
submitted through `sc.addJar` and `sc.addFile`. Another solution would be to 
blacklist bad files and jars on executors: maybe fail the first time, and then 
ignore the second time (but print a warning message).
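
The fail-fast approach can be sketched as a check that runs on the submitting side before anything is shipped. This is plain Python with a deliberately narrow notion of "malformed" (whitespace, or a remote scheme without a host); `validate_resource_url` is hypothetical and is not the check Spark performs.

```python
from urllib.parse import urlsplit


def validate_resource_url(path: str) -> str:
    """Raise ValueError up front instead of handing a bad URL to remote workers."""
    if any(ch.isspace() for ch in path):
        raise ValueError(f"URL contains whitespace: {path!r}")
    parts = urlsplit(path)  # may itself raise ValueError on some malformed input
    if parts.scheme in ("http", "https", "ftp") and not parts.netloc:
        raise ValueError(f"URL has no host: {path!r}")
    return path


print(validate_resource_url("http://repo.example.com/app.jar"))
print(validate_resource_url("/local/path/app.jar"))
```

Rejecting the input at submission time means the error surfaces once, to the caller, rather than repeatedly on every worker.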
    
    ## How was this patch tested?
    
    Unit tests in SparkContextSuite
    
    Author: Burak Yavuz <brk...@gmail.com>
    
    Closes #15224 from brkyvz/SPARK-17650.

commit ac65139be96dbf87402b9a85729a93afd3c6ff17
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-09-26T08:45:33Z

    [SPARK-17017][FOLLOW-UP][ML] Refactor of ChiSqSelector and add ML Python 
API.
    
    ## What changes were proposed in this pull request?
    #14597 modified ```ChiSqSelector``` to support the ```fpr``` selector type; 
however, it left some issues that need to be addressed:
    * We should allow users to set the selector type explicitly rather than 
switching it through different setter functions, since the result then depends 
on the order of the setter calls. For example, if users set both 
```numTopFeatures``` and ```percentile```, it will train a ```kbest``` or 
```percentile``` model depending on the order of setting (the one set later 
is trained). This confuses users, so we should allow them to set the 
selector type explicitly. We handle similar issues elsewhere in the ML code 
base, such as in ```GeneralizedLinearRegression``` and ```LogisticRegression```.
    * Meanwhile, if more than one parameter besides ```alpha``` can be 
set for the ```fpr``` model, we cannot handle it elegantly in the existing 
framework. The same holds for the ```kbest``` and ```percentile``` models. 
Setting the selector type explicitly also solves this issue.
    * If users are allowed to set the selector type explicitly, we should handle 
param interactions: for example, if users set ```selectorType = percentile``` and 
```alpha = 0.1```, we should notify them that the parameter ```alpha``` will have 
no effect. We should handle complex parameter interaction checks in 
```transformSchema```. (FYI #11620)
    * We should use lower-case selector type names to follow the MLlib 
convention.
    * Add ML Python API.
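
The param interaction check described in the list above could look roughly like this. It is a sketch in plain Python, not the `transformSchema` code; the mapping and the `ignored_params` helper are hypothetical, with parameter and selector type names taken from the description.

```python
# Parameter that each selector type actually reads (names taken from the list above).
PARAM_FOR_TYPE = {"kbest": "numTopFeatures", "percentile": "percentile", "fpr": "alpha"}


def ignored_params(selector_type, explicitly_set):
    """Return the explicitly set params that the chosen selector type will ignore."""
    if selector_type not in PARAM_FOR_TYPE:
        raise ValueError(f"unknown selector type: {selector_type!r}")
    used = PARAM_FOR_TYPE[selector_type]
    return sorted(
        p for p in explicitly_set if p in PARAM_FOR_TYPE.values() and p != used
    )


for name in ignored_params("percentile", {"percentile", "alpha"}):
    print(f"warning: {name} has no effect when selectorType = percentile")
```

Because the selector type is stated explicitly, the outcome no longer depends on the order in which the setters were called.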
    
    ## How was this patch tested?
    Unit test.
    
    Author: Yanbo Liang <yblia...@gmail.com>
    
    Closes #15214 from yanboliang/spark-17017.

commit 50b89d05b7bffc212cc9b9ae6e0bca7cb90b9c77
Author: Justin Pihony <justin.pih...@gmail.com>
Date:   2016-09-26T08:54:22Z

    [SPARK-14525][SQL] Make DataFrameWriter.save work for jdbc
    
    ## What changes were proposed in this pull request?
    
    This change modifies the implementation of DataFrameWriter.save such that 
it works with jdbc, and the call to jdbc merely delegates to save.
    
    ## How was this patch tested?
    
    This was tested via unit tests in the JDBCWriteSuite, of which I added one 
new test to cover this scenario.
    
    ## Additional details
    
    rxin This seems to have been most recently touched by you and was also 
commented on in the JIRA.
    
    This contribution is my original work and I license the work to the project 
under the project's open source license.
    
    Author: Justin Pihony <justin.pih...@gmail.com>
    Author: Justin Pihony <justin.pih...@typesafe.com>
    
    Closes #12601 from JustinPihony/jdbc_reconciliation.

commit f234b7cd795dd9baa3feff541c211b4daf39ccc6
Author: hyukjinkwon <gurwls...@gmail.com>
Date:   2016-09-26T11:19:39Z

    [SPARK-16356][ML] Add testImplicits for ML unit tests and promote toDF()
    
    ## What changes were proposed in this pull request?
    
    This was suggested in 
https://github.com/apache/spark/commit/101663f1ae222a919fc40510aa4f2bad22d1be6f#commitcomment-17114968.
    
    This PR adds `testImplicits` to `MLlibTestSparkContext` so that some 
implicits such as `toDF()` can be used across ml tests.
    
    This PR also changes all the usages of `spark.createDataFrame( ... )` to 
`toDF()` where applicable in ml tests in Scala.
    
    ## How was this patch tested?
    
    Existing tests should work.
    
    Author: hyukjinkwon <gurwls...@gmail.com>
    
    Closes #14035 from HyukjinKwon/minor-ml-test.

commit bde85f8b70138a51052b613664facbc981378c38
Author: Shixiong Zhu <shixi...@databricks.com>
Date:   2016-09-26T17:44:35Z

    [SPARK-17649][CORE] Log how many Spark events got dropped in LiveListenerBus
    
    ## What changes were proposed in this pull request?
    
    Log how many Spark events got dropped in LiveListenerBus so that the user 
can get insights on how to set a correct event queue size.
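
A minimal sketch of counting dropped events on a bounded queue, in plain Python. `BoundedEventBus` is a hypothetical class for illustration and does not mirror the LiveListenerBus implementation; it assumes a positive capacity.

```python
import logging
import queue


class BoundedEventBus:
    """Bounded event queue that counts events it had to drop (capacity must be > 0)."""

    def __init__(self, capacity: int):
        self._events = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def post(self, event) -> bool:
        """Enqueue without blocking; count and log the event if the queue is full."""
        try:
            self._events.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            logging.warning(
                "Dropped %d event(s) so far; the queue holds at most %d",
                self.dropped,
                self._events.maxsize,
            )
            return False


bus = BoundedEventBus(capacity=2)
accepted = [bus.post(i) for i in range(5)]
print(accepted, bus.dropped)
```

Reporting the running count, rather than a one-time warning, gives the user a number to compare against the configured queue size.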
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #15220 from zsxwing/SPARK-17649.

commit 8135e0e5ebdb9c7f5ac41c675dc8979a5127a31a
Author: Liang-Chi Hsieh <sim...@tw.ibm.com>
Date:   2016-09-26T20:07:11Z

    [SPARK-17153][SQL] Should read partition data when reading new files in 
filestream without globbing
    
    ## What changes were proposed in this pull request?
    
    When reading a file stream with a non-globbing path, the results contain 
all `null`s for the
    partition columns. E.g.,
    
        case class A(id: Int, value: Int)
        val data = spark.createDataset(Seq(
          A(1, 1),
          A(2, 2),
          A(2, 3))
        )
        val url = "/tmp/test"
        data.write.partitionBy("id").parquet(url)
        spark.read.parquet(url).show
    
        +-----+---+
        |value| id|
        +-----+---+
        |    2|  2|
        |    3|  2|
        |    1|  1|
        +-----+---+
    
        val s = 
spark.readStream.schema(spark.read.load(url).schema).parquet(url)
        s.writeStream.queryName("test").format("memory").start()
    
        sql("SELECT * FROM test").show
    
        +-----+----+
        |value|  id|
        +-----+----+
        |    2|null|
        |    3|null|
        |    1|null|
        +-----+----+
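
For context, the partition values in the example above live in the directory names (e.g. `id=2`), not in the data files. The sketch below shows that extraction in plain Python; `partition_values` is a hypothetical helper, it returns values as strings, and Spark's actual partition discovery is more involved.

```python
def partition_values(base_path, file_path):
    """Extract key=value directory segments between base_path and the file name."""
    base = base_path.rstrip("/") + "/"
    if not file_path.startswith(base):
        raise ValueError(f"{file_path!r} is not under {base_path!r}")
    directories = file_path[len(base):].split("/")[:-1]  # drop the file name
    return dict(segment.split("=", 1) for segment in directories if "=" in segment)


print(partition_values("/tmp/test", "/tmp/test/id=2/part-00000.parquet"))
```

If the reader only looks at the file itself and skips this step, the partition column has no source and comes back as `null`, as in the second table above.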
    
    ## How was this patch tested?
    
    Jenkins tests.
    
    Author: Liang-Chi Hsieh <sim...@tw.ibm.com>
    Author: Liang-Chi Hsieh <vii...@gmail.com>
    
    Closes #14803 from viirya/filestreamsource-option.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
