GitHub user gentlewangyu opened a pull request:

    https://github.com/apache/spark/pull/21417

    Branch 2.0

    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.0

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21417.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21417
    
----
commit 050b8177e27df06d33a6f6f2b3b6a952b0d03ba6
Author: cody koeninger <cody@...>
Date:   2016-10-12T22:22:06Z

    [SPARK-17782][STREAMING][KAFKA] alternative eliminate race condition of 
poll twice
    
    ## What changes were proposed in this pull request?
    
    Alternative approach to https://github.com/apache/spark/pull/15387
    
    Author: cody koeninger <c...@koeninger.org>
    
    Closes #15401 from koeninger/SPARK-17782-alt.
    
    (cherry picked from commit f9a56a153e0579283160519065c7f3620d12da3e)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 5903dabc57c07310573babe94e4f205bdea6455f
Author: Brian Cho <bcho@...>
Date:   2016-10-13T03:43:18Z

    [SPARK-16827][BRANCH-2.0] Avoid reporting spill metrics as shuffle metrics
    
    ## What changes were proposed in this pull request?
    
    Fix a bug where spill metrics were being reported as shuffle metrics. 
Eventually these spill metrics should be reported (SPARK-3577), but separate 
from shuffle metrics. The fix itself basically reverts the line to what it was 
in 1.6.
    
    ## How was this patch tested?
    
    Cherry-picked from master (#15347)
    
    Author: Brian Cho <b...@fb.com>
    
    Closes #15455 from dafrista/shuffle-metrics-2.0.

commit ab00e410c6b1d7dafdfabcea1f249c78459b94f0
Author: Burak Yavuz <brkyvz@...>
Date:   2016-10-13T04:40:45Z

    [SPARK-17876] Write StructuredStreaming WAL to a stream instead of 
materializing all at once
    
    ## What changes were proposed in this pull request?
    
    The CompactibleFileStreamLog materializes the whole metadata log in memory 
as a String. This can cause issues when there are lots of files that are being 
committed, especially during a compaction batch.
    You may come across stacktraces that look like:
    ```
    java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    at java.lang.StringCoding.encode(StringCoding.java:350)
    at java.lang.String.getBytes(String.java:941)
    at 
org.apache.spark.sql.execution.streaming.FileStreamSinkLog.serialize(FileStreamSinkLog.scala:127)
    
    ```
    The safer way is to write to an output stream so that we don't have to 
materialize a huge string.
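    
    A minimal sketch of the idea (using a hypothetical helper name, not the actual `CompactibleFileStreamLog` code): serialize each entry straight to the output stream, so no single huge `String` is ever built.
    
    ```scala
    // Hedged sketch: write entries one at a time instead of building one big String.
    import java.io.OutputStream
    import java.nio.charset.StandardCharsets.UTF_8
    
    def serializeTo(entries: Seq[String], out: OutputStream): Unit = {
      entries.foreach { entry =>
        out.write(entry.getBytes(UTF_8)) // one entry per line, no intermediate concatenation
        out.write('\n')
      }
      out.flush()
    }
    ```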
    
    ## How was this patch tested?
    
    Existing unit tests
    
    Author: Burak Yavuz <brk...@gmail.com>
    
    Closes #15437 from brkyvz/ser-to-stream.
    
    (cherry picked from commit edeb51a39d76d64196d7635f52be1b42c7ec4341)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit d38f38a093b4dff32c686675d93ab03e7a8f4908
Author: buzhihuojie <ren.weiluo@...>
Date:   2016-10-13T05:51:54Z

    minor doc fix for Row.scala
    
    ## What changes were proposed in this pull request?
    
    minor doc fix for "getAnyValAs" in class Row
    
    ## How was this patch tested?
    
    None.
    
    (If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)
    
    Author: buzhihuojie <ren.wei...@gmail.com>
    
    Closes #15452 from david-weiluo-ren/minorDocFixForRow.
    
    (cherry picked from commit 7222a25a11790fa9d9d1428c84b6f827a785c9e8)
    Signed-off-by: Reynold Xin <r...@databricks.com>

commit d7fa3e32421c73adfa522adfeeb970edd4c22eb3
Author: Shixiong Zhu <shixiong@...>
Date:   2016-10-13T20:31:50Z

    [SPARK-17834][SQL] Fetch the earliest offsets manually in KafkaSource 
instead of counting on KafkaConsumer
    
    ## What changes were proposed in this pull request?
    
    Because `KafkaConsumer.poll(0)` may update the partition offsets, this PR just calls `seekToBeginning` to manually set the KafkaSource's initial offsets to the earliest offsets.
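    
    A minimal sketch of the approach, assuming the kafka-clients 0.10 consumer API (this is not the actual KafkaSource code):
    
    ```scala
    import scala.collection.JavaConverters._
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.TopicPartition
    
    // seekToBeginning + position yields the earliest offsets deterministically,
    // instead of relying on poll(0) not to move the assigned offsets.
    def earliestOffsets(
        consumer: KafkaConsumer[Array[Byte], Array[Byte]],
        partitions: Seq[TopicPartition]): Map[TopicPartition, Long] = {
      consumer.assign(partitions.asJava)
      consumer.seekToBeginning(partitions.asJava)
      partitions.map(tp => tp -> consumer.position(tp)).toMap
    }
    ```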
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #15397 from zsxwing/SPARK-17834.
    
    (cherry picked from commit 08eac356095c7faa2b19d52f2fb0cbc47eb7d1d1)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit c53b8374911e801ed98c1436c384f0aef076eaab
Author: Davies Liu <davies@...>
Date:   2016-10-14T21:45:20Z

    [SPARK-17863][SQL] should not add column into Distinct
    
    ## What changes were proposed in this pull request?
    
    When resolving the attributes in Sort, we pull up some columns from the grandchild into the child, but that is wrong when the child is a Distinct: the added column changes the behavior of Distinct, so we should not do that.
    
    ## How was this patch tested?
    
    Added regression test.
    
    Author: Davies Liu <dav...@databricks.com>
    
    Closes #15489 from davies/order_distinct.
    
    (cherry picked from commit da9aeb0fde589f7c21c2f4a32036a68c0353965d)
    Signed-off-by: Yin Huai <yh...@databricks.com>

commit 2a1b10b649a8d4c077a0e19df976f1fd36b7e266
Author: Jun Kim <i2r.jun@...>
Date:   2016-10-15T07:36:55Z

    [SPARK-17953][DOCUMENTATION] Fix typo in SparkSession scaladoc
    
    ## What changes were proposed in this pull request?
    
    ### Before:
    ```scala
    SparkSession.builder()
         .master("local")
         .appName("Word Count")
         .config("spark.some.config.option", "some-value").
         .getOrCreate()
    ```
    
    ### After:
    ```scala
    SparkSession.builder()
         .master("local")
         .appName("Word Count")
         .config("spark.some.config.option", "some-value")
         .getOrCreate()
    ```
    
    There was one unexpected dot!
    
    Author: Jun Kim <i2r....@gmail.com>
    
    Closes #15498 from tae-jun/SPARK-17953.
    
    (cherry picked from commit 36d81c2c68ef4114592b069287743eb5cb078318)
    Signed-off-by: Reynold Xin <r...@databricks.com>

commit 3cc2fe5b94d3bcdfb4f28bfa6d8e51fe67d6e1b4
Author: Dongjoon Hyun <dongjoon@...>
Date:   2016-10-17T05:15:47Z

    [SPARK-17819][SQL][BRANCH-2.0] Support default database in connection URIs 
for Spark Thrift Server
    
    ## What changes were proposed in this pull request?
    
    Currently, Spark Thrift Server ignores the default database in the URI. This PR adds support for it, like the following.
    
    ```sql
    $ bin/beeline -u jdbc:hive2://localhost:10000 -e "create database testdb"
    $ bin/beeline -u jdbc:hive2://localhost:10000/testdb -e "create table t(a int)"
    $ bin/beeline -u jdbc:hive2://localhost:10000/testdb -e "show tables"
    ...
    +------------+--------------+--+
    | tableName  | isTemporary  |
    +------------+--------------+--+
    | t          | false        |
    +------------+--------------+--+
    1 row selected (0.347 seconds)
    $ bin/beeline -u jdbc:hive2://localhost:10000 -e "show tables"
    ...
    +------------+--------------+--+
    | tableName  | isTemporary  |
    +------------+--------------+--+
    +------------+--------------+--+
    No rows selected (0.098 seconds)
    ```
    
    ## How was this patch tested?
    
    Passed Jenkins with a newly added test suite.
    
    Author: Dongjoon Hyun <dongj...@apache.org>
    
    Closes #15507 from dongjoon-hyun/SPARK-17819-BACK.

commit ca66f52ff81c19e17ca3733eac92d66012a3ec6e
Author: Weiqing Yang <yangweiqing001@...>
Date:   2016-10-17T05:38:30Z

    [MINOR][SQL] Add prettyName for current_database function
    
    ## What changes were proposed in this pull request?
    Added a `prettyName` for the `current_database` function.
    
    ## How was this patch tested?
    Manually.
    
    Before:
    ```
    scala> sql("select current_database()").show
    +-----------------+
    |currentdatabase()|
    +-----------------+
    |          default|
    +-----------------+
    ```
    
    After:
    ```
    scala> sql("select current_database()").show
    +------------------+
    |current_database()|
    +------------------+
    |           default|
    +------------------+
    ```
    
    Author: Weiqing Yang <yangweiqing...@gmail.com>
    
    Closes #15506 from weiqingy/prettyName.
    
    (cherry picked from commit 56b0f5f4d1d7826737b81ebc4ec5dad83b6463e3)
    Signed-off-by: Reynold Xin <r...@databricks.com>

commit d1a02117862b20d0e8e58f4c6da6a97665a02590
Author: gatorsmile <gatorsmile@...>
Date:   2016-10-17T07:29:53Z

    [SPARK-17892][SQL][2.0] Do Not Optimize Query in CTAS More Than Once #15048
    
    ### What changes were proposed in this pull request?
    This PR is to backport https://github.com/apache/spark/pull/15048 and 
https://github.com/apache/spark/pull/15459.
    
    However, in 2.0, we do not have a unified logical node `CreateTable` and 
the analyzer rule `PreWriteCheck` is also different. To minimize the code 
changes, this PR adds a new rule `AnalyzeCreateTableAsSelect`. Please treat it 
as a new PR to review. Thanks!
    
    As explained in https://github.com/apache/spark/pull/14797:
    >Some analyzer rules have assumptions on logical plans, and the optimizer may break these assumptions; we should not pass an optimized query plan into QueryExecution (it will be analyzed again), otherwise we may hit some weird bugs.
    For example, we have a rule for decimal calculation that promotes the precision before binary operations, using PromotePrecision as a placeholder to indicate that this rule should not apply twice. But an optimizer rule will remove this placeholder, which breaks the assumption; the rule is then applied twice and causes a wrong result.
    
    We should not optimize the query in CTAS more than once. For example,
    ```Scala
    spark.range(99, 101).createOrReplaceTempView("tab1")
    val sqlStmt = "SELECT id, cast(id as long) * cast('1.0' as decimal(38, 18)) as num FROM tab1"
    sql(s"CREATE TABLE tab2 USING PARQUET AS $sqlStmt")
    checkAnswer(spark.table("tab2"), sql(sqlStmt))
    ```
    Before this PR, the results do not match
    ```
    == Results ==
    !== Correct Answer - 2 ==       == Spark Answer - 2 ==
    ![100,100.000000000000000000]   [100,null]
     [99,99.000000000000000000]     [99,99.000000000000000000]
    ```
    After this PR, the results match.
    ```
    +---+----------------------+
    |id |num                   |
    +---+----------------------+
    |99 |99.000000000000000000 |
    |100|100.000000000000000000|
    +---+----------------------+
    ```
    
    In this PR, we do not treat the `query` in CTAS as a child. Thus, the 
`query` will not be optimized when optimizing CTAS statement. However, we still 
need to analyze it for normalizing and verifying the CTAS in the Analyzer. 
Thus, we do it in the analyzer rule `PreprocessDDL`, because so far only this 
rule needs the analyzed plan of the `query`.
    
    ### How was this patch tested?
    
    Author: gatorsmile <gatorsm...@gmail.com>
    
    Closes #15502 from gatorsmile/ctasOptimize2.0.

commit a0d9015b3f34582c5d43bd31bbf35a0e92b1da29
Author: Maxime Rihouey <maxime.rihouey@...>
Date:   2016-10-17T09:56:22Z

    Fix example of tf_idf with minDocFreq
    
    ## What changes were proposed in this pull request?
    
    The Python example for tf_idf with the parameter "minDocFreq" is not set up properly, because the same variable is used to transform the documents both with and without the "minDocFreq" parameter.
    The IDF(minDocFreq=2) model is stored in the variable "idfIgnore", but then the original variable "idf" is used to transform "tf" instead of "idfIgnore".
    
    ## How was this patch tested?
    
    Before the fix, the results for "tfidf" and "tfidfIgnore" were the same:
    tfidf:
    (1048576,[1046921],[3.75828890549])
    (1048576,[1046920],[3.75828890549])
    (1048576,[1046923],[3.75828890549])
    (1048576,[892732],[3.75828890549])
    (1048576,[892733],[3.75828890549])
    (1048576,[892734],[3.75828890549])
    tfidfIgnore:
    (1048576,[1046921],[3.75828890549])
    (1048576,[1046920],[3.75828890549])
    (1048576,[1046923],[3.75828890549])
    (1048576,[892732],[3.75828890549])
    (1048576,[892733],[3.75828890549])
    (1048576,[892734],[3.75828890549])
    
    After the fix, they are as they should be:
    tfidf:
    (1048576,[1046921],[3.75828890549])
    (1048576,[1046920],[3.75828890549])
    (1048576,[1046923],[3.75828890549])
    (1048576,[892732],[3.75828890549])
    (1048576,[892733],[3.75828890549])
    (1048576,[892734],[3.75828890549])
    tfidfIgnore:
    (1048576,[1046921],[0.0])
    (1048576,[1046920],[0.0])
    (1048576,[1046923],[0.0])
    (1048576,[892732],[0.0])
    (1048576,[892733],[0.0])
    (1048576,[892734],[0.0])
    
    Author: Maxime Rihouey <maxime.riho...@gmail.com>
    
    Closes #15503 from maximerihouey/patch-1.
    
    (cherry picked from commit e3bf37fa3ada43624b2e77bef90ad3d3dbcd8ce1)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 881e0eb05782ea74cf92a62954466b14ea9e05b6
Author: Tathagata Das <tathagata.das1565@...>
Date:   2016-10-17T23:56:40Z

    [SPARK-17731][SQL][STREAMING] Metrics for structured streaming for 
branch-2.0
    
    **This PR adds the same metrics to branch-2.0 that was added to master in 
#15307.**
    
    The differences compared to the #15307 are
    - The new configuration is added only in the `SQLConf` object (i.e. 
`SQLConf.STREAMING_METRICS_ENABLED`) and not in the `SQLConf` class (i.e. no 
`SQLConf.isStreamingMetricsEnabled`). Spark master has all the streaming 
configurations exposed as actual fields in SQLConf class (e.g. 
[streamingPollingDelay](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L642)),
 but [not in Spark 
2.0](https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L608).
 So I didn't add it in this 2.0 PR.
    
    - In the previous master PR, the above configuration was read in 
`StreamExecution` as 
`sparkSession.sessionState.conf.isStreamingMetricsEnabled`. In this 2.0 PR, I 
am instead reading it as 
`sparkSession.conf.get(STREAMING_METRICS_ENABLED)`(i.e. no `sessionState`) to 
keep it consistent with how other confs are read in `StreamExecution` (e.g. 
[STREAMING_POLLING_DELAY](https://github.com/tdas/spark/blob/ee8e899e4c274c363a8b4d13e8bf57b0b467a50e/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L62)).
    
    - Different Mima exclusions
    
    ------
    
    ## What changes were proposed in this pull request?
    
    Metrics are needed for monitoring structured streaming apps. Here is the 
design doc for implementing the necessary metrics.
    
https://docs.google.com/document/d/1NIdcGuR1B3WIe8t7VxLrt58TJB4DtipWEbj5I_mzJys/edit?usp=sharing
    
    Specifically, this PR adds the following public APIs changes.
    
    - `StreamingQuery.status` returns a `StreamingQueryStatus` object (renamed 
from `StreamingQueryInfo`, see later)
    
    - `StreamingQueryStatus` has the following important fields
      - inputRate - Current rate (rows/sec) at which data is being generated by 
all the sources
      - processingRate - Current rate (rows/sec) at which the query is processing data from all the sources
      - ~~outputRate~~ - *Does not work with wholestage codegen*
      - latency - Current average latency between the data being available in 
source and the sink writing the corresponding output
      - sourceStatuses: Array[SourceStatus] - Current statuses of the sources
      - sinkStatus: SinkStatus - Current status of the sink
      - triggerStatus - Low-level detailed status of the last 
completed/currently active trigger
        - latencies - getOffset, getBatch, full trigger, wal writes
        - timestamps - trigger start, finish, after getOffset, after getBatch
        - numRows - input, output, state total/updated rows for aggregations
    
    - `SourceStatus` has the following important fields
      - inputRate - Current rate (rows/sec) at which data is being generated by 
the source
      - processingRate - Current rate (rows/sec) at which the query is 
processing data from the source
      - triggerStatus - Low-level detailed status of the last 
completed/currently active trigger
    
    - Python API for `StreamingQuery.status()`
    
    ### Breaking changes to existing APIs
    
    **Existing direct public facing APIs**
    - Deprecated direct public-facing APIs `StreamingQuery.sourceStatuses` and 
`StreamingQuery.sinkStatus` in favour of 
`StreamingQuery.status.sourceStatuses/sinkStatus`.
      - Branch 2.0 should have them deprecated; master should have them removed.
    
    **Existing advanced listener APIs**
    - `StreamingQueryInfo` renamed to `StreamingQueryStatus` for consistency 
with `SourceStatus`, `SinkStatus`
       - Earlier StreamingQueryInfo was used only in the advanced listener API, 
but now it is used in direct public-facing API (StreamingQuery.status)
    
    - Field `queryInfo` in listener events `QueryStarted`, `QueryProgress`, `QueryTerminated` changed to have the name `queryStatus` and return type `StreamingQueryStatus`.
    
    - Field `offsetDesc` in `SourceStatus` was `Option[String]`; converted it to `String`.
    
    - For `SourceStatus` and `SinkStatus` made constructor private instead of 
private[sql] to make them more java-safe. Instead added `private[sql] object 
SourceStatus/SinkStatus.apply()` which are harder to accidentally use in Java.
    
    ## How was this patch tested?
    
    Old and new unit tests.
    - Rate calculation and other internal logic of StreamMetrics tested by 
StreamMetricsSuite.
    - New info in statuses returned through StreamingQueryListener is tested in 
StreamingQueryListenerSuite.
    - New and old info returned through StreamingQuery.status is tested in 
StreamingQuerySuite.
    - Source-specific tests for making sure input rows are counted are in source-specific test suites.
    - Additional tests to test minor additions in LocalTableScanExec, 
StateStore, etc.
    
    Metrics also manually tested using Ganglia sink
    
    Author: Tathagata Das <tathagata.das1...@gmail.com>
    
    Closes #15472 from tdas/SPARK-17731-branch-2.0.

commit 01520de6b999c73b5e302778487d8bd1db8fdd2e
Author: Liwei Lin <lwlin7@...>
Date:   2016-10-18T07:49:57Z

    [SQL][STREAMING][TEST] Fix flaky tests in StreamingQueryListenerSuite
    
    This work has largely been done by lw-lin in his PR #15497. This is a 
slight refactoring of it.
    
    ## What changes were proposed in this pull request?
    There were two sources of flakiness in StreamingQueryListener test.
    
    - When testing with manual clock, consecutive attempts to advance the clock 
can occur without the stream execution thread being unblocked and doing some 
work between the two attempts. Hence the following can happen with the current 
ManualClock.
    ```
    +-----------------------------------+--------------------------------+
    |      StreamExecution thread       |         testing thread         |
    +-----------------------------------+--------------------------------+
    |  ManualClock.waitTillTime(100) {  |                                |
    |        _isWaiting = true          |                                |
    |            wait(10)               |                                |
    |        still in wait(10)          |  if (_isWaiting) advance(100)  |
    |        still in wait(10)          |  if (_isWaiting) advance(200)  | <- this should be disallowed!
    |        still in wait(10)          |  if (_isWaiting) advance(300)  | <- this should be disallowed!
    |      wake up from wait(10)        |                                |
    |       current time is 600         |                                |
    |       _isWaiting = false          |                                |
    |  }                                |                                |
    +-----------------------------------+--------------------------------+
    ```
    
    - The second source of flakiness is that adding data to the memory stream may get processed in any trigger, not just the first one.
    
    My fix is to make the manual clock wait for the stream execution thread to start waiting for the clock at the right wait start time. That is, `advance(200)` (see above) will wait for the stream execution thread to complete the wait that started at time 0, and then start a new wait at time 200 (i.e. the timestamp after the previous `advance(100)`).
    
    In addition, since this feature is used solely by StreamExecution, I removed all the non-generic code from ManualClock and put it in StreamManualClock inside StreamTest.
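    
    A hedged sketch of the idea behind such a clock (simplified and hypothetical, not the actual StreamManualClock): `advance` only proceeds once the stream thread is actually waiting, and waiting from the expected start time.
    
    ```scala
    class StreamManualClockSketch(start: Long = 0L) {
      private var now = start
      private var waitStartTime: Option[Long] = None // time at which the current wait began
    
      def waitTillTime(targetTime: Long): Long = synchronized {
        waitStartTime = Some(now)
        notifyAll()
        while (now < targetTime) wait() // blocked until advanced past targetTime
        waitStartTime = None
        now
      }
    
      def advance(duration: Long, expectedWaitStart: Long): Unit = synchronized {
        // do not advance until the stream thread is waiting, and from the expected start time
        while (waitStartTime != Some(expectedWaitStart)) wait()
        now += duration
        notifyAll()
      }
    }
    ```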
    
    ## How was this patch tested?
    Ran the existing unit test MANY TIMES in Jenkins
    
    Author: Tathagata Das <tathagata.das1...@gmail.com>
    Author: Liwei Lin <lwl...@gmail.com>
    
    Closes #15519 from tdas/metrics-flaky-test-fix.
    
    (cherry picked from commit 7d878cf2da04800bc4147b05610170865b148c64)
    Signed-off-by: Tathagata Das <tathagata.das1...@gmail.com>

commit 9e806f2a4fbc0e7d1441a3eda375cba2fda8ffe5
Author: Tathagata Das <tathagata.das1565@...>
Date:   2016-10-18T09:29:55Z

    [SQL][STREAMING][TEST] Follow up to remove Option.contains for Scala 2.10 
compatibility
    
    ## What changes were proposed in this pull request?
    
    Scala 2.10 does not have Option.contains, which broke the Scala 2.10 build.
    
    ## How was this patch tested?
    Locally compiled and ran sql/core unit tests in 2.10
    
    Author: Tathagata Das <tathagata.das1...@gmail.com>
    
    Closes #15531 from tdas/metrics-flaky-test-fix-1.

commit 2aa25833c6f40af13a03a813b5f75d515f689577
Author: gatorsmile <gatorsmile@...>
Date:   2016-10-18T17:58:19Z

    [SPARK-17751][SQL][BACKPORT-2.0] Remove spark.sql.eagerAnalysis and Output 
the Plan if Existed in AnalysisException
    
    ### What changes were proposed in this pull request?
    This PR is to backport the fix https://github.com/apache/spark/pull/15316 
to 2.0.
    
    Dataset always does eager analysis now, so `spark.sql.eagerAnalysis` is not used any more and needs to be removed.
    
    This PR also outputs the plan. Without the fix, the analysis error looks like:
    ```
    cannot resolve '`k1`' given input columns: [k, v]; line 1 pos 12
    ```
    
    After the fix, the analysis error becomes:
    ```
    org.apache.spark.sql.AnalysisException: cannot resolve '`k1`' given input 
columns: [k, v]; line 1 pos 12;
    'Project [unresolvedalias(CASE WHEN ('k1 = 2) THEN 22 WHEN ('k1 = 4) THEN 
44 ELSE 0 END, None), v#6]
    +- SubqueryAlias t
       +- Project [_1#2 AS k#5, _2#3 AS v#6]
          +- LocalRelation [_1#2, _2#3]
    ```
    
    ### How was this patch tested?
    N/A
    
    Author: gatorsmile <gatorsm...@gmail.com>
    
    Closes #15529 from gatorsmile/eagerAnalysis20.

commit 26e978a93f029e1a1b5c7524d0b52c8141b70997
Author: Yu Peng <loneknightpy@...>
Date:   2016-10-18T20:23:31Z

    [SPARK-17711] Compress rolled executor log
    
    ## What changes were proposed in this pull request?
    
    This PR adds support for executor log compression.
    
    ## How was this patch tested?
    
    Unit tests
    
    cc: yhuai tdas mengxr
    
    Author: Yu Peng <loneknigh...@gmail.com>
    
    Closes #15285 from loneknightpy/compress-executor-log.
    
    (cherry picked from commit 231f39e3f6641953a90bc4c40444ede63f363b23)
    Signed-off-by: Tathagata Das <tathagata.das1...@gmail.com>

commit 6ef9231377c7cce949dc7a988bb9d7a5cb3e458d
Author: Weiqing Yang <yangweiqing001@...>
Date:   2016-10-18T20:38:14Z

    [MINOR][DOC] Add more built-in sources in sql-programming-guide.md
    
    ## What changes were proposed in this pull request?
    Add more built-in sources in sql-programming-guide.md.
    
    ## How was this patch tested?
    Manually.
    
    Author: Weiqing Yang <yangweiqing...@gmail.com>
    
    Closes #15522 from weiqingy/dsDoc.
    
    (cherry picked from commit 20dd11096cfda51e47b9dbe3b715a12ccbb4ce1d)
    Signed-off-by: Reynold Xin <r...@databricks.com>

commit f6b87939cb90bf4a0996b3728c1bccdf5e24dd4e
Author: cody koeninger <cody@...>
Date:   2016-10-18T21:01:49Z

    [SPARK-17841][STREAMING][KAFKA] drain commitQueue
    
    ## What changes were proposed in this pull request?
    
    Actually drain the commit queue rather than just iterating it: iterator() on a ConcurrentLinkedQueue won't remove items from the queue, but poll() will.
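    
    A minimal illustration of the difference, using a standalone queue (not the actual DirectKafkaInputDStream code):
    
    ```scala
    import java.util.concurrent.ConcurrentLinkedQueue
    
    val commitQueue = new ConcurrentLinkedQueue[String]()
    commitQueue.add("topic-0 -> offset 42") // hypothetical entry
    
    // Draining: poll() removes and returns the head, or null once the queue is empty.
    var entry = commitQueue.poll()
    while (entry != null) {
      // ... merge `entry` into the offsets to be committed ...
      entry = commitQueue.poll()
    }
    assert(commitQueue.isEmpty) // iterating with iterator() would have left the items in place
    ```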
    
    ## How was this patch tested?
    Unit tests
    
    Author: cody koeninger <c...@koeninger.org>
    
    Closes #15407 from koeninger/SPARK-17841.
    
    (cherry picked from commit cd106b050ff789b6de539956a7f01159ab15c820)
    Signed-off-by: Reynold Xin <r...@databricks.com>

commit 99943bf6905ca82a2c3e16e5d807fb572fa3dd3b
Author: Tathagata Das <tathagata.das1565@...>
Date:   2016-10-19T00:31:21Z

    [SPARK-17731][SQL][STREAMING][FOLLOWUP] Refactored StreamingQueryListener 
APIs for branch-2.0
    
    This is the branch-2.0 PR of #15530 to make the APIs consistent with the 
master. Since these APIs are experimental and not direct user facing 
(StreamingQueryListener is advanced Structured Streaming APIs), its okay to 
change them in branch-2.0.
    
    ## What changes were proposed in this pull request?
    
    As per rxin request, here are further API changes
    - Changed `Stream(Started/Progress/Terminated)` events to `Stream*Event`
    - Changed the fields in `StreamingQueryListener.on***` from `query*` to 
`event`
    
    ## How was this patch tested?
    Existing unit tests.
    
    Author: Tathagata Das <tathagata.das1...@gmail.com>
    
    Closes #15535 from tdas/SPARK-17731-1-branch-2.0.

commit 3796a98cf3efad1dcbef536b295c7c47bf47d5dd
Author: Yu Peng <loneknightpy@...>
Date:   2016-10-19T02:43:08Z

    [SPARK-17711][TEST-HADOOP2.2] Fix hadoop2.2 compilation error
    
    ## What changes were proposed in this pull request?
    
    Fix hadoop2.2 compilation error.
    
    ## How was this patch tested?
    
    Existing tests.
    
    cc tdas zsxwing
    
    Author: Yu Peng <loneknigh...@gmail.com>
    
    Closes #15537 from loneknightpy/fix-17711.
    
    (cherry picked from commit 2629cd74602cfe77188b76428fed62a7a7149315)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit cdd2570e6dbfc5af68d0c9a49e4493e4e5e53020
Author: Tommy YU <tummyyu@...>
Date:   2016-10-19T04:15:32Z

    [SPARK-18001][DOCUMENT] fix broken link to SparkDataFrame
    
    ## What changes were proposed in this pull request?
    
    In http://spark.apache.org/docs/latest/sql-programming-guide.html, Section "Untyped Dataset Operations (aka DataFrame Operations)", the link to the R DataFrame doesn't work; it returns "The requested URL /docs/latest/api/R/DataFrame.html was not found on this server."
    
    The correct link is SparkDataFrame.html for Spark 2.0.
    
    ## How was this patch tested?
    
    Manually checked.
    
    Author: Tommy YU <tumm...@163.com>
    
    Closes #15543 from Wenpei/spark-18001.
    
    (cherry picked from commit f39852e59883c214b0d007faffb406570ea3084b)
    Signed-off-by: Reynold Xin <r...@databricks.com>

commit 995f602d27bdcf9e6787d93dbea2357e6dc6ccaa
Author: hyukjinkwon <gurwls223@...>
Date:   2016-10-20T02:36:21Z

    [SPARK-17989][SQL] Check ascendingOrder type in sort_array function rather 
than throwing ClassCastException
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to check the type of the second argument, `ascendingOrder`, rather than throwing a `ClassCastException`.
    
    ```sql
    select sort_array(array('b', 'd'), '1');
    ```
    
    **Before**
    
    ```
    16/10/19 13:16:08 ERROR SparkSQLDriver: Failed in [select 
sort_array(array('b', 'd'), '1')]
    java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String 
cannot be cast to java.lang.Boolean
        at scala.runtime.BoxesRunTime.unboxToBoolean(BoxesRunTime.java:85)
        at 
org.apache.spark.sql.catalyst.expressions.SortArray.nullSafeEval(collectionOperations.scala:185)
        at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:416)
        at 
org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:50)
        at 
org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:43)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
        at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:297)
    ```
    
    **After**
    
    ```
    Error in query: cannot resolve 'sort_array(array('b', 'd'), '1')' due to 
data type mismatch: Sort order in second argument requires a boolean literal.; 
line 1 pos 7;
    ```
    
    ## How was this patch tested?
    
    Unit test in `DataFrameFunctionsSuite`.
    
    Author: hyukjinkwon <gurwls...@gmail.com>
    
    Closes #15532 from HyukjinKwon/SPARK-17989.
    
    (cherry picked from commit 4b2011ec9da1245923b5cbd883240fef0dbf3ef0)
    Signed-off-by: Reynold Xin <r...@databricks.com>

commit 4131623a8585fe99f79d82c24ab3b8b506d0d616
Author: WeichenXu <weichenxu123@...>
Date:   2016-10-20T06:41:38Z

    [SPARK-18003][SPARK CORE] Fix bug of RDD zipWithIndex & zipWithUniqueId 
index value overflowing
    
    ## What changes were proposed in this pull request?
    
    - Fix a bug where RDD `zipWithIndex` generates wrong results when one partition contains more than 2147483647 records.
    
    - Fix a bug where RDD `zipWithUniqueId` generates wrong results when one partition contains more than 2147483647 records (see the sketch below).
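    
    The overflow pattern being fixed, as a hedged standalone sketch (a hypothetical helper, not the actual RDD code):
    
    ```scala
    // `startIndex` is the number of records in all preceding partitions.
    def indexPartition[T](iter: Iterator[T], startIndex: Long): Iterator[(T, Long)] = {
      var i = 0L // Long counter: safe for partitions with more than Int.MaxValue records
      iter.map { t =>
        val indexed = (t, startIndex + i)
        i += 1L
        indexed
      }
    }
    ```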
    
    ## How was this patch tested?
    
    test added.
    
    Author: WeichenXu <weichenxu...@outlook.com>
    
    Closes #15550 from WeichenXu123/fix_rdd_zipWithIndex_overflow.
    
    (cherry picked from commit 39755169fb5bb07332eef263b4c18ede1528812d)
    Signed-off-by: Reynold Xin <r...@databricks.com>

commit e8923d21dd9f230e0ac23582033442e6fe476611
Author: jerryshao <sshao@...>
Date:   2016-10-20T17:50:34Z

    [SPARK-17999][KAFKA][SQL] Add getPreferredLocations for KafkaSourceRDD
    
    ## What changes were proposed in this pull request?
    
    The newly implemented Structured Streaming `KafkaSource` did calculate the preferred locations for each topic partition, but didn't offer this information through RDD's `getPreferredLocations` method. So this PR proposes to add this method to `KafkaSourceRDD`.
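    
    A hedged sketch of what exposing locations through that hook looks like for a custom RDD (a toy class, not the actual `KafkaSourceRDD`); the scheduler only sees the locations an RDD returns from `getPreferredLocations`:
    
    ```scala
    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD
    
    // Toy RDD: `locations` maps partition index -> preferred executor hosts (assumed input).
    class LocatedRDD(sc: SparkContext, locations: Map[Int, Seq[String]])
      extends RDD[Int](sc, Nil) {
    
      private case class SimplePartition(index: Int) extends Partition
    
      override protected def getPartitions: Array[Partition] =
        locations.keys.toArray.sorted.map(i => SimplePartition(i): Partition)
    
      override def compute(split: Partition, context: TaskContext): Iterator[Int] =
        Iterator(split.index)
    
      override def getPreferredLocations(split: Partition): Seq[String] =
        locations.getOrElse(split.index, Nil)
    }
    ```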
    
    ## How was this patch tested?
    
    Manual verification.
    
    Author: jerryshao <ss...@hortonworks.com>
    
    Closes #15545 from jerryshao/SPARK-17999.
    
    (cherry picked from commit 947f4f25273161dc4719419a35613a71c2e2a150)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 6cc6cb2a95cbf5db7d1f7392a9e64e58af7ebc73
Author: Felix Cheung <felixcheung_m@...>
Date:   2016-10-21T04:12:55Z

    [SPARKR] fix warnings
    
    ## What changes were proposed in this pull request?
    
    Fix for a bunch of test warnings that were added recently.
    We need to investigate why warnings are not turning into errors.
    
    ```
    Warnings 
-----------------------------------------------------------------------
    1. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use 
Sepal_Length instead of Sepal.Length  as column name
    
    2. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use 
Sepal_Width instead of Sepal.Width  as column name
    
    3. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use 
Petal_Length instead of Petal.Length  as column name
    
    4. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use 
Petal_Width instead of Petal.Width  as column name
    
    Consider adding
      importFrom("utils", "object.size")
    to your NAMESPACE file.
    ```
    
    ## How was this patch tested?
    
    unit tests
    
    Author: Felix Cheung <felixcheun...@hotmail.com>
    
    Closes #15560 from felixcheung/rwarnings.
    
    (cherry picked from commit 3180272d2d49e440516085c0e4aebd5bad18bcad)
    Signed-off-by: Felix Cheung <felixche...@apache.org>

commit a65d40ab63fecc993136a98b8a820d2a8893a9ba
Author: Josh Rosen <joshrosen@...>
Date:   2016-10-21T18:25:01Z

    [SPARK-18034] Upgrade to MiMa 0.1.11 to fix flakiness
    
    We should upgrade to the latest release of MiMa (0.1.11) in order to 
include a fix for a bug which led to flakiness in the MiMa checks 
(https://github.com/typesafehub/migration-manager/issues/115).
    
    Author: Josh Rosen <joshro...@databricks.com>
    
    Closes #15571 from JoshRosen/SPARK-18034.

commit 78458a7ebeba6758890b01cc2b7417ab2fda221e
Author: Hossein <hossein@...>
Date:   2016-10-21T19:38:52Z

    [SPARK-17811] SparkR cannot parallelize data.frame with NA or NULL in Date 
columns
    
    ## What changes were proposed in this pull request?
    NA date values are serialized as "NA" and NA time values are serialized as NaN from R. The backend did not have proper logic to deal with them; as a result, we got an IllegalArgumentException for Date and a wrong value for time. This PR adds support for deserializing NA as Date and Time.
    
    ## How was this patch tested?
    * [x] TODO
    
    Author: Hossein <hoss...@databricks.com>
    
    Closes #15421 from falaki/SPARK-17811.
    
    (cherry picked from commit e371040a0150e4ed748a7c25465965840b61ca63)
    Signed-off-by: Felix Cheung <felixche...@apache.org>

commit af2e6e0c9c85c40bc505ed1183857a8fb60fbd72
Author: Tathagata Das <tathagata.das1565@...>
Date:   2016-10-21T20:07:29Z

    [SPARK-17926][SQL][STREAMING] Added json for statuses
    
    ## What changes were proposed in this pull request?
    
    StreamingQueryStatus exposed through StreamingQueryListener often needs to 
be recorded (similar to SparkListener events). This PR adds `.json` and 
`.prettyJson` to `StreamingQueryStatus`, `SourceStatus` and `SinkStatus`.
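    
    A hedged usage sketch (assuming an already running `StreamingQuery` named `query`):
    
    ```scala
    // The status can now be logged or persisted as JSON alongside other listener events.
    println(query.status.json)       // compact JSON
    println(query.status.prettyJson) // pretty-printed JSON
    ```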
    
    ## How was this patch tested?
    New unit tests
    
    Author: Tathagata Das <tathagata.das1...@gmail.com>
    
    Closes #15476 from tdas/SPARK-17926.
    
    (cherry picked from commit 7a531e3054f8d4820216ed379433559f57f571b8)
    Signed-off-by: Yin Huai <yh...@databricks.com>

commit b113b5d9ff100385154ef0f836feb9805db163d2
Author: w00228970 <wangfei1@...>
Date:   2016-10-21T21:43:55Z

    [SPARK-17929][CORE] Fix deadlock when CoarseGrainedSchedulerBackend reset
    
    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/browse/SPARK-17929
    
    Now `CoarseGrainedSchedulerBackend.reset` will take the lock:
    ```
      protected def reset(): Unit = synchronized {
        numPendingExecutors = 0
        executorsPendingToRemove.clear()
    
        // Remove all the lingering executors that should be removed but not 
yet. The reason might be
        // because (1) disconnected event is not yet received; (2) executors 
die silently.
        executorDataMap.toMap.foreach { case (eid, _) =>
          driverEndpoint.askWithRetry[Boolean](
            RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager 
re-registered.")))
        }
      }
    ```
     but `removeExecutor` also needs the lock `CoarseGrainedSchedulerBackend.this.synchronized`, which will cause a deadlock:
    
    ```
       private def removeExecutor(executorId: String, reason: 
ExecutorLossReason): Unit = {
          logDebug(s"Asked to remove executor $executorId with reason $reason")
          executorDataMap.get(executorId) match {
            case Some(executorInfo) =>
              // This must be synchronized because variables mutated
              // in this block are read when requesting executors
              val killed = CoarseGrainedSchedulerBackend.this.synchronized {
                addressToExecutorId -= executorInfo.executorAddress
                executorDataMap -= executorId
                executorsPendingLossReason -= executorId
                executorsPendingToRemove.remove(executorId).getOrElse(false)
              }
         ...
    ```
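    
    A hedged, self-contained illustration of the usual way out of this kind of deadlock (a generic pattern, not necessarily the exact SPARK-17929 fix): snapshot the shared state while holding the lock, then do the lock-taking work outside it.
    
    ```scala
    import scala.collection.mutable
    
    object ResetPatternSketch {
      private val executorDataMap = mutable.HashMap("exec-1" -> "data", "exec-2" -> "data")
    
      private def removeExecutor(id: String): Unit = synchronized {
        executorDataMap -= id // needs the same lock that reset() used to hold throughout
      }
    
      def reset(): Unit = {
        val stale = synchronized { executorDataMap.keys.toList } // snapshot under the lock
        stale.foreach(removeExecutor) // re-acquires the lock per call, never while reset() holds it
      }
    }
    ```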
    
    ## How was this patch tested?
    
    manual test.
    
    Author: w00228970 <wangf...@huawei.com>
    
    Closes #15481 from scwf/spark-17929.
    
    (cherry picked from commit c1f344f1a09b8834bec70c1ece30b9bff63e55ea)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 3e9840f1d923a521d64bfc55fcbb6babd6045f06
Author: cody koeninger <cody@...>
Date:   2016-10-21T22:55:04Z

    [SPARK-17812][SQL][KAFKA] Assign and specific startingOffsets for 
structured stream
    
    ## What changes were proposed in this pull request?
    
    `startingOffsets` takes specific per-TopicPartition offsets as a JSON argument, usable with any consumer strategy.
    
    `assign` takes specific TopicPartitions as a consumer strategy.
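    
    A hedged usage sketch (assuming an existing SparkSession `spark`; the JSON shapes follow the description above, with -2/-1 standing for earliest/latest):
    
    ```scala
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("assign", """{"topicA":[0,1]}""")                    // consumer strategy: specific partitions
      .option("startingOffsets", """{"topicA":{"0":23,"1":-2}}""") // per-partition starting offsets
      .load()
    ```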
    
    ## How was this patch tested?
    
    Unit tests
    
    Author: cody koeninger <c...@koeninger.org>
    
    Closes #15504 from koeninger/SPARK-17812.
    
    (cherry picked from commit 268ccb9a48dfefc4d7bc85155e7e20a2dfe89307)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

----

