[GitHub] spark pull request #21763: Branch 2.1

rajesh7738 Fri, 13 Jul 2018 09:59:07 -0700

GitHub user rajesh7738 opened a pull request:

    https://github.com/apache/spark/pull/21763


    Branch 2.1

    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21763.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21763
    
----
commit 664c9795c94d3536ff9fe54af06e0fb6c0012862
Author: Shixiong Zhu <shixiong@...>
Date:   2017-03-04T03:00:35Z

    [SPARK-19816][SQL][TESTS] Fix an issue that DataFrameCallbackSuite doesn't 
recover the log level
    
    ## What changes were proposed in this pull request?
    
    "DataFrameCallbackSuite.execute callback functions when a DataFrame action 
failed" sets the log level to "fatal" but doesn't recover it. Hence, tests 
running after it won't output any logs except fatal logs.
    
    This PR uses `testQuietly` instead to avoid changing the log level.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #17156 from zsxwing/SPARK-19816.
    
    (cherry picked from commit fbc4058037cf5b0be9f14a7dd28105f7f8151bed)
    Signed-off-by: Yin Huai <yh...@databricks.com>

commit ca7a7e8a893a30d85e4315a4fa1ca1b1c56a703c
Author: uncleGen <hustyugm@...>
Date:   2017-03-06T02:17:30Z

    [SPARK-19822][TEST] CheckpointSuite.testCheckpointedOperation: should not 
filter checkpointFilesOfLatestTime with the PATH string.
    
    ## What changes were proposed in this pull request?
    
    
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73800/testReport/
    
    ```
    sbt.ForkMain$ForkError: 
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code
    passed to eventually never returned normally. Attempted 617 times over 
10.003740484 seconds.
    Last failure message: 8 did not equal 2.
        at 
org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
        at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
        at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
        at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:336)
        at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
        at 
org.apache.spark.streaming.DStreamCheckpointTester$class.generateOutput(CheckpointSuite
    .scala:172)
        at 
org.apache.spark.streaming.CheckpointSuite.generateOutput(CheckpointSuite.scala:211)
    ```
    
    the check condition is:
    
    ```
    val checkpointFilesOfLatestTime = 
Checkpoint.getCheckpointFiles(checkpointDir).filter {
         _.toString.contains(clock.getTimeMillis.toString)
    }
    // Checkpoint files are written twice for every batch interval. So assert 
that both
    // are written to make sure that both of them have been written.
    assert(checkpointFilesOfLatestTime.size === 2)
    ```
    
    the path string may contain the `clock.getTimeMillis.toString`, like `3500` 
:
    
    ```
    
file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-500
    
file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-1000
    
file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-1500
    
file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-2000
    
file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-2500
    
file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3000
    
file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3500.bk
    
file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3500
                                                           â²â²â²â²
    ```
    
    so we should only check the filename, but not the whole path.
    
    ## How was this patch tested?
    
    Jenkins.
    
    Author: uncleGen <husty...@gmail.com>
    
    Closes #17167 from uncleGen/flaky-CheckpointSuite.
    
    (cherry picked from commit 207067ead6db6dc87b0d144a658e2564e3280a89)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit fd6c6d5c363008a229759bf628edc0f6c5e00ade
Author: Tyson Condie <tcondie@...>
Date:   2017-03-07T00:39:05Z

    [SPARK-19719][SS] Kafka writer for both structured streaming and batch 
queires
    
    ## What changes were proposed in this pull request?
    
    Add a new Kafka Sink and Kafka Relation for writing streaming and batch 
queries, respectively, to Apache Kafka.
    ### Streaming Kafka Sink
    - When addBatch is called
    -- If batchId is great than the last written batch
    --- Write batch to Kafka
    ---- Topic will be taken from the record, if present, or from a topic 
option, which overrides topic in record.
    -- Else ignore
    
    ### Batch Kafka Sink
    - KafkaSourceProvider will implement CreatableRelationProvider
    - CreatableRelationProvider#createRelation will write the passed in 
Dataframe to a Kafka
    - Topic will be taken from the record, if present, or from topic option, 
which overrides topic in record.
    - Save modes Append and ErrorIfExist supported under identical semantics. 
Other save modes result in an AnalysisException
    
    tdas zsxwing
    
    ## How was this patch tested?
    
    ### The following unit tests will be included
    - write to stream with topic field: valid stream write with data that 
includes an existing topic in the schema
    - write structured streaming aggregation w/o topic field, with default 
topic: valid stream write with data that does not include a topic field, but 
the configuration includes a default topic
    - write data with bad schema: various cases of writing data that does not 
conform to a proper schema e.g., 1. no topic field or default topic, and 2. no 
value field
    - write data with valid schema but wrong types: data with a complete schema 
but wrong types e.g., key and value types are integers.
    - write to non-existing topic: write a stream to a topic that does not 
exist in Kafka, which has been configured to not auto-create topics.
    - write batch to kafka: simple write batch to Kafka, which goes through the 
same code path as streaming scenario, so validity checks will not be redone 
here.
    
    ### Examples
    ```scala
    // Structured Streaming
    val writer = inputStringStream.map(s => 
s.get(0).toString.getBytes()).toDF("value")
     .selectExpr("value as key", "value as value")
     .writeStream
     .format("kafka")
     .option("checkpointLocation", checkpointDir)
     .outputMode(OutputMode.Append)
     .option("kafka.bootstrap.servers", brokerAddress)
     .option("topic", topic)
     .queryName("kafkaStream")
     .start()
    
    // Batch
    val df = spark
     .sparkContext
     .parallelize(Seq("1", "2", "3", "4", "5"))
     .map(v => (topic, v))
     .toDF("topic", "value")
    
    df.write
     .format("kafka")
     .option("kafka.bootstrap.servers",brokerAddress)
     .option("topic", topic)
     .save()
    ```
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.
    
    Author: Tyson Condie <tcon...@gmail.com>
    
    Closes #17043 from tcondie/kafka-writer.

commit 711addd46e98e42deca97c5b9c0e55fddebaa458
Author: Jason White <jason.white@...>
Date:   2017-03-07T21:14:37Z

    [SPARK-19561] [PYTHON] cast TimestampType.toInternal output to long
    
    ## What changes were proposed in this pull request?
    
    Cast the output of `TimestampType.toInternal` to long to allow for proper 
Timestamp creation in DataFrames near the epoch.
    
    ## How was this patch tested?
    
    Added a new test that fails without the change.
    
    dongjoon-hyun davies Mind taking a look?
    
    The contribution is my original work and I license the work to the project 
under the projectâs open source license.
    
    Author: Jason White <jason.wh...@shopify.com>
    
    Closes #16896 from JasonMWhite/SPARK-19561.
    
    (cherry picked from commit 6f4684622a951806bebe7652a14f7d1ce03e24c7)
    Signed-off-by: Davies Liu <davies....@gmail.com>

commit 551b7bdbe00b9ee803baa18e6b4690c478af9161
Author: Marcelo Vanzin <vanzin@...>
Date:   2017-03-08T00:21:18Z

    [SPARK-19857][YARN] Correctly calculate next credential update time.
    
    Add parentheses so that both lines form a single statement; also add
    a log message so that the issue becomes more explicit if it shows up
    again.
    
    Tested manually with integration test that exercises the feature.
    
    Author: Marcelo Vanzin <van...@cloudera.com>
    
    Closes #17198 from vanzin/SPARK-19857.
    
    (cherry picked from commit 8e41c2eed873e215b13215844ba5ba73a8906c5b)
    Signed-off-by: Marcelo Vanzin <van...@cloudera.com>

commit cbc37007aa07991135a3da13ad566be76a0ef577
Author: Wenchen Fan <wenchen@...>
Date:   2017-03-08T01:15:39Z

    Revert "[SPARK-19561] [PYTHON] cast TimestampType.toInternal output to long"
    
    This reverts commit 6f4684622a951806bebe7652a14f7d1ce03e24c7.

commit 3b648a62626850470f8cceea3f0ec5dfd46e4e33
Author: Shixiong Zhu <shixiong@...>
Date:   2017-03-08T04:34:55Z

    [SPARK-19859][SS] The new watermark should override the old one
    
    ## What changes were proposed in this pull request?
    
    The new watermark should override the old one. Otherwise, we just pick up 
the first column which has a watermark, it may be unexpected.
    
    ## How was this patch tested?
    
    The new test.
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #17199 from zsxwing/SPARK-19859.
    
    (cherry picked from commit d8830c5039d9c7c5ef03631904c32873ab558e22)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 0ba9ecbea88533b2562f2f6045eafeab99d8f0c6
Author: Bryan Cutler <cutlerb@...>
Date:   2017-03-08T04:44:30Z

    [SPARK-19348][PYTHON] PySpark keyword_only decorator is not thread-safe
    
    ## What changes were proposed in this pull request?
    The `keyword_only` decorator in PySpark is not thread-safe.  It writes 
kwargs to a static class variable in the decorator, which is then retrieved 
later in the class method as `_input_kwargs`.  If multiple threads are 
constructing the same class with different kwargs, it becomes a race condition 
to read from the static class variable before it's overwritten.  See 
[SPARK-19348](https://issues.apache.org/jira/browse/SPARK-19348) for 
reproduction code.
    
    This change will write the kwargs to a member variable so that multiple 
threads can operate on separate instances without the race condition.  It does 
not protect against multiple threads operating on a single instance, but that 
is better left to the user to synchronize.
    
    ## How was this patch tested?
    Added new unit tests for using the keyword_only decorator and a regression 
test that verifies `_input_kwargs` can be overwritten from different class 
instances.
    
    Author: Bryan Cutler <cutl...@gmail.com>
    
    Closes #17193 from 
BryanCutler/pyspark-keyword_only-threadsafe-SPARK-19348-2_1.

commit 320eff14b0bb634eba2cdcae2303ba38fd0eb282
Author: Michael Armbrust <michael@...>
Date:   2017-03-08T09:32:42Z

    [SPARK-18055][SQL] Use correct mirror in ExpresionEncoder
    
    Previously, we were using the mirror of passed in `TypeTag` when reflecting 
to build an encoder.  This fails when the outer class is built in (i.e. `Seq`'s 
default mirror is based on root classloader) but inner classes (i.e. `A` in 
`Seq[A]`) are defined in the REPL or a library.
    
    This patch changes us to always reflect based on a mirror created using the 
context classloader.
    
    Author: Michael Armbrust <mich...@databricks.com>
    
    Closes #17201 from marmbrus/replSeqEncoder.
    
    (cherry picked from commit 314e48a3584bad4b486b046bbf0159d64ba857bc)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit f6c1ad2eb6d0706899aabbdd39e558b3488e2ef3
Author: Burak Yavuz <brkyvz@...>
Date:   2017-03-08T22:35:07Z

    [SPARK-19813] maxFilesPerTrigger combo latestFirst may miss old files in 
combination with maxFileAge in FileStreamSource
    
    ## What changes were proposed in this pull request?
    
    **The Problem**
    There is a file stream source option called maxFileAge which limits how old 
the files can be, relative the latest file that has been seen. This is used to 
limit the files that need to be remembered as "processed". Files older than the 
latest processed files are ignored. This values is by default 7 days.
    This causes a problem when both
    latestFirst = true
    maxFilesPerTrigger > total files to be processed.
    Here is what happens in all combinations
    1) latestFirst = false - Since files are processed in order, there wont be 
any unprocessed file older than the latest processed file. All files will be 
processed.
    2) latestFirst = true AND maxFilesPerTrigger is not set - The maxFileAge 
thresholding mechanism takes one batch initialize. If maxFilesPerTrigger is 
not, then all old files get processed in the first batch, and so no file is 
left behind.
    3) latestFirst = true AND maxFilesPerTrigger is set to X - The first batch 
process the latest X files. That sets the threshold latest file - maxFileAge, 
so files older than this threshold will never be considered for processing.
    The bug is with case 3.
    
    **The Solution**
    
    Ignore `maxFileAge` when both `maxFilesPerTrigger` and `latestFirst` are 
set.
    
    ## How was this patch tested?
    
    Regression test in `FileStreamSourceSuite`
    
    Author: Burak Yavuz <brk...@gmail.com>
    
    Closes #17153 from brkyvz/maxFileAge.
    
    (cherry picked from commit a3648b5d4f99ff9461d02f53e9ec71787a3abf51)
    Signed-off-by: Burak Yavuz <brk...@gmail.com>

commit 3457c32297e0150a4fbc80a30f84b9c62ca7c372
Author: Shixiong Zhu <shixiong@...>
Date:   2017-03-08T22:30:54Z

    Revert "[SPARK-19413][SS] MapGroupsWithState for arbitrary stateful 
operations for branch-2.1"
    
    This reverts commit 502c927b8c8a99ef2adf4e6e1d7a6d9232d45ef5.

commit 78cc5721f07af5c561e89d1bbc72975bb67abb74
Author: Dilip Biswal <dbiswal@...>
Date:   2017-03-09T01:33:49Z

    [MINOR][SQL] The analyzer rules are fired twice for cases when 
AnalysisException is raised from analyzer.
    
    ## What changes were proposed in this pull request?
    In general we have a checkAnalysis phase which validates the logical plan 
and throws AnalysisException on semantic errors. However we also can throw 
AnalysisException from a few analyzer rules like ResolveSubquery.
    
    I found that we fire up the analyzer rules twice for the queries that throw 
AnalysisException from one of the analyzer rules. This is a very minor fix. We 
don't have to strictly fix it. I just got confused seeing the rule getting 
fired two times when i was not expecting it.
    
    ## How was this patch tested?
    
    Tested manually.
    
    Author: Dilip Biswal <dbis...@us.ibm.com>
    
    Closes #17214 from dilipbiswal/analyis_twice.
    
    (cherry picked from commit d809ceed9762d5bbb04170e45f38751713112dd8)
    Signed-off-by: Xiao Li <gatorsm...@gmail.com>

commit 00859e148fd1002fa314542953fee61a5d0fb9d9
Author: Shixiong Zhu <shixiong@...>
Date:   2017-03-09T07:15:52Z

    [SPARK-19874][BUILD] Hide API docs for org.apache.spark.sql.internal
    
    ## What changes were proposed in this pull request?
    
    The API docs should not include the "org.apache.spark.sql.internal" package 
because they are internal private APIs.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #17217 from zsxwing/SPARK-19874.
    
    (cherry picked from commit 029e40b412e332c9f0fff283d604e203066c78c0)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 0c140c1682262bc27df94952bda6ad8e3229fda4
Author: uncleGen <hustyugm@...>
Date:   2017-03-09T07:23:10Z

    [SPARK-19859][SS][FOLLOW-UP] The new watermark should override the old one.
    
    ## What changes were proposed in this pull request?
    
    A follow up to SPARK-19859:
    
    - extract the calculation of `delayMs` and reuse it.
    - update EventTimeWatermarkExec
    - use the correct `delayMs` in EventTimeWatermark
    
    ## How was this patch tested?
    
    Jenkins.
    
    Author: uncleGen <husty...@gmail.com>
    
    Closes #17221 from uncleGen/SPARK-19859.
    
    (cherry picked from commit eeb1d6db878641d9eac62d0869a90fe80c1f4461)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 2a76e2420f6976bd2ef2fcf7b8c8db1f0d37c1ad
Author: Jason White <jason.white@...>
Date:   2017-03-09T18:34:54Z

    [SPARK-19561][SQL] add int case handling for TimestampType
    
    ## What changes were proposed in this pull request?
    
    Add handling of input of type `Int` for dataType `TimestampType` to 
`EvaluatePython.scala`. Py4J serializes ints smaller than MIN_INT or larger 
than MAX_INT to Long, which are handled correctly already, but values between 
MIN_INT and MAX_INT are serialized to Int.
    
    These range limits correspond to roughly half an hour on either side of the 
epoch. As a result, PySpark doesn't allow TimestampType values to be created in 
this range.
    
    Alternatives attempted: patching the `TimestampType.toInternal` function to 
cast return values to `long`, so Py4J would always serialize them to Scala 
Long. Python3 does not have a `long` type, so this approach failed on Python3.
    
    ## How was this patch tested?
    
    Added a new PySpark-side test that fails without the change.
    
    The contribution is my original work and I license the work to the project 
under the projectâs open source license.
    
    Resubmission of https://github.com/apache/spark/pull/16896. The original PR 
didn't go through Jenkins and broke the build. davies dongjoon-hyun
    
    cloud-fan Could you kick off a Jenkins run for me? It passed everything for 
me locally, but it's possible something has changed in the last few weeks.
    
    Author: Jason White <jason.wh...@shopify.com>
    
    Closes #17200 from JasonMWhite/SPARK-19561.
    
    (cherry picked from commit 206030bd12405623c00c1ff334663984b9250adb)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit ffe65b06511f3143cb2549073bfbe145663ad561
Author: uncleGen <hustyugm@...>
Date:   2017-03-09T19:07:31Z

    [SPARK-19861][SS] watermark should not be a negative time.
    
    ## What changes were proposed in this pull request?
    
    `watermark` should not be negative. This behavior is invalid, check it 
before real run.
    
    ## How was this patch tested?
    
    add new unit test.
    
    Author: uncleGen <husty...@gmail.com>
    Author: dylon <husty...@gmail.com>
    
    Closes #17202 from uncleGen/SPARK-19861.
    
    (cherry picked from commit 30b18e69361746b4d656474374d8b486bb48a19e)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit a59cc369f57cfd4fc8f2a7177c9519731c71c63a
Author: Burak Yavuz <brkyvz@...>
Date:   2017-03-10T01:42:10Z

    [SPARK-19886] Fix reportDataLoss if statement in SS KafkaSource
    
    ## What changes were proposed in this pull request?
    
    Fix the `throw new IllegalStateException` if statement part.
    
    ## How is this patch tested
    
    Regression test
    
    Author: Burak Yavuz <brk...@gmail.com>
    
    Closes #17228 from brkyvz/kafka-cause-fix.
    
    (cherry picked from commit 82138e09b9ad8d9609d5c64d6c11244b8f230be7)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit f0d50fd547c247df06470d68cd5245e3027e89a2
Author: Tyson Condie <tcondie@...>
Date:   2017-03-10T07:02:13Z

    [SPARK-19891][SS] Await Batch Lock notified on stream execution exit
    
    ## What changes were proposed in this pull request?
    
    We need to notify the await batch lock when the stream exits early e.g., 
when an exception has been thrown.
    
    ## How was this patch tested?
    
    Current tests that throw exceptions at runtime will finish faster as a 
result of this update.
    
    zsxwing
    
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.
    
    Author: Tyson Condie <tcon...@gmail.com>
    
    Closes #17231 from tcondie/kafka-writer.
    
    (cherry picked from commit 501b7111997bc74754663348967104181b43319b)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 5a2ad4312dd00a450eac49ce53d70d9541e9e4cb
Author: Wenchen Fan <wenchen@...>
Date:   2017-03-11T00:14:22Z

    [SPARK-19893][SQL] should not run DataFrame set oprations with map type
    
    In spark SQL, map type can't be used in equality test/comparison, and 
`Intersect`/`Except`/`Distinct` do need equality test for all columns, we 
should not allow map type in `Intersect`/`Except`/`Distinct`.
    
    new regression test
    
    Author: Wenchen Fan <wenc...@databricks.com>
    
    Closes #17236 from cloud-fan/map.
    
    (cherry picked from commit fb9beda54622e0c3190c6504fc468fa4e50eeb45)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit e481a73819213e4a7919e14e979b79a65098224f
Author: Budde <budde@...>
Date:   2017-03-11T00:38:16Z

    [SPARK-19611][SQL] Introduce configurable table schema inference
    
    Add a new configuration option that allows Spark SQL to infer a 
case-sensitive schema from a Hive Metastore table's data files when a 
case-sensitive schema can't be read from the table properties.
    
    - Add spark.sql.hive.caseSensitiveInferenceMode param to SQLConf
    - Add schemaPreservesCase field to CatalogTable (set to false when schema 
can't
      successfully be read from Hive table props)
    - Perform schema inference in HiveMetastoreCatalog if schemaPreservesCase is
      false, depending on spark.sql.hive.caseSensitiveInferenceMode
    - Add alterTableSchema() method to the ExternalCatalog interface
    - Add HiveSchemaInferenceSuite tests
    - Refactor and move ParquetFileForamt.meregeMetastoreParquetSchema() as
      HiveMetastoreCatalog.mergeWithMetastoreSchema
    - Move schema merging tests from ParquetSchemaSuite to 
HiveSchemaInferenceSuite
    
    [JIRA for this change](https://issues.apache.org/jira/browse/SPARK-19611)
    
    The tests in ```HiveSchemaInferenceSuite``` should verify that schema 
inference is working as expected. ```ExternalCatalogSuite``` has also been 
extended to cover the new ```alterTableSchema()``` API.
    
    Author: Budde <bu...@amazon.com>
    
    Closes #17229 from budde/SPARK-19611-2.1.

commit f9833c66a2f11414357854dae00e9e2448869254
Author: uncleGen <hustyugm@...>
Date:   2017-03-12T08:29:37Z

    [DOCS][SS] fix structured streaming python example
    
    ## What changes were proposed in this pull request?
    
    - SS python example: `TypeError: 'xxx' object is not callable`
    - some other doc issue.
    
    ## How was this patch tested?
    
    Jenkins.
    
    Author: uncleGen <husty...@gmail.com>
    
    Closes #17257 from uncleGen/docs-ss-python.
    
    (cherry picked from commit e29a74d5b1fa3f9356b7af5dd7e3fce49bc8eb7d)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 8c460804698742b0405d6c7e8a1880472f436f9e
Author: uncleGen <hustyugm@...>
Date:   2017-03-13T00:46:31Z

    [SPARK-19853][SS] uppercase kafka topics fail when startingOffsets are 
SpecificOffsets
    
    When using the KafkaSource with Structured Streaming, consumer assignments 
are not what the user expects if startingOffsets is set to an explicit set of 
topics/partitions in JSON where the topic(s) happen to have uppercase 
characters. When StartingOffsets is constructed, the original string value from 
options is transformed toLowerCase to make matching on "earliest" and "latest" 
case insensitive. However, the toLowerCase JSON is passed to SpecificOffsets 
for the terminal condition, so topic names may not be what the user intended by 
the time assignments are made with the underlying KafkaConsumer.
    
    KafkaSourceProvider.scala:
    ```
    val startingOffsets = 
caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim.toLowerCase) 
match {
        case Some("latest") => LatestOffsets
        case Some("earliest") => EarliestOffsets
        case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json))
        case None => LatestOffsets
      }
    ```
    
    Thank cbowden for reporting.
    
    Jenkins
    
    Author: uncleGen <husty...@gmail.com>
    
    Closes #17209 from uncleGen/SPARK-19853.
    
    (cherry picked from commit 0a4d06a7c3db9fec2b6f050a631e8b59b0e9376e)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 454578257181b0ae8768f9d34fb64964b32530ce
Author: Herman van Hovell <hvanhovell@...>
Date:   2017-03-14T17:52:16Z

    [SPARK-19933][SQL] Do not change output of a subquery
    
    ## What changes were proposed in this pull request?
    The `RemoveRedundantAlias` rule can change the output attributes (the 
expression id's to be precise) of a query by eliminating the redundant alias 
producing them. This is no problem for a regular query, but can cause problems 
for correlated subqueries: The attributes produced by the subquery are used in 
the parent plan; changing them will break the parent plan.
    
    This PR fixes this by wrapping a subquery in a `Subquery` top level node 
when it gets optimized. The `RemoveRedundantAlias` rule now recognizes 
`Subquery` and makes sure that the output attributes of the `Subquery` node are 
retained.
    
    ## How was this patch tested?
    Added a test case to `RemoveRedundantAliasAndProjectSuite` and added a 
regression test to `SubquerySuite`.
    
    Author: Herman van Hovell <hvanhov...@databricks.com>
    
    Closes #17278 from hvanhovell/SPARK-19933.
    
    (cherry picked from commit e04c05cf41a125b0526f59f9b9e7fdf0b78b8b21)
    Signed-off-by: Herman van Hovell <hvanhov...@databricks.com>

commit a0ce845d9ad87753f676785301ab7ccd8ddd6368
Author: Wenchen Fan <wenchen@...>
Date:   2017-03-15T00:24:41Z

    [SPARK-19887][SQL] dynamic partition keys can be null or empty string
    
    When dynamic partition value is null or empty string, we should write the 
data to a directory like `a=__HIVE_DEFAULT_PARTITION__`, when we read the data 
back, we should respect this special directory name and treat it as null.
    
    This is the same behavior of impala, see 
https://issues.apache.org/jira/browse/IMPALA-252
    
    new regression test
    
    Author: Wenchen Fan <wenc...@databricks.com>
    
    Closes #17277 from cloud-fan/partition.
    
    (cherry picked from commit dacc382f0c918f1ca808228484305ce0e21c705e)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 80ebca62cbdb7d5c8606e95a944164ab1a943694
Author: Reynold Xin <rxin@...>
Date:   2017-03-15T12:07:20Z

    [SPARK-19944][SQL] Move SQLConf from sql/core to sql/catalyst (branch-2.1)
    
    ## What changes were proposed in this pull request?
    This patch moves SQLConf from sql/core to sql/catalyst. To minimize the 
changes, the patch used type alias to still keep CatalystConf (as a type alias) 
and SimpleCatalystConf (as a concrete class that extends SQLConf).
    
    Motivation for the change is that it is pretty weird to have SQLConf only 
in sql/core and then we have to duplicate config options that impact 
optimizer/analyzer in sql/catalyst using CatalystConf.
    
    This is a backport into branch-2.1 to minimize merge conflicts.
    
    ## How was this patch tested?
    N/A
    
    Author: Reynold Xin <r...@databricks.com>
    
    Closes #17301 from rxin/branch-2.1-conf.

commit 062254635a98da0b08f69dc7e8907079cfdce035
Author: hyukjinkwon <gurwls223@...>
Date:   2017-03-15T17:17:18Z

    [SPARK-19872] [PYTHON] Use the correct deserializer for RDD construction 
for coalesce/repartition
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to use the correct deserializer, `BatchedSerializer` for 
RDD construction for coalesce/repartition when the shuffle is enabled. 
Currently, it is passing `UTF8Deserializer` as is not `BatchedSerializer` from 
the copied one.
    
    with the file, `text.txt` below:
    
    ```
    a
    b
    
    d
    e
    f
    g
    h
    i
    j
    k
    l
    
    ```
    
    - Before
    
    ```python
    >>> sc.textFile('text.txt').repartition(1).collect()
    ```
    
    ```
    UTF8Deserializer(True)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../spark/python/pyspark/rdd.py", line 811, in collect
        return list(_load_from_socket(port, self._jrdd_deserializer))
      File ".../spark/python/pyspark/serializers.py", line 549, in load_stream
        yield self.loads(stream)
      File ".../spark/python/pyspark/serializers.py", line 544, in loads
        return s.decode("utf-8") if self.use_unicode else s
      File 
"/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py",
 line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: 
invalid start byte
    ```
    
    - After
    
    ```python
    >>> sc.textFile('text.txt').repartition(1).collect()
    ```
    
    ```
    [u'a', u'b', u'', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l', u'']
    ```
    
    ## How was this patch tested?
    
    Unit test in `python/pyspark/tests.py`.
    
    Author: hyukjinkwon <gurwls...@gmail.com>
    
    Closes #17282 from HyukjinKwon/SPARK-19872.
    
    (cherry picked from commit 7387126f83dc0489eb1df734bfeba705709b7861)
    Signed-off-by: Davies Liu <davies....@gmail.com>

commit 9d032d02c8988d221a6e4cb27e6ee31627ed8a8e
Author: windpiger <songjun@...>
Date:   2017-03-16T17:30:39Z

    [SPARK-19329][SQL][BRANCH-2.1] Reading from or writing to a datasource 
table with a non pre-existing location should succeed
    
    ## What changes were proposed in this pull request?
    
    This is a backport pr of https://github.com/apache/spark/pull/16672 into 
branch-2.1.
    
    ## How was this patch tested?
    Existing tests.
    
    Author: windpiger <song...@outlook.com>
    
    Closes #17317 from windpiger/backport-insertnotexists.

commit 4b977ff041681e73529e745c63a3f7c2b185df2b
Author: Xiao Li <gatorsmile@...>
Date:   2017-03-17T02:57:53Z

    [SPARK-19765][SPARK-18549][SPARK-19093][SPARK-19736][BACKPORT-2.1][SQL] 
Backport Three Cache-related PRs to Spark 2.1
    
    ### What changes were proposed in this pull request?
    
    Backport a few cache related PRs:
    
    ---
    [[SPARK-19093][SQL] Cached tables are not used in 
SubqueryExpression](https://github.com/apache/spark/pull/16493)
    
    Consider the plans inside subquery expressions while looking up cache 
manager to make
    use of cached data. Currently CacheManager.useCachedData does not consider 
the
    subquery expressions in the plan.
    
    ---
    [[SPARK-19736][SQL] refreshByPath should clear all cached plans with the 
specified path](https://github.com/apache/spark/pull/17064)
    
    Catalog.refreshByPath can refresh the cache entry and the associated 
metadata for all dataframes (if any), that contain the given data source path.
    
    However, CacheManager.invalidateCachedPath doesn't clear all cached plans 
with the specified path. It causes some strange behaviors reported in 
SPARK-15678.
    
    ---
    [[SPARK-19765][SPARK-18549][SQL] UNCACHE TABLE should un-cache all cached 
plans that refer to this table](https://github.com/apache/spark/pull/17097)
    
    When un-cache a table, we should not only remove the cache entry for this 
table, but also un-cache any other cached plans that refer to this table. The 
following commands trigger the table uncache: `DropTableCommand`, 
`TruncateTableCommand`, `AlterTableRenameCommand`, `UncacheTableCommand`, 
`RefreshTable` and `InsertIntoHiveTable`
    
    This PR also includes some refactors:
    - use java.util.LinkedList to store the cache entries, so that it's safer 
to remove elements while iterating
    - rename invalidateCache to recacheByPlan, which is more obvious about what 
it does.
    
    ### How was this patch tested?
    N/A
    
    Author: Xiao Li <gatorsm...@gmail.com>
    
    Closes #17319 from gatorsmile/backport-17097.

commit 710b5554e8a8a502f8a4fe9ce4b865b074646157
Author: Liwei Lin <lwlin7@...>
Date:   2017-03-17T17:41:17Z

    [SPARK-19721][SS][BRANCH-2.1] Good error message for version mismatch in 
log files
    
    ## Problem
    
    There are several places where we write out version identifiers in various 
logs for structured streaming (usually `v1`). However, in the places where we 
check for this, we throw a confusing error message.
    
    ## What changes were proposed in this pull request?
    
    This patch made two major changes:
    1. added a `parseVersion(...)` method, and based on this method, fixed the 
following places the way they did version checking (no other place needed to do 
this checking):
    ```
    HDFSMetadataLog
      - CompactibleFileStreamLog  ------------> fixed with this patch
        - FileStreamSourceLog  ---------------> inherited the fix of 
`CompactibleFileStreamLog`
        - FileStreamSinkLog  -----------------> inherited the fix of 
`CompactibleFileStreamLog`
      - OffsetSeqLog  ------------------------> fixed with this patch
      - anonymous subclass in KafkaSource  ---> fixed with this patch
    ```
    
    2. changed the type of `FileStreamSinkLog.VERSION`, 
`FileStreamSourceLog.VERSION` etc. from `String` to `Int`, so that we can 
identify newer versions via `version > 1` instead of `version != "v1"`
        - note this didn't break any backwards compatibility -- we are still 
writing out `"v1"` and reading back `"v1"`
    
    ## Exception message with this patch
    ```
    java.lang.IllegalStateException: Failed to read log file 
/private/var/folders/nn/82rmvkk568sd8p3p8tb33trw0000gn/T/spark-86867b65-0069-4ef1-b0eb-d8bd258ff5b8/0.
 UnsupportedLogVersion: maximum supported log version is v1, but encountered 
v99. The log file was produced by a newer version of Spark and cannot be read 
by this version. Please upgrade.
        at 
org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:202)
        at 
org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3$$anonfun$apply$mcV$sp$2.apply(OffsetSeqLogSuite.scala:78)
        at 
org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3$$anonfun$apply$mcV$sp$2.apply(OffsetSeqLogSuite.scala:75)
        at 
org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:133)
        at 
org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite.withTempDir(OffsetSeqLogSuite.scala:26)
        at 
org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3.apply$mcV$sp(OffsetSeqLogSuite.scala:75)
        at 
org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3.apply(OffsetSeqLogSuite.scala:75)
        at 
org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3.apply(OffsetSeqLogSuite.scala:75)
        at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
        at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
    ```
    
    ## How was this patch tested?
    
    unit tests
    
    Author: Liwei Lin <lwl...@gmail.com>
    
    Closes #17327 from lw-lin/good-msg-2.1.

commit 5fb70831bd3acf7e1d9933986ccce12c3872432b
Author: Shixiong Zhu <shixiong@...>
Date:   2017-03-17T18:12:23Z

    [SPARK-19986][TESTS] Make pyspark.streaming.tests.CheckpointTests more 
stable
    
    ## What changes were proposed in this pull request?
    
    Sometimes, CheckpointTests will hang on a busy machine because the 
streaming jobs are too slow and cannot catch up. I observed the scheduled delay 
was keeping increasing for dozens of seconds locally.
    
    This PR increases the batch interval from 0.5 seconds to 2 seconds to 
generate less Spark jobs. It should make 
`pyspark.streaming.tests.CheckpointTests` more stable. I also replaced `sleep` 
with `awaitTerminationOrTimeout` so that if the streaming job fails, it will 
also fail the test.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #17323 from zsxwing/SPARK-19986.
    
    (cherry picked from commit 376d782164437573880f0ad58cecae1cb5f212f2)
    Signed-off-by: Tathagata Das <tathagata.das1...@gmail.com>

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21763: Branch 2.1

Reply via email to