GitHub user Charele opened a pull request:

    https://github.com/apache/spark/pull/23107

    small question in Spillable class

    Apologies for my English; I just want to describe my question (I think there should be an "Issues" button here).
    
    In org.apache.spark.util.collection.Spillable, the code reads:
    private[this] var _elementsRead = 0
    ... ...
    shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
    
    The default value of numElementsForceSpillThreshold is Integer.MAX_VALUE,
    but _elementsRead is an Int. Shouldn't _elementsRead be a Long, i.e.:
    private[this] var _elementsRead: Long = 0
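
    For illustration only, a minimal standalone sketch (not the actual Spillable code) of why the force-spill comparison can never fire while the counter is an Int and the threshold keeps its Integer.MAX_VALUE default:

    ```scala
    // Standalone illustration, not the real Spillable implementation.
    object SpillCounterSketch {
      def main(args: Array[String]): Unit = {
        // Default threshold, as in the Spillable question above.
        val numElementsForceSpillThreshold: Long = Integer.MAX_VALUE

        // An Int counter can never hold a value greater than Int.MaxValue,
        // so this comparison is always false with the default threshold.
        val elementsReadInt: Int = Int.MaxValue
        println(elementsReadInt > numElementsForceSpillThreshold)  // false

        // A Long counter can keep counting past Int.MaxValue.
        val elementsReadLong: Long = Int.MaxValue.toLong + 1
        println(elementsReadLong > numElementsForceSpillThreshold) // true
      }
    }
    ```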

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.4

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23107.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #23107
    
----
commit 872bad161f1dbe6acd89b75f60053bfc8b621687
Author: Dilip Biswal <dbiswal@...>
Date:   2018-09-07T06:35:02Z

    [SPARK-25267][SQL][TEST] Disable ConvertToLocalRelation in the test cases 
of sql/core and sql/hive
    
    ## What changes were proposed in this pull request?
    In SharedSparkSession and TestHive, we need to disable the rule 
ConvertToLocalRelation for better test case coverage.
    ## How was this patch tested?
    Identify the failures after excluding "ConvertToLocalRelation" rule.
    
    Closes #22270 from dilipbiswal/SPARK-25267-final.
    
    Authored-by: Dilip Biswal <dbis...@us.ibm.com>
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>
    (cherry picked from commit 6d7bc5af454341f6d9bfc1e903148ad7ba8de6f9)
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit 95a48b909d103e59602e883d472cb03c7c434168
Author: fjh100456 <fu.jinhua6@...>
Date:   2018-09-07T16:28:33Z

    [SPARK-21786][SQL][FOLLOWUP] Add compressionCodec test for CTAS
    
    ## What changes were proposed in this pull request?
    Before Apache Spark 2.3, table properties were ignored when writing data to a Hive table (created with STORED AS PARQUET/ORC syntax), because the compression configurations were not passed to the FileFormatWriter via hadoopConf. That was fixed in #20087. However, for CTAS with USING PARQUET/ORC syntax, table properties were also ignored when convertMetastore was enabled, so the test case for CTAS could not be supported.
    
    Now that this has been fixed in #20522 , the test case should be enabled too.
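
    For context, an illustrative CTAS of the kind these tests exercise (table name and codec below are examples, not taken from the test suite):

    ```scala
    // Illustrative only: a Hive-style CTAS where the table-level compression
    // property is expected to reach the Parquet writer.
    spark.sql(
      """CREATE TABLE ctas_compression_demo
        |STORED AS PARQUET
        |TBLPROPERTIES ('parquet.compression' = 'SNAPPY')
        |AS SELECT 1 AS id
      """.stripMargin)
    ```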
    
    ## How was this patch tested?
    This only re-enables the test cases of the previous PR.
    
    Closes #22302 from fjh100456/compressionCodec.
    
    Authored-by: fjh100456 <fu.jinh...@zte.com.cn>
    Signed-off-by: Dongjoon Hyun <dongj...@apache.org>
    (cherry picked from commit 473f2fb3bfd0e51c40a87e475392f2e2c8f912dd)
    Signed-off-by: Dongjoon Hyun <dongj...@apache.org>

commit 80567fad4e3d8d4573d4095b1e460452e597d81f
Author: Lee Dongjin <dongjin@...>
Date:   2018-09-07T17:36:15Z

    [MINOR][SS] Fix kafka-0-10-sql trivials
    
    ## What changes were proposed in this pull request?
    
    Fix unused imports & outdated comments in the `kafka-0-10-sql` module. (Found while working on [SPARK-23539](https://github.com/apache/spark/pull/22282).)
    
    ## How was this patch tested?
    
    Existing unit tests.
    
    Closes #22342 from dongjinleekr/feature/fix-kafka-sql-trivials.
    
    Authored-by: Lee Dongjin <dong...@apache.org>
    Signed-off-by: Sean Owen <sean.o...@databricks.com>
    (cherry picked from commit 458f5011bd52851632c3592ac35f1573bc904d50)
    Signed-off-by: Sean Owen <sean.o...@databricks.com>

commit 904192ad18ff09cc5874e09b03447dd5f7754963
Author: WeichenXu <weichen.xu@...>
Date:   2018-09-08T16:09:14Z

    [SPARK-25345][ML] Deprecate public APIs from ImageSchema
    
    ## What changes were proposed in this pull request?
    
    Deprecate public APIs from ImageSchema.
    
    ## How was this patch tested?
    
    N/A
    
    Closes #22349 from WeichenXu123/image_api_deprecate.
    
    Authored-by: WeichenXu <weichen...@databricks.com>
    Signed-off-by: Xiangrui Meng <m...@databricks.com>
    (cherry picked from commit 08c02e637ac601df2fe890b8b5a7a049bdb4541b)
    Signed-off-by: Xiangrui Meng <m...@databricks.com>

commit 8f7d8a0977647dc96ab9259d306555bbe1c32873
Author: Dongjoon Hyun <dongjoon@...>
Date:   2018-09-08T17:21:55Z

    [SPARK-25375][SQL][TEST] Reenable qualified perm. function checks in 
UDFSuite
    
    ## What changes were proposed in this pull request?
    
    In Spark 2.0.0, SPARK-14335 added some [commented-out test coverage](https://github.com/apache/spark/pull/12117/files#diff-dd4b39a56fac28b1ced6184453a47358R177). This PR enables those tests because the feature has been supported since 2.0.0.
    
    ## How was this patch tested?
    
    Pass the Jenkins with re-enabled test coverage.
    
    Closes #22363 from dongjoon-hyun/SPARK-25375.
    
    Authored-by: Dongjoon Hyun <dongj...@apache.org>
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>
    (cherry picked from commit 26f74b7cb16869079aa7b60577ac05707101ee68)
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit a00a160e1e63ef2aaf3eaeebf2a3e5a5eb05d076
Author: gatorsmile <gatorsmile@...>
Date:   2018-09-09T13:25:19Z

    Revert [SPARK-10399] [SPARK-23879] [SPARK-23762] [SPARK-25317]
    
    ## What changes were proposed in this pull request?
    
    When running TPC-DS benchmarks on the 2.4 release, npoggi and winglungngai saw more than a 10% performance regression on the following queries: q67, q24a and q24b. After applying the PR https://github.com/apache/spark/pull/22338, the performance regression still existed. When the changes in https://github.com/apache/spark/pull/19222 were reverted, npoggi and winglungngai found the regression was resolved. Thus, this PR reverts the related changes to unblock the 2.4 release.
    
    In a future release, we can continue the investigation and find the root cause of the regression.
    
    ## How was this patch tested?
    
    The existing test cases
    
    Closes #22361 from gatorsmile/revertMemoryBlock.
    
    Authored-by: gatorsmile <gatorsm...@gmail.com>
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>
    (cherry picked from commit 0b9ccd55c2986957863dcad3b44ce80403eecfa1)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 6b7ea78aec73b8f24c2e1161254edd5ebb6c82bf
Author: WeichenXu <weichen.xu@...>
Date:   2018-09-09T14:49:13Z

    [MINOR][ML] Remove `BisectingKMeansModel.setDistanceMeasure` method
    
    ## What changes were proposed in this pull request?
    
    Remove the `BisectingKMeansModel.setDistanceMeasure` method.
    Setting this param on a `BisectingKMeansModel` is meaningless.
    
    ## How was this patch tested?
    
    N/A
    
    Closes #22360 from WeichenXu123/bkmeans_update.
    
    Authored-by: WeichenXu <weichen...@databricks.com>
    Signed-off-by: Sean Owen <sean.o...@databricks.com>
    (cherry picked from commit 88a930dfab56c15df02c7bb944444745c2921fa5)
    Signed-off-by: Sean Owen <sean.o...@databricks.com>

commit c1c1bda3cecd82a926526e5e5ee24d9909cb7e49
Author: Yuming Wang <yumwang@...>
Date:   2018-09-09T16:07:31Z

    [SPARK-25368][SQL] Incorrect predicate pushdown returns wrong result
    
    ## What changes were proposed in this pull request?
    How to reproduce:
    ```scala
    val df1 = spark.createDataFrame(Seq(
       (1, 1)
    )).toDF("a", "b").withColumn("c", lit(null).cast("int"))
    val df2 = df1.union(df1).withColumn("d", spark_partition_id).filter($"c".isNotNull)
    df2.show
    
    +---+---+----+---+
    |  a|  b|   c|  d|
    +---+---+----+---+
    |  1|  1|null|  0|
    |  1|  1|null|  1|
    +---+---+----+---+
    ```
    `filter($"c".isNotNull)` was transformed to `(null <=> c#10)` before 
https://github.com/apache/spark/pull/19201, but it is transformed to `(c#10 = 
null)` since https://github.com/apache/spark/pull/20155. This pr revert it to 
`(null <=> c#10)` to fix this issue.
    
    ## How was this patch tested?
    
    unit tests
    
    Closes #22368 from wangyum/SPARK-25368.
    
    Authored-by: Yuming Wang <yumw...@ebay.com>
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>
    (cherry picked from commit 77c996403d5c761f0dfea64c5b1cb7480ba1d3ac)
    Signed-off-by: gatorsmile <gatorsm...@gmail.com>

commit 0782dfa14c524131c04320e26d2b607777fe3b06
Author: seancxmao <seancxmao@...>
Date:   2018-09-10T02:22:47Z

    [SPARK-25175][SQL] Field resolution should fail if there is ambiguity for 
ORC native data source table persisted in metastore
    
    ## What changes were proposed in this pull request?
    Apache Spark doesn't create Hive tables with duplicated fields in either case-sensitive or case-insensitive mode. However, Spark can first create ORC files in case-sensitive mode whose field names collide case-insensitively, and a Hive table can then be created on that location. In this situation, field resolution should fail in case-insensitive mode; otherwise, we don't know which columns will be returned or filtered. Previously, SPARK-25132 fixed the same issue for Parquet.
    
    Here is a simple example:
    
    ```
    val data = spark.range(5).selectExpr("id as a", "id * 2 as A")
    spark.conf.set("spark.sql.caseSensitive", true)
    
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data")
    
    sql("CREATE TABLE orc_data_source (A LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'")
    spark.conf.set("spark.sql.caseSensitive", false)
    sql("select A from orc_data_source").show
    +---+
    |  A|
    +---+
    |  3|
    |  2|
    |  4|
    |  1|
    |  0|
    +---+
    ```
    
    See #22148 for more details about parquet data source reader.
    
    ## How was this patch tested?
    Unit tests added.
    
    Closes #22262 from seancxmao/SPARK-25175.
    
    Authored-by: seancxmao <seancx...@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongj...@apache.org>
    (cherry picked from commit a0aed475c54079665a8e5c5cd53a2e990a4f47b4)
    Signed-off-by: Dongjoon Hyun <dongj...@apache.org>

commit c9ca3594345610148ef5d993262d3090d5b2c658
Author: Yuming Wang <yumwang@...>
Date:   2018-09-10T05:47:19Z

    [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirCommand output schema in 
Parquet issue
    
    ## What changes were proposed in this pull request?
    
    How to reproduce:
    ```scala
    spark.sql("CREATE TABLE tbl(id long)")
    spark.sql("INSERT OVERWRITE TABLE tbl VALUES 4")
    spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
    spark.sql(s"INSERT OVERWRITE LOCAL DIRECTORY '/tmp/spark/parquet' " +
      "STORED AS PARQUET SELECT ID FROM view1")
    spark.read.parquet("/tmp/spark/parquet").schema
    scala> spark.read.parquet("/tmp/spark/parquet").schema
    res10: org.apache.spark.sql.types.StructType = 
StructType(StructField(id,LongType,true))
    ```
    The schema should be `StructType(StructField(ID,LongType,true))` because we `SELECT ID FROM view1`.
    
    This PR fixes that issue.
    
    ## How was this patch tested?
    
    unit tests
    
    Closes #22359 from wangyum/SPARK-25313-FOLLOW-UP.
    
    Authored-by: Yuming Wang <yumw...@ebay.com>
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>
    (cherry picked from commit f8b4d5aafd1923d9524415601469f8749b3d0811)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 67bc7ef7b70b6b654433bd5e56cff2f5ec6ae9bd
Author: gatorsmile <gatorsmile@...>
Date:   2018-09-10T11:18:00Z

    [SPARK-24849][SPARK-24911][SQL][FOLLOW-UP] Converting a value of StructType 
to a DDL string
    
    ## What changes were proposed in this pull request?
    Add the version number for the new APIs.
    
    ## How was this patch tested?
    N/A
    
    Closes #22377 from gatorsmile/followup24849.
    
    Authored-by: gatorsmile <gatorsm...@gmail.com>
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>
    (cherry picked from commit 6f6517837ba9934a280b11aba9d9be58bc131f25)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 5d98c31941471bdcdc54a68f55ddaaab48f82161
Author: Marco Gaido <marcogaido91@...>
Date:   2018-09-10T11:41:51Z

    [SPARK-25278][SQL] Avoid duplicated Exec nodes when the same logical plan 
appears in the query
    
    ## What changes were proposed in this pull request?
    
    In the Planner, we collect the placeholders which need to be substituted in the query execution plan and, once we plan them, we substitute each placeholder with the effective plan.
    
    In this second phase, we rely on the `==` comparison, i.e. the `equals` method. This means that if two placeholder plans - which are different instances - have the same attributes (so that they are equal according to `equals`), they are both substituted with the same new physical plan. So, in such a situation, the first time we substitute both of them with the first of the two newly generated plans, and the second time we substitute nothing.
    
    This is usually of no harm for the execution of the query itself, as the two plans are identical. But since they are now the same instance, the local variables are shared (which is unexpected). This causes issues for the metrics collected, as the same node is executed two times, so the metrics are wrongly accumulated twice.
    
    The PR proposes to use the `eq` method when checking which placeholder needs to be substituted; thus, in the previous situation, both of the two different physical nodes that are created (one for each time the logical plan appears in the query plan) are used, and the metrics are collected properly for each of them.
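
    To make the `==`/`eq` distinction concrete, a plain-Scala illustration (not the actual planner code):

    ```scala
    // Plain Scala: `==` delegates to equals (structural equality for case classes),
    // while `eq` is reference identity.
    case class PlanPlaceholder(attributes: Seq[String])

    val p1 = PlanPlaceholder(Seq("id", "name"))
    val p2 = PlanPlaceholder(Seq("id", "name")) // different instance, same attributes

    println(p1 == p2) // true  -> both placeholders would match the same substitute
    println(p1 eq p2) // false -> only the intended instance matches
    ```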
    
    ## How was this patch tested?
    
    added UT
    
    Closes #22284 from mgaido91/SPARK-25278.
    
    Authored-by: Marco Gaido <marcogaid...@gmail.com>
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>
    (cherry picked from commit 12e3e9f17dca11a2cddf0fb99d72b4b97517fb56)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit ffd036a6d13814ebcc332990be1e286939cc6abe
Author: Holden Karau <holden@...>
Date:   2018-09-10T18:01:51Z

    [SPARK-23672][PYTHON] Document support for nested return types in scalar 
with arrow udfs
    
    ## What changes were proposed in this pull request?
    
    Clarify docstring for Scalar functions
    
    ## How was this patch tested?
    
    Adds a unit test showing usage similar to wordcount; there's an existing unit test for an array of floats as well.
    
    Closes #20908 from 
holdenk/SPARK-23672-document-support-for-nested-return-types-in-scalar-with-arrow-udfs.
    
    Authored-by: Holden Karau <hol...@pigscanfly.ca>
    Signed-off-by: Bryan Cutler <cutl...@gmail.com>
    (cherry picked from commit da5685b5bb9ee7daaeb4e8f99c488ebd50c7aac3)
    Signed-off-by: Bryan Cutler <cutl...@gmail.com>

commit fb4965a41941f3a196de77a870a8a1f29c96dac0
Author: Marco Gaido <marcogaido91@...>
Date:   2018-09-11T06:16:56Z

    [SPARK-25371][SQL] struct() should allow being called with 0 args
    
    ## What changes were proposed in this pull request?
    
    SPARK-21281 introduced a check for the inputs of `CreateStructLike` to be 
non-empty. This means that `struct()`, which was previously considered valid, 
now throws an exception. This behavior change was introduced in 2.3.0. The change may break users' applications on upgrade, and it causes `VectorAssembler` to fail when an empty `inputCols` is defined.
    
    The PR removes the added check making `struct()` valid again.
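
    A minimal example of the call made valid again (assuming an active `SparkSession` named `spark`):

    ```scala
    import org.apache.spark.sql.functions.struct

    // Previously rejected with an exception since 2.3.0; valid again with this change.
    spark.range(1).select(struct().as("empty_struct")).printSchema()
    ```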
    
    ## How was this patch tested?
    
    added UT
    
    Closes #22373 from mgaido91/SPARK-25371.
    
    Authored-by: Marco Gaido <marcogaid...@gmail.com>
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>
    (cherry picked from commit 0736e72a66735664b191fc363f54e3c522697dba)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit b7efca7ece484ee85091b1b50bbc84ad779f9bfe
Author: Mario Molina <mmolimar@...>
Date:   2018-09-11T12:47:14Z

    [SPARK-17916][SPARK-25241][SQL][FOLLOW-UP] Fix empty string being parsed as 
null when nullValue is set.
    
    ## What changes were proposed in this pull request?
    
    In the PR, I propose a new CSV option, `emptyValue`, and an update to the SQL Migration Guide which describes how to revert to the previous behavior, where empty strings were not written at all. Since Spark 2.4, empty strings are saved as `""` to distinguish them from saved `null`s.
    
    Closes #22234
    Closes #22367
    
    ## How was this patch tested?
    
    It was tested by `CSVSuite` and new tests added in the PR #22234
    
    Closes #22389 from MaxGekk/csv-empty-value-master.
    
    Lead-authored-by: Mario Molina <mmoli...@gmail.com>
    Co-authored-by: Maxim Gekk <maxim.g...@databricks.com>
    Signed-off-by: hyukjinkwon <gurwls...@apache.org>
    (cherry picked from commit c9cb393dc414ae98093c1541d09fa3c8663ce276)
    Signed-off-by: hyukjinkwon <gurwls...@apache.org>

commit 0b8bfbe12b8a368836d7ddc8445de18b7ee42cde
Author: Dongjoon Hyun <dongjoon@...>
Date:   2018-09-11T15:57:42Z

    [SPARK-25389][SQL] INSERT OVERWRITE DIRECTORY STORED AS should prevent 
duplicate fields
    
    ## What changes were proposed in this pull request?
    
    Like `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY 
STORED AS` should not generate files with duplicate fields because Spark cannot 
read those files back.
    
    **INSERT OVERWRITE DIRECTORY USING**
    ```scala
    scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet 
SELECT 'id', 'id2' id")
    ... ERROR InsertIntoDataSourceDirCommand: Failed to write to directory ...
    org.apache.spark.sql.AnalysisException: Found duplicate column(s) when 
inserting into file:/tmp/parquet: `id`;
    ```
    
    **INSERT OVERWRITE DIRECTORY STORED AS**
    ```scala
    scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS 
parquet SELECT 'id', 'id2' id")
    // It generates corrupted files
    scala> spark.read.parquet("/tmp/parquet").show
    18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data 
schema and the partition schema: `id`;
    ```
    
    ## How was this patch tested?
    
    Pass the Jenkins with newly added test cases.
    
    Closes #22378 from dongjoon-hyun/SPARK-25389.
    
    Authored-by: Dongjoon Hyun <dongj...@apache.org>
    Signed-off-by: Dongjoon Hyun <dongj...@apache.org>
    (cherry picked from commit 77579aa8c35b0d98bbeac3c828bf68a1d190d13e)
    Signed-off-by: Dongjoon Hyun <dongj...@apache.org>

commit 4414e026097c74aadd252b541c9d3009cd7e9d09
Author: Gera Shegalov <gera@...>
Date:   2018-09-11T16:28:32Z

    [SPARK-25221][DEPLOY] Consistent trailing whitespace treatment of conf 
values
    
    ## What changes were proposed in this pull request?
    
    Stop trimming values of properties loaded from a file
    
    ## How was this patch tested?
    
    Added unit test demonstrating the issue hit in production.
    
    Closes #22213 from gerashegalov/gera/SPARK-25221.
    
    Authored-by: Gera Shegalov <g...@apache.org>
    Signed-off-by: Marcelo Vanzin <van...@cloudera.com>
    (cherry picked from commit bcb9a8c83f4e6835af5dc51f1be7f964b8fa49a3)
    Signed-off-by: Marcelo Vanzin <van...@cloudera.com>

commit 16127e844f8334e1152b2e3ed3d878ec8de13dfa
Author: Liang-Chi Hsieh <viirya@...>
Date:   2018-09-11T17:31:06Z

    [SPARK-24889][CORE] Update block info when unpersist rdds
    
    ## What changes were proposed in this pull request?
    
    We update block info coming from executors at times such as when an RDD is cached. However, when removing RDDs by unpersisting them, we don't ask to update the block info, so it goes stale.
    
    We can fix this with a few options (see the sketch after this list):
    
    1. Ask to update block info when unpersisting
    
    This is the simplest, but it changes driver-executor communication a bit.
    
    2. Update block info when processing the event of unpersisting an RDD
    
    We send a `SparkListenerUnpersistRDD` event when unpersisting an RDD. When processing this event, we can update the block info of the RDD. This only changes event-processing code, so the risk seems lower.
    
    Currently this patch takes option 2 for lower risk. If we agree the first option has no risk, we can change to it.
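
    A toy sketch of where option 2 hooks in (this is not the actual patch; the bookkeeping map below is hypothetical):

    ```scala
    import scala.collection.mutable
    import org.apache.spark.scheduler.{SparkListener, SparkListenerUnpersistRDD}

    class BlockInfoSketchListener extends SparkListener {
      // Hypothetical bookkeeping: rddId -> names of cached blocks we track.
      private val trackedBlocks = mutable.Map.empty[Int, mutable.Set[String]]

      override def onUnpersistRDD(event: SparkListenerUnpersistRDD): Unit = {
        // On the unpersist event, drop the stale block info for that RDD.
        trackedBlocks.remove(event.rddId)
      }
    }
    ```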
    
    ## How was this patch tested?
    
    Unit tests.
    
    Closes #22341 from viirya/SPARK-24889.
    
    Authored-by: Liang-Chi Hsieh <vii...@gmail.com>
    Signed-off-by: Marcelo Vanzin <van...@cloudera.com>
    (cherry picked from commit 14f3ad20932535fe952428bf255e7eddd8fa1b58)
    Signed-off-by: Marcelo Vanzin <van...@cloudera.com>

commit 99b37a91871f8bf070d43080f1c58475548c99fd
Author: Sean Owen <sean.owen@...>
Date:   2018-09-11T19:46:03Z

    [SPARK-25398] Minor bugs from comparing unrelated types
    
    ## What changes were proposed in this pull request?
    
    Correct some comparisons between unrelated types to what they seem to have been trying to do.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Closes #22384 from srowen/SPARK-25398.
    
    Authored-by: Sean Owen <sean.o...@databricks.com>
    Signed-off-by: Sean Owen <sean.o...@databricks.com>
    (cherry picked from commit cfbdd6a1f5906b848c520d3365cc4034992215d9)
    Signed-off-by: Sean Owen <sean.o...@databricks.com>

commit 3a6ef8b7e2d17fe22458bfd249f45b5a5ce269ec
Author: Sean Owen <sean.owen@...>
Date:   2018-09-11T19:52:58Z

    Revert "[SPARK-23820][CORE] Enable use of long form of callsite in logs"
    
    This reverts commit e58dadb77ed6cac3e1b2a037a6449e5a6e7f2cec.

commit 0dbf1450f7965c27ce9329c7dad351ff8b8072dc
Author: Mukul Murthy <mukul.murthy@...>
Date:   2018-09-11T22:53:15Z

    [SPARK-25399][SS] Continuous processing state should not affect microbatch 
execution jobs
    
    ## What changes were proposed in this pull request?
    
    The leftover state from running a continuous processing streaming job 
should not affect later microbatch execution jobs. If a continuous processing 
job runs and the same thread gets reused for a microbatch execution job in the 
same environment, the microbatch job could get wrong answers because it can 
attempt to load the wrong version of the state.
    
    ## How was this patch tested?
    
    New and existing unit tests
    
    Closes #22386 from mukulmurthy/25399-streamthread.
    
    Authored-by: Mukul Murthy <mukul.mur...@gmail.com>
    Signed-off-by: Tathagata Das <tathagata.das1...@gmail.com>
    (cherry picked from commit 9f5c5b4cca7d4eaa30a3f8adb4cb1eebe3f77c7a)
    Signed-off-by: Tathagata Das <tathagata.das1...@gmail.com>

commit 40e4db0eb72be7640bd8b5b319ad4ba99c9dc846
Author: gatorsmile <gatorsmile@...>
Date:   2018-09-12T13:11:22Z

    [SPARK-25402][SQL] Null handling in BooleanSimplification
    
    ## What changes were proposed in this pull request?
    This PR fixes the null handling in BooleanSimplification. In that rule, there are two cases that do not properly handle null values: the optimization is not correct if either side is null. This PR fixes both.
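
    To make the three-valued-logic pitfall concrete (an illustrative expression, not one of the added test cases):

    ```scala
    // Under SQL three-valued logic, (x AND NOT x) evaluates to NULL when x is NULL,
    // so a rewrite to literal FALSE would silently change the result.
    spark.sql("SELECT (CAST(NULL AS BOOLEAN) AND NOT CAST(NULL AS BOOLEAN)) AS v").show()
    // expected (correct) result: null, not false
    ```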
    
    ## How was this patch tested?
    Added test cases
    
    Closes #22390 from gatorsmile/fixBooleanSimplification.
    
    Authored-by: gatorsmile <gatorsm...@gmail.com>
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>
    (cherry picked from commit 79cc59718fdf7785bdc37a26bb8df4c6151114a6)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 071babbab5a49b7106d61b0c9a18672bd67e1786
Author: Liang-Chi Hsieh <viirya@...>
Date:   2018-09-12T14:54:05Z

    [SPARK-25352][SQL] Perform ordered global limit when limit number is bigger 
than topKSortFallbackThreshold
    
    ## What changes were proposed in this pull request?
    
    We have an optimization on global limit to evenly distribute limit rows across all partitions. This optimization doesn't work for ordered results.
    
    For a query ending with sort + limit, in most cases it is performed by `TakeOrderedAndProjectExec`.
    
    But if the limit number is bigger than `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD`, a global limit is used instead. In that case, we need to perform an ordered global limit.
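
    An illustrative way to exercise both code paths (assumes an active `SparkSession` named `spark`; the threshold value below is arbitrary):

    ```scala
    import spark.implicits._

    val df = spark.range(1000).toDF("id")

    // Limit below the threshold: typically planned as TakeOrderedAndProjectExec.
    df.orderBy($"id".desc).limit(10).explain()

    // Lower the fallback threshold so the same query takes the global-limit path,
    // which with this patch must still respect the ordering.
    spark.conf.set("spark.sql.execution.topKSortFallbackThreshold", "5")
    df.orderBy($"id".desc).limit(10).explain()
    ```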
    
    ## How was this patch tested?
    
    Unit tests.
    
    Closes #22344 from viirya/SPARK-25352.
    
    Authored-by: Liang-Chi Hsieh <vii...@gmail.com>
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>
    (cherry picked from commit 2f422398b524eacc89ab58e423bb134ae3ca3941)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 4c1428fa2b29c371458977427561d2b4bb9daa5b
Author: Liang-Chi Hsieh <viirya@...>
Date:   2018-09-12T17:43:40Z

    [SPARK-25363][SQL] Fix schema pruning in where clause by ignoring 
unnecessary root fields
    
    ## What changes were proposed in this pull request?
    
    Schema pruning doesn't work if a nested column is used in the where clause.
    
    For example,
    ```
    sql("select name.first from contacts where name.first = 'David'")
    
    == Physical Plan ==
    *(1) Project [name#19.first AS first#40]
    +- *(1) Filter (isnotnull(name#19) && (name#19.first = David))
       +- *(1) FileScan parquet [name#19] Batched: false, Format: Parquet, 
PartitionFilters: [],
        PushedFilters: [IsNotNull(name)], ReadSchema: 
struct<name:struct<first:string,middle:string,last:string>>
    ```
    
    In the above query plan, the scan node reads the entire schema of the `name` column.
    
    This issue is reported by:
    https://github.com/apache/spark/pull/21320#issuecomment-419290197
    
    The cause is that we infer a root field from the expression `IsNotNull(name)`. However, for such an expression, we don't really use the nested fields of this root field, so we can ignore the unnecessary nested fields.
    
    ## How was this patch tested?
    
    Unit tests.
    
    Closes #22357 from viirya/SPARK-25363.
    
    Authored-by: Liang-Chi Hsieh <vii...@gmail.com>
    Signed-off-by: DB Tsai <d_t...@apple.com>
    (cherry picked from commit 3030b82c89d3e45a2e361c469fbc667a1e43b854)
    Signed-off-by: DB Tsai <d_t...@apple.com>

commit 15d2e9d7d2f0d5ecefd69bdc3f8a149670b05e79
Author: Wenchen Fan <wenchen@...>
Date:   2018-09-12T18:25:24Z

    [SPARK-24882][SQL] Revert [] improve data source v2 API from branch 2.4
    
    ## What changes were proposed in this pull request?
    
    As discussed on the dev list, we don't want to include https://github.com/apache/spark/pull/22009 in Spark 2.4, as it requires data source v2 users to change their implementations extensively, and they would need to change them again in the next release.
    
    ## How was this patch tested?
    
    existing tests
    
    Author: Wenchen Fan <wenc...@databricks.com>
    
    Closes #22388 from cloud-fan/revert.

commit 71f70130f1b2b4ec70595627f0a02a88e2c0e27d
Author: Michael Mior <mmior@...>
Date:   2018-09-13T01:45:25Z

    [SPARK-23820][CORE] Enable use of long form of callsite in logs
    
    This is a rework of #21433 to address some concerns there.
    
    Closes #22398 from michaelmior/long-callsite2.
    
    Authored-by: Michael Mior <mm...@uwaterloo.ca>
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>
    (cherry picked from commit ab25c967905ca0973fc2f30b8523246bb9244206)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 776dc42c1326764233a4466172330b74b98df7aa
Author: Maxim Gekk <max.gekk@...>
Date:   2018-09-13T01:51:49Z

    [SPARK-25387][SQL] Fix for NPE caused by bad CSV input
    
    ## What changes were proposed in this pull request?
    
    The PR fixes an NPE in `UnivocityParser` caused by malformed CSV input. In some cases, the `uniVocity` parser can return `null` for bad input. In the PR, I propose to check the result of parsing and not propagate the NPE to upper layers.
    
    ## How was this patch tested?
    
    I added a test which reproduces the issue and ran it via `CSVSuite`.
    
    Closes #22374 from MaxGekk/npe-on-bad-csv.
    
    Lead-authored-by: Maxim Gekk <max.g...@gmail.com>
    Co-authored-by: Maxim Gekk <maxim.g...@databricks.com>
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>
    (cherry picked from commit 083c9447671719e0bd67312e3d572f6160c06a4a)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 6f4d647e07ef527ef93c4fc849a478008a52bc80
Author: LantaoJin <jinlantao@...>
Date:   2018-09-13T01:57:34Z

    [SPARK-25357][SQL] Add metadata to SparkPlanInfo to dump more information 
like file path to event log
    
    ## What changes were proposed in this pull request?
    
    The metadata field was removed from SparkPlanInfo in #18600 . Correspondingly, a lot of metadata was also removed from the SparkListenerSQLExecutionStart event in the Spark event log. If we want to analyze the event log to get all input paths, we can no longer do so; the simpleString of the SparkPlanInfo JSON only displays 100 characters, which doesn't help.
    
    Before 2.3, the fragment of SparkListenerSQLExecutionStart in the event log looked like the snippet below (it contains the metadata field with the intact information):
    
>{"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart", 
Location: InMemoryFileIndex[hdfs://cluster1/sys/edw/test1/test2/test3/test4..., 
"metadata": {"Location": 
"InMemoryFileIndex[hdfs://cluster1/sys/edw/test1/test2/test3/test4/test5/snapshot/dt=20180904]","ReadSchema":"struct<snpsht_start_dt:date,snpsht_end_dt:date,am_ntlogin_name:string,am_first_name:string,am_last_name:string,isg_name:string,CRE_DATE:date,CRE_USER:string,UPD_DATE:timestamp,UPD_USER:string>"}
    
    After #18600, the metadata field was removed:
    
>{"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart", 
Location: InMemoryFileIndex[hdfs://cluster1/sys/edw/test1/test2/test3/test4...,
    
    So I add this field back to the SparkPlanInfo class, so that the metadata is logged to the event log again. Intact information in the event log is very useful for offline job analysis.
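
    A rough sketch of the kind of offline analysis this enables (the path is hypothetical, and the exact JSON layout of the event is an assumption here):

    ```scala
    // Rough sketch, illustrative only: read an event log as JSON lines and keep
    // the SQL execution start events, where the restored metadata (e.g. input
    // file locations) is nested under the plan info.
    import spark.implicits._

    val eventLogPath = "/tmp/spark-events/application_1234_0001" // hypothetical
    val events = spark.read.json(eventLogPath)

    events
      .where($"Event" === "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart")
      .show(truncate = false)
    ```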
    
    ## How was this patch tested?
    Unit test
    
    Closes #22353 from LantaoJin/SPARK-25357.
    
    Authored-by: LantaoJin <jinlan...@gmail.com>
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>
    (cherry picked from commit 6dc5921e66d56885b95c07e56e687f9f6c1eaca7)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit ae5c7bb204c52dd18cfb63e5c621537023e36539
Author: Sean Owen <sean.owen@...>
Date:   2018-09-13T03:19:43Z

    [SPARK-25238][PYTHON] lint-python: Fix W605 warnings for pycodestyle 2.4
    
    (This change is a subset of the changes needed for the JIRA; see 
https://github.com/apache/spark/pull/22231)
    
    ## What changes were proposed in this pull request?
    
    Use raw strings and simpler regex syntax consistently in Python, which also avoids warnings from pycodestyle about accidentally relying on Python's non-escaping of non-reserved chars in normal strings. Also, fix a few long lines.
    
    ## How was this patch tested?
    
    Existing tests, and some manual double-checking of the behavior of regexes 
in Python 2/3 to be sure.
    
    Closes #22400 from srowen/SPARK-25238.2.
    
    Authored-by: Sean Owen <sean.o...@databricks.com>
    Signed-off-by: hyukjinkwon <gurwls...@apache.org>
    (cherry picked from commit 08c76b5d39127ae207d9d1fff99c2551e6ce2581)
    Signed-off-by: hyukjinkwon <gurwls...@apache.org>

commit abb5196c7ef685e1027eb1b0b09f4559d3eba015
Author: Stavros Kontopoulos <stavros.kontopoulos@...>
Date:   2018-09-13T05:02:59Z

    [SPARK-25295][K8S] Fix executor names collision
    
    ## What changes were proposed in this pull request?
    Fixes the collision issue with Spark executor names in client mode; see SPARK-25295 for the details.
    It follows the cluster-mode naming convention: the app name is used as the prefix, and if that is not defined we use "spark" as the default prefix. E.g. `spark-pi-1536781360723-exec-1`, where spark-pi is the app name passed on the config side (or transformed if it contains illegal characters).
    
    Also fixes the issue with the Spark app name having spaces in cluster mode.
    If you run the Spark Pi test in client mode, it passes.
    The tricky part is that the user may set the app name:
    https://github.com/apache/spark/blob/3030b82c89d3e45a2e361c469fbc667a1e43b854/examples/src/main/scala/org/apache/spark/examples/SparkPi.scala#L30
    If I do:
    
    ```
    ./bin/spark-submit
    ...
     --deploy-mode cluster --name "spark pi"
    ...
    ```
    it will fail, as the app name is used as the prefix of the driver's pod name and cannot contain spaces (per k8s conventions).
    
    ## How was this patch tested?
    
    Manually, by running a Spark job in client mode.
    To reproduce:
    ```
    kubectl create -f service.yaml
    kubectl create -f pod.yaml
    ```
     service.yaml :
    ```
    kind: Service
    apiVersion: v1
    metadata:
      name: spark-test-app-1-svc
    spec:
      clusterIP: None
      selector:
        spark-app-selector: spark-test-app-1
      ports:
      - protocol: TCP
        name: driver-port
        port: 7077
        targetPort: 7077
      - protocol: TCP
        name: block-manager
        port: 10000
        targetPort: 10000
    ```
    pod.yaml:
    
    ```
    apiVersion: v1
    kind: Pod
    metadata:
      name: spark-test-app-1
      labels:
        spark-app-selector: spark-test-app-1
    spec:
      containers:
      - name: spark-test
        image: skonto/spark:k8s-client-fix
        imagePullPolicy: Always
        command:
          - 'sh'
          - '-c'
          -  "/opt/spark/bin/spark-submit
                  --verbose
                  --master k8s://https://kubernetes.default.svc
                  --deploy-mode client
                  --class org.apache.spark.examples.SparkPi
                  --conf spark.app.name=spark
                  --conf spark.executor.instances=1
                  --conf 
spark.kubernetes.container.image=skonto/spark:k8s-client-fix
                  --conf spark.kubernetes.container.image.pullPolicy=Always
                  --conf 
spark.kubernetes.authenticate.oauthTokenFile=/var/run/secrets/kubernetes.io/serviceaccount/token
                  --conf 
spark.kubernetes.authenticate.caCertFile=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                  --conf spark.executor.memory=500m
                  --conf spark.executor.cores=1
                  --conf spark.executor.instances=1
                  --conf spark.driver.host=spark-test-app-1-svc.default.svc
                  --conf spark.driver.port=7077
                  --conf spark.driver.blockManager.port=10000
                  
local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar 1000000"
    ```
    
    Closes #22405 from skonto/fix-k8s-client-mode-executor-names.
    
    Authored-by: Stavros Kontopoulos <stavros.kontopou...@lightbend.com>
    Signed-off-by: Yinan Li <y...@google.com>
    (cherry picked from commit 3e75a9fa24f8629d068b5fbbc7356ce2603fa58d)
    Signed-off-by: Yinan Li <y...@google.com>

----

