Thank you so much for doing this, Xiao!

On Wed, Apr 29, 2020 at 11:09 AM Xiao Li <lix...@databricks.com> wrote:

> Hi all,
>
> This is the bi-weekly Apache Spark digest from the Databricks OSS team.
> For each API/configuration/behavior change, an *[API] *tag is added in
> the title.
>
> CORE
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#70spark-30623core-spark-external-shuffle-allow-disable-of-separate-event-loop-group-66--33>[3.0][SPARK-30623][CORE]
> Spark external shuffle allow disable of separate event loop group (+66, -33)
> >
> <https://github.com/apache/spark/commit/0fe203e7032a326ff0c78f6660ab1b09409fe09d>
>
> PR#22173 <https://github.com/apache/spark/pull/22173> introduced a perf
> regression in shuffle, even when the feature flag
> spark.shuffle.server.chunkFetchHandlerThreadsPercent is disabled. To fix
> the perf regression, this PR refactors the related code to completely
> disable this feature by default.
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#70spark-31314core-revert-spark-29285-to-fix-shuffle-regression-caused-by-creating-temporary-file-eagerly-10--71>[3.0][SPARK-31314][CORE]
> Revert SPARK-29285 to fix shuffle regression caused by creating temporary
> file eagerly (+10, -71)>
> <https://github.com/apache/spark/commit/07c50784d34e10bbfafac7498c0b70c4ec08048a>
>
> PR#25962 <https://github.com/apache/spark/pull/25962> introduced a perf
> regression in shuffle, which may create empty files unnecessarily. This PR
> reverts it.
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#api80spark-29154core-update-spark-scheduler-for-stage-level-scheduling-704--218>[API][3.1][SPARK-29154][CORE]
> Update Spark scheduler for stage level scheduling (+704, -218)>
> <https://github.com/apache/spark/commit/474b1bb5c2bce2f83c4dd8e19b9b7c5b3aebd6c4>
>
> This PR updates the DAG scheduler to schedule tasks according to the
> resource profile. It is part of stage level scheduling.
> [API][3.1][SPARK-29153][CORE] Add ability to merge resource profiles
> within a stage with Stage Level Scheduling (+304, -15)>
> <https://github.com/apache/spark/commit/55dea9be62019d64d5d76619e1551956c8bb64d0>
>
> Add the ability to optionally merge resource profiles if they are
> specified on multiple RDDs within a stage. The feature is part of Stage
> Level Scheduling. A new config,
> spark.scheduler.resource.profileMergeConflicts, enables this feature; it
> is off by default (see the sketch after the config description below).
>
> spark.scheduler.resource.profileMergeConflicts (Default: false)
>
>    - If set to true, Spark will merge ResourceProfiles when different
>    profiles are specified in RDDs that get combined into a single stage. When
>    they are merged, Spark chooses the maximum of each resource and creates a
>    new ResourceProfile. The default of false results in Spark throwing an
>    exception if multiple different ResourceProfiles are found in RDDs going
>    into the same stage.
>
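> As a rough illustration only (not taken from the PR), here is a hedged
> Scala sketch of two RDDs with different profiles ending up in one stage;
> a live SparkContext sc is assumed, and names like profA/rdd1 are made up:
>
>   import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}
>
>   // Assumes spark.scheduler.resource.profileMergeConflicts=true was set on the SparkConf.
>   val profA = new ResourceProfileBuilder()
>     .require(new ExecutorResourceRequests().cores(4).memory("4g"))
>     .require(new TaskResourceRequests().cpus(2))
>     .build
>   val profB = new ResourceProfileBuilder()
>     .require(new ExecutorResourceRequests().cores(2).memory("8g"))
>     .require(new TaskResourceRequests().cpus(1))
>     .build
>
>   // union() is a narrow dependency, so both parents run in the same stage;
>   // with the merge config on, Spark would take the max of each resource.
>   val rdd1 = sc.parallelize(1 to 10).withResources(profA)
>   val rdd2 = sc.parallelize(11 to 20).withResources(profB)
>   rdd1.union(rdd2).count()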
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Apr-1-~-Apr-7,-2020#api80spark-31208core-add-an-experimental-api-cleanshuffledependencies-158--71>[API][3.1][SPARK-31208][CORE]
> Add an experimental API: cleanShuffleDependencies (+158, -71)>
> <https://github.com/apache/spark/commit/8f010bd0a81aa9c7a8d2b8037b6160ab3277cc74>
>
> Add a new experimental developer API RDD.cleanShuffleDependencies(blocking:
> Boolean) to allow explicitly cleaning up shuffle files. This could help
> dynamic scaling of the K8s backend, since the backend only recycles
> executors without shuffle files.
>
>   /**
>    * :: Experimental ::
>    * Removes an RDD's shuffles and it's non-persisted ancestors.
>    * When running without a shuffle service, cleaning up shuffle files enables downscaling.
>    * If you use the RDD after this call, you should checkpoint and materialize it first.
>    * If you are uncertain of what you are doing, please do not use this feature.
>    * Additional techniques for mitigating orphaned shuffle files:
>    *   * Tuning the driver GC to be more aggressive, so the regular context cleaner is triggered
>    *   * Setting an appropriate TTL for shuffle files to be auto cleaned
>    */
>   @Experimental
>   @DeveloperApi
>   @Since("3.1.0")
>   def cleanShuffleDependencies(blocking: Boolean = false): Unit
>
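> A minimal usage sketch (assuming a live SparkContext sc; the variable
> names are made up):
>
>   val ds = sc.parallelize(1 to 1000000)
>     .map(i => (i % 100, i))
>     .reduceByKey(_ + _)   // introduces a shuffle dependency
>   ds.count()
>
>   // Once the result is no longer needed, explicitly release the shuffle
>   // files so executors without shuffle data can be scaled down.
>   ds.cleanShuffleDependencies(blocking = true)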
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Apr-1-~-Apr-7,-2020#80spark-31179-fast-fail-the-connection-while-last-connection-failed-in-fast-fail-time-window-68--12>[3.1][SPARK-31179]
> Fast fail the connection while last connection failed in fast fail time
> window (+68, -12)>
> <https://github.com/apache/spark/commit/ec289252368127f8261eb6a2270362ba0b65db36>
>
> In TransportFactory, if a connection to the destination address fails,
> new connection requests created within a time window fail fast to avoid
> too many retries. The window size is set to 95% of the IO retry wait
> time (spark.shuffle.io.retryWait, whose default is 5 seconds).
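>
> For reference, a hedged SparkConf sketch (the settings are illustrative,
> not from the PR); with the default retry wait below, the fast-fail window
> described above would be about 0.95 * 5s = 4.75s:
>
>   import org.apache.spark.SparkConf
>
>   val conf = new SparkConf()
>     .set("spark.shuffle.io.retryWait", "5s")   // IO retry wait; default 5s
>     .set("spark.shuffle.io.maxRetries", "3")   // companion retry-count config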
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#sql>
> SQL
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#api70spark-25556spark-17636spark-31026spark-31060sql-nested-column-predicate-pushdown-for-parquet-852--480>[API][3.0][SPARK-25556][SPARK-17636][SPARK-31026][SPARK-31060][SQL]
> Nested Column Predicate Pushdown for Parquet (+852, -480)>
> <https://github.com/apache/spark/commit/cb0db213736de5c5c02b09a2d5c3e17254708ce1>
>
> This PR extends the data source framework to support filter pushdown with
> nested fields, and implement it in the Parquet data source.
>
> spark.sql.optimizer.nestedPredicatePushdown.enabled [Default: true]
>
>    - When true, Spark tries to push down predicates for nested columns
>    and/or names containing dots to data sources. Currently, Parquet
>    implements both optimizations while ORC only supports predicates for
>    names containing dots. The other data sources don't support this
>    feature yet.
>
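> A minimal sketch of the behavior (the path, case classes, and column
> names are made up); with the config enabled, the filter on the nested
> field is eligible for pushdown into the Parquet reader:
>
>   import spark.implicits._
>
>   spark.conf.set("spark.sql.optimizer.nestedPredicatePushdown.enabled", "true")
>
>   case class Address(city: String, zip: String)
>   case class Person(id: Long, addr: Address)
>
>   Seq(Person(1L, Address("SF", "94105")), Person(2L, Address("NY", "10001")))
>     .toDF()
>     .write.mode("overwrite").parquet("/tmp/people")
>
>   // The predicate on the nested field addr.city can be pushed down to Parquet.
>   spark.read.parquet("/tmp/people").filter($"addr.city" === "SF").show()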
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#70spark-30822sql-remove-semicolon-at-the-end-of-a-sql-query-14--7>[3.0][SPARK-30822][SQL]
> Remove semicolon at the end of a sql query (+14, -7)>
> <https://github.com/apache/spark/commit/44431d4b1a22c3db87d7e4a24df517d6d45905a8>
>
> This PR updates the SQL parser to ignore the trailing semicolons in the
> SQL queries. Now sql("select 1;") works.
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#api70spark-31086sql-add-back-the-deprecated-sqlcontext-methods-403--0>[API][3.0][SPARK-31086][SQL]
> Add Back the Deprecated SQLContext methods (+403, -0)>
> <https://github.com/apache/spark/commit/b7e4cc775b7eac68606d1f385911613f5139db1b>
>
> This PR adds back the following APIs whose maintenance costs are
> relatively small.
>
>    - SQLContext.applySchema
>    - SQLContext.parquetFile
>    - SQLContext.jsonFile
>    - SQLContext.jsonRDD
>    - SQLContext.load
>    - SQLContext.jdbc
>
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#api70spark-31087sql-add-back-multiple-removed-apis-458--15>[API][3.0][SPARK-31087][SQL]
> Add Back Multiple Removed APIs (+458, -15)>
> <https://github.com/apache/spark/commit/3884455780a214c620f309e00d5a083039746755>
>
> This PR adds back the following APIs whose maintenance costs are
> relatively small.
>
>    - functions.toDegrees/toRadians
>    - functions.approxCountDistinct
>    - functions.monotonicallyIncreasingId
>    - Column.!==
>    - Dataset.explode
>    - Dataset.registerTempTable
>    - SQLContext.getOrCreate, setActive, clearActive, constructors
>
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#api70spark-31088sql-add-back-hivecontext-and-createexternaltable-532--11>[API][3.0][SPARK-31088][SQL]
> Add back HiveContext and createExternalTable (+532, -11)>
> <https://github.com/apache/spark/commit/b9eafcb52658b7f5ec60bb4ebcc9da0fde94e105>
>
> This PR adds back the following APIs whose maintenance costs are
> relatively small.
>
>    - HiveContext
>    - createExternalTable APIs
>
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#api70spark-31147sql-forbid-char-type-in-non-hive-serde-tables-134--20>[API][3.0][SPARK-31147][SQL]
> Forbid CHAR type in non-Hive-Serde tables (+134, -20)>
> <https://github.com/apache/spark/commit/4f274a4de96def5c6f2c5c273a1c22c4f4d68376>
>
> Spark introduced the CHAR type for Hive compatibility, but it only works
> for Hive tables. The CHAR type is never documented and is treated as a
> STRING type for non-Hive tables, which violates the SQL standard and is
> very confusing. This PR forbids the CHAR type in non-Hive tables as it's
> not supported correctly.
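>
> For illustration (hedged; the exact error is not shown, and the Hive case
> assumes Hive support is enabled):
>
>   // Fails after this change: CHAR is forbidden for non-Hive-Serde (e.g. Parquet) tables.
>   spark.sql("CREATE TABLE t_parquet (c CHAR(5)) USING parquet")
>
>   // Still allowed: a Hive-Serde table, where CHAR has well-defined semantics.
>   spark.sql("CREATE TABLE t_hive (c CHAR(5)) STORED AS parquet")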
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#70spark-31170sql-spark-sql-cli-should-respect-hive-sitexml-and-sparksqlwarehousedir-159--66>[3.0][SPARK-31170][SQL]
> Spark SQL Cli should respect hive-site.xml and spark.sql.warehouse.dir
> (+159, -66)>
> <https://github.com/apache/spark/commit/8be16907c261657f83f5d5934bcd978d8dacf7ff>
>
> This PR fixes the Spark SQL CLI to respect the warehouse config
> (spark.sql.warehouse.dir or hive.metastore.warehouse.dir) and
> hive-site.xml.
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#api70spark-31201sql-add-an-individual-config-for-skewed-partition-threshold-21--4>[API][3.0][SPARK-31201][SQL]
> Add an individual config for skewed partition threshold (+21, -4)>
> <https://github.com/apache/spark/commit/05498af72e19b058b210815e1053f3fa9b0157d9>
>
> Skew join handling comes with an overhead: we need to read some data
> repeatedly. We should treat a partition as skewed if it's large enough so
> that it's beneficial to do so.
>
> Currently the size threshold is the advisory partition size, which is 64
> MB by default. This is not large enough for the skewed partition size
> threshold. Thus, a new conf is added.
>
> spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes (Default: 256
> MB)
>
>    - A partition is considered as skewed if its size in bytes is larger
>    than this threshold and also larger than
>    'spark.sql.adaptive.skewJoin.skewedPartitionFactor' multiplying the median
>    partition size. Ideally this config should be set larger than
>    'spark.sql.adaptive.advisoryPartitionSizeInBytes'
>
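> A hedged configuration sketch showing how the two thresholds relate (the
> values shown are just the defaults):
>
>   // Enable AQE and its skew-join handling.
>   spark.conf.set("spark.sql.adaptive.enabled", "true")
>   spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
>
>   // Advisory target size for post-shuffle partitions (64MB by default).
>   spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")
>
>   // Larger, separate threshold for declaring a partition skewed (256MB by default).
>   spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")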
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#70spark-31227sql-non-nullable-null-type-in-complex-types-should-not-coerce-to-nullable-type-73--29>[3.0][SPARK-31227][SQL]
> Non-nullable null type in complex types should not coerce to nullable type
> (+73, -29)>
> <https://github.com/apache/spark/commit/3bd10ce007832522e38583592b6f358e185cdb7d>
>
> This PR makes non-nullable null types in complex types no longer coerce
> to nullable types, to fix queries like concat(array(), array(1)).
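>
> A quick illustration of the query from the summary (the result shown is
> what the fix is expected to produce):
>
>   // Previously failed to type-coerce array<null> (non-nullable) with array<int>;
>   // after the fix it is expected to return [1].
>   spark.sql("SELECT concat(array(), array(1))").show()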
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#api66spark-31234sql-resetcommand-should-not-affect-static-sql-configuration-24--4>[API][2.4][SPARK-31234][SQL]
> ResetCommand should not affect static SQL Configuration (+24, -4)>
> <https://github.com/apache/spark/commit/44bd36ad7b315f4c7592cdc1edf04356fcd23645>
>
> Currently, ResetCommand clears all configurations, including runtime SQL
> configs, static SQL configs and spark context level configs. This PR fixes
> it to only clear the runtime SQL configs, as other configs are not supposed
> to change at runtime.
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#70spark-31238sql-rebase-dates-tofrom-julian-calendar-in-writeread-for-orc-datasource-153--9>[3.0][SPARK-31238][SQL]
> Rebase dates to/from Julian calendar in write/read for ORC datasource
> (+153, -9)>
> <https://github.com/apache/spark/commit/d72ec8574113f9a7e87f3d7ec56c8447267b0506>
>
> Spark 3.0 switches to the Proleptic Gregorian calendar, which is different
> from the ORC format. This PR rebases the datetime values when
> reading/writing ORC files, to adjust the calendar changes.
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#70spark-31297sql-speed-up-dates-rebasing-174--91>[3.0][SPARK-31297][SQL]
> Speed up dates rebasing (+174, -91)>
> <https://github.com/apache/spark/commit/bb0b416f0b3a2747a420b17d1bf659891bae3274>
>
> This PR replaces the current implementation of the
> rebaseGregorianToJulianDays() and rebaseJulianToGregorianDays() functions
> in DateTimeUtils with a new one based on the fact that the difference
> between the Proleptic Gregorian and the hybrid (Julian+Gregorian)
> calendars changed only 14 times over the entire supported range of valid
> dates [0001-01-01, 9999-12-31]. This brings about a 3x speedup, fixing
> the performance regression caused by the calendar switch.
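>
> A minimal sketch of the lookup idea only (the switch days and diffs below
> are made-up placeholders, not the real values used by the PR):
>
>   // Days (since the epoch) at which the calendar difference changes, paired
>   // with the difference (in days) that applies from that point on.
>   val switchDays: Array[Int] = Array(-719164, -682945, -646420, -365972)  // hypothetical
>   val diffs: Array[Int]      = Array(-2, -1, 0, 0)                        // hypothetical
>
>   // Rebasing a date then reduces to finding the applicable segment and adding
>   // its constant difference, instead of converting via local dates each time.
>   def rebaseDays(days: Int): Int = {
>     var i = switchDays.length - 1
>     while (i > 0 && days < switchDays(i)) i -= 1
>     days + diffs(i)
>   }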
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#70spark-31312sql-cache-class-instance-for-the-udf-instance-in-hivefunctionwrapper-203--52>[3.0][SPARK-31312][SQL]
> Cache Class instance for the UDF instance in HiveFunctionWrapper (+203, -52)
> >
> <https://github.com/apache/spark/commit/2a6aa8e87bec39f6bfec67e151ef8566b75caecd>
>
> This PR caches the Class instance for the UDF instance in
> HiveFunctionWrapper, to fix the case where a Hive simple UDF is
> transformed (the expression is copied) and evaluated later with another
> classloader (e.g., when the current thread context classloader has
> changed). Previously, Spark threw a ClassNotFoundException in this case.
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#80spark-29721sql-prune-unnecessary-nested-fields-from-generate-without-project-172--18>[3.1][SPARK-29721][SQL]
> Prune unnecessary nested fields from Generate without Project (+172, -18)>
> <https://github.com/apache/spark/commit/aa8776bb5912f695b5269a3d856111aa49419d1b>
>
> This patch prunes unnecessary nested fields from Generate which has no
> Project on top of it.
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#80spark-31253sql-add-metrics-to-aqe-shuffle-reader-229--106>[3.1][SPARK-31253][SQL]
> Add metrics to AQE shuffle reader (+229, -106)>
> <https://github.com/apache/spark/commit/34c7ec8e0cb395da50e5cbeee67463414dacd776>
>
> This PR adds SQL metrics to the AQE shuffle reader (
> CustomShuffleReaderExec), so that it's easier to debug AQE with SQL web
> UI.
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#70spark-30532-dataframestatfunctions-to-work-with-tablecolumn-syntax-31--8>[3.0][SPARK-30532]
> DataFrameStatFunctions to work with TABLE.COLUMN syntax (+31, -8)>
> <https://github.com/apache/spark/commit/22bb6b0fddb3ecd3ac0ad2b41a5024c86b8a6fc7>
>
> This PR makes approxQuantile, freqItems, cov and corr in
> DataFrameStatFunctions support fully qualified column names. For
> example, df2.as("t2").crossJoin(df1.as("t1")).stat.approxQuantile("t1.num",
> Array(0.1), 0.0) now works.
> [API][3.0][SPARK-31113][SQL] Add SHOW VIEWS command (+605, -10)>
> <https://github.com/apache/spark/commit/a28ed86a387b286745b30cd4d90b3d558205a5a7>
>
> Add the SHOW VIEWS command to list the metadata for views. The SHOW VIEWS
> statement returns all the views for an optionally specified database.
> Additionally, the output of this statement may be filtered by an optional
> matching pattern. If no database is specified, the views are returned
> from the current database. If the specified database is the global
> temporary view database, global temporary views are listed. Note that the
> command also lists local temporary views regardless of the given database.
>
> For example:
>
> -- List all views in default database
> SHOW VIEWS;
>   +-------------+------------+--------------+--+
>   | namespace   | viewName   | isTemporary  |
>   +-------------+------------+--------------+--+
>   | default     | sam        | false        |
>   | default     | sam1       | false        |
>   | default     | suj        | false        |
>   |             | temp2      | true         |
>   +-------------+------------+--------------+--+
>
> -- List all views from userdb database
> SHOW VIEWS FROM userdb;
>   +-------------+------------+--------------+--+
>   | namespace   | viewName   | isTemporary  |
>   +-------------+------------+--------------+--+
>   | userdb      | user1      | false        |
>   | userdb      | user2      | false        |
>   |             | temp2      | true         |
>   +-------------+------------+--------------+--+
>
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Apr-1-~-Apr-7,-2020#70spark-31224sql-add-view-support-to-show-create-table-151--133>[3.0][SPARK-31224][SQL]
> Add view support to SHOW CREATE TABLE (+151, -133)>
> <https://github.com/apache/spark/commit/d782a1c4565c401a02531db3b8aa3cb6fc698fb1>
>
> This PR fixes a regression introduced in Spark 3.0. Now, both SHOW
> CREATE TABLE and SHOW CREATE TABLE AS SERDE support views.
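>
> For example (the view name is hypothetical):
>
>   spark.sql("CREATE VIEW v_sales AS SELECT 1 AS id")
>   spark.sql("SHOW CREATE TABLE v_sales").show(truncate = false)
>   spark.sql("SHOW CREATE TABLE v_sales AS SERDE").show(truncate = false)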
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Apr-1-~-Apr-7,-2020#api70spark-31318sql-split-parquetavro-configs-for-rebasing-datestimestamps-in-read-and-in-write-65--33>[API][3.0][SPARK-31318][SQL]
> Split Parquet/Avro configs for rebasing dates/timestamps in read and in
> write (+65, -33)>
> <https://github.com/apache/spark/commit/c5323d2e8d74edddefcb0bb7e965d12de45abb18>
>
> The PR replaces the configs
> spark.sql.legacy.parquet.rebaseDateTime.enabled and
> spark.sql.legacy.avro.rebaseDateTime.enabled with the new configs
> spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled,
> spark.sql.legacy.parquet.rebaseDateTimeInRead.enabled,
> spark.sql.legacy.avro.rebaseDateTimeInWrite.enabled, and
> spark.sql.legacy.avro.rebaseDateTimeInRead.enabled. This lets users
> load dates/timestamps saved by Spark 2.4/3.0, and save Parquet/Avro
> files that are compatible with Spark 3.0/2.4, without rebasing both ways.
>
> spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled (Default: false)
>
>    - When true, rebase dates/timestamps from Proleptic Gregorian calendar
>    to the hybrid calendar (Julian + Gregorian) in writing. The rebasing is
>    performed by converting micros/millis/days to a local date/timestamp in the
>    source calendar, interpreting the resulted date/timestamp in the target
>    calendar, and getting the number of micros/millis/days since the epoch
>    1970-01-01 00:00:00Z.
>
> spark.sql.legacy.parquet.rebaseDateTimeInRead.enabled (Default: false)
>
>    - When true, rebase dates/timestamps from the hybrid calendar to the
>    Proleptic Gregorian calendar in read. The rebasing is performed by
>    converting micros/millis/days to a local date/timestamp in the source
>    calendar, interpreting the resulted date/timestamp in the target calendar,
>    and getting the number of micros/millis/days since the epoch 1970-01-01
>    00:00:00Z.
>
> spark.sql.legacy.avro.rebaseDateTimeInWrite.enabled (Default: false)
>
>    - When true, rebase dates/timestamps from Proleptic Gregorian calendar
>    to the hybrid calendar (Julian + Gregorian) in writing. The rebasing is
>    performed by converting micros/millis/days to a local date/timestamp in the
>    source calendar, interpreting the resulted date/timestamp in the target
>    calendar, and getting the number of micros/millis/days since the epoch
>    1970-01-01 00:00:00Z.
>
> spark.sql.legacy.avro.rebaseDateTimeInRead.enabled (Default: false)
>
>    - When true, rebase dates/timestamps from the hybrid calendar to the
>    Proleptic Gregorian calendar in read. The rebasing is performed by
>    converting micros/millis/days to a local date/timestamp in the source
>    calendar, interpreting the resulted date/timestamp in the target calendar,
>    and getting the number of micros/millis/days since the epoch 1970-01-01
>    00:00:00Z.
>
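> A hedged usage sketch of the split configs (the paths and the sample data
> are made up): write Parquet that Spark 2.4 can read as-is, then read files
> written by Spark 2.4 with rebasing applied on read only:
>
>   // An ancient date, the kind affected by the calendar rebasing.
>   val df = spark.sql("SELECT DATE '1000-01-01' AS d")
>
>   // Rebase only in write, so the files carry hybrid-calendar values that 2.4 expects.
>   spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", "true")
>   spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInRead.enabled", "false")
>   df.write.mode("overwrite").parquet("/tmp/dates_for_spark24")
>
>   // Rebase only in read, to correctly interpret files written by Spark 2.4.
>   spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", "false")
>   spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInRead.enabled", "true")
>   val old = spark.read.parquet("/tmp/dates_written_by_spark24")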
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Apr-1-~-Apr-7,-2020#70spark-31322sql-rename-queryplancollectinplanandsubqueries-to-collectwithsubqueries-5--5>[3.0][SPARK-31322][SQL]
> rename QueryPlan.collectInPlanAndSubqueries to collectWithSubqueries (+5,
> -5)>
> <https://github.com/apache/spark/commit/09f036a14cee4825edc73b463e1eebe85ff1c915>
>
> Rename QueryPlan.collectInPlanAndSubqueries [which was introduced in the
> unreleased Spark 3.0] to collectWithSubqueries. QueryPlan's APIs are
> internal, but they are the core of Catalyst, so it's better to make the
> API name clearer before it is released.
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Apr-1-~-Apr-7,-2020#70spark-31327sql-write-spark-version-into-avro-file-metadata-116--5>[3.0][SPARK-31327][SQL]
> Write Spark version into Avro file metadata (+116, -5)>
> <https://github.com/apache/spark/commit/6b1ca886c0066f4e10534336f3fce64cdebc79a5>
>
> Write Spark version into Avro file metadata. The version info is very
> useful for backward compatibility.
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Apr-1-~-Apr-7,-2020#70spark-31328sql-fix-rebasing-of-overlapped-local-timestamps-during-daylight-saving-time-120--90>[3.0][SPARK-31328][SQL]
> Fix rebasing of overlapped local timestamps during daylight saving time
> (+120, -90)>
> <https://github.com/apache/spark/commit/820bb9985a76a567b79ffb01bbd2d32788e1dba0>
>
> The PR fixes a bug of losing the DST [Daylight Saving Time] offset info
> when rebasing timestamps via local date-time. This bug could lead to
> DateTimeUtils.rebaseGregorianToJulianMicros() and
> DateTimeUtils.rebaseJulianToGregorianMicros() returning wrong results
> when handling the daylight saving offset.
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#ml>
> ML
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#80spark-31222ml-make-anovatest-sparsity-aware-173--112>[3.1][SPARK-31222][ML]
> Make ANOVATest Sparsity-Aware (+173, -112)>
> <https://github.com/apache/spark/commit/1dce6c1fd45de3d9cf911842f52494051692c48b>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#80spark-31223ml-set-seed-in-nprandom-to-regenerate-test-data-282--256>[3.1][SPARK-31223][ML]
> Set seed in np.random to regenerate test data (+282, -256)>
> <https://github.com/apache/spark/commit/d81df56f2dbd9757c87101fa32c28cf0cd96f278>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#80spark-31283ml-simplify-chisq-by-adding-a-common-method-68--93>[3.1][SPARK-31283][ML]
> Simplify ChiSq by adding a common method (+68, -93)>
> <https://github.com/apache/spark/commit/34d6b90449be43b865d06ecd3b4a7e363249ba9c>
> [API][3.1][SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR (+484, -36)
> >
> <https://github.com/apache/spark/commit/0d37f794ef9451199e1b757c1015fc7a8b3931a5>
>
> Add SparkR wrapper for FMClassifier.
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#python>
> PYTHON
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#66spark-31186spark-31441pysparksql-topandas-should-not-fail-on-duplicate-column-names-56--10->[2.4][SPARK-31186][SPARK-31441][PYSPARK][SQL]
> toPandas should not fail on duplicate column names (+56, -10)>
> <https://github.com/apache/spark/commit/559d3e4051500d5c49e9a7f3ac33aac3de19c9c6>
>  >
> <https://github.com/apache/spark/commit/87be3641eb3517862dd5073903d5b37275852066>
>
> This PR fixes the toPandas API to make it work on DataFrames with
> duplicate column names.
> [3.0][SPARK-30921][PYSPARK] Predicates on python udf should not be
> pushdown through Aggregate (+20, -2)>
> <https://github.com/apache/spark/commit/1f0287148977adb416001cb0988e919a2698c8e0>
>
> This PR fixes a regression in Spark 3.0, in which predicates using
> Python UDFs were pushed down through Aggregate. Since Python UDFs
> can't be evaluated in a Filter, the "predicate pushdown through
> Aggregate" rule should skip predicates that contain Python UDFs.
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Apr-1-~-Apr-7,-2020#80spark-31308pyspark-merging-pyfiles-to-files-argument-for-non-pyspark-applications-6--4>[3.1][SPARK-31308][PYSPARK]
> Merging pyFiles to files argument for Non-PySpark applications (+6, -4)>
> <https://github.com/apache/spark/commit/20fc6fa8398b9dc47b9ae7df52133a306f89b25f>
>
> Add Python dependencies in SparkSubmit even if the application is not a
> Python application. Some Spark applications (e.g., the Livy remote
> SparkContext application, which is actually an embedded REPL for
> Scala/Python/R) require not only jar dependencies but also Python
> dependencies.
>
> UI
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Apr-1-~-Apr-7,-2020#api80spark-31325sqlweb-ui-control-the-plan-explain-mode-in-the-events-of-sql-listeners-via-sqlconf-100--3>[API][3.1][SPARK-31325][SQL][WEB
> UI] Control the plan explain mode in the events of SQL listeners via
> SQLConf (+100, -3)>
> <https://github.com/apache/spark/commit/d98df7626b8a88cb9a1fee4f454b19333a9f3ced>
>
> Add a new config spark.sql.ui.explainMode to control the query explain
> mode used in the Spark SQL UI. The default value of this config is
> formatted, which displays the query plan details in a formatted
> presentation.
>
> spark.sql.ui.explainMode (default: "formatted")
>
>    - Configures the query explain mode used in the Spark SQL UI. The
>    value can be 'simple', 'extended', 'codegen', 'cost', or 'formatted'.
>
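> A minimal sketch of switching the UI explain mode at runtime (hedged;
> only the configuration call is shown):
>
>   // Keep the default formatted plans in the SQL tab...
>   spark.conf.set("spark.sql.ui.explainMode", "formatted")
>
>   // ...or switch to the verbose extended mode when debugging analysis issues.
>   spark.conf.set("spark.sql.ui.explainMode", "extended")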
>
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Apr-1-~-Apr-7,-2020#r>
> R
> <https://github.com/databricks/runtime/wiki/OSS-Digest-Apr-1-~-Apr-7,-2020#api70spark-31290r-add-back-the-deprecated-r-apis-200--4>[API][3.0][SPARK-31290][R]
> Add back the deprecated R APIs (+200, -4)>
> <https://github.com/apache/spark/commit/fd0b2281272daba590c6bb277688087d0b26053f>
>
> This PR is to add back the following R APIs which were removed in the
> unreleased Spark 3.0:
>
>    - sparkR.init
>    - sparkRSQL.init
>    - sparkRHive.init
>    - registerTempTable
>    - createExternalTable
>    - dropTempTable
>
>
>
