GitHub user edurekagithub opened a pull request:
https://github.com/apache/spark/pull/23198
Branch 2.4
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise,
remove this)
Please review http://spark.apache.org/contributing.html before opening a
pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/edurekagithub/spark branch-2.4
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/23198.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #23198
commit b7efca7ece484ee85091b1b50bbc84ad779f9bfe
Author: Mario Molina
Date: 2018-09-11T12:47:14Z
[SPARK-17916][SPARK-25241][SQL][FOLLOW-UP] Fix empty string being parsed as
null when nullValue is set.
## What changes were proposed in this pull request?
In this PR, I propose a new CSV option `emptyValue` and an update to the SQL
Migration Guide describing how to revert to the previous behavior, in which empty
strings were not written at all. Since Spark 2.4, empty strings are saved as
`""` to distinguish them from saved `null`s.
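The null-vs-empty-string distinction above can be sketched as follows. This is a minimal illustration, not Spark's actual CSV writer; the defaults mirror the Spark 2.4 write-side behavior described in the commit message.

```scala
// A minimal sketch (not Spark's implementation): how `nullValue` and the new
// `emptyValue` options let a CSV writer distinguish nulls from empty strings.
def writeField(value: Option[String],
               nullValue: String = "",
               emptyValue: String = "\"\""): String =
  value match {
    case None     => nullValue   // a null field is written as `nullValue`
    case Some("") => emptyValue  // since 2.4, an empty string is written as ""
    case Some(s)  => s           // everything else is written verbatim
  }
```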
Closes #22234
Closes #22367
## How was this patch tested?
It was tested by `CSVSuite` and by new tests added in PR #22234.
Closes #22389 from MaxGekk/csv-empty-value-master.
Lead-authored-by: Mario Molina
Co-authored-by: Maxim Gekk
Signed-off-by: hyukjinkwon
(cherry picked from commit c9cb393dc414ae98093c1541d09fa3c8663ce276)
Signed-off-by: hyukjinkwon
commit 0b8bfbe12b8a368836d7ddc8445de18b7ee42cde
Author: Dongjoon Hyun
Date: 2018-09-11T15:57:42Z
[SPARK-25389][SQL] INSERT OVERWRITE DIRECTORY STORED AS should prevent
duplicate fields
## What changes were proposed in this pull request?
Like the `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY
STORED AS` should not generate files with duplicate fields, because Spark cannot
read those files back.
**INSERT OVERWRITE DIRECTORY USING**
```scala
scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet SELECT 'id', 'id2' id")
... ERROR InsertIntoDataSourceDirCommand: Failed to write to directory ...
org.apache.spark.sql.AnalysisException: Found duplicate column(s) when inserting into file:/tmp/parquet: `id`;
```
**INSERT OVERWRITE DIRECTORY STORED AS**
```scala
scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS parquet SELECT 'id', 'id2' id")
// It generates corrupted files
scala> spark.read.parquet("/tmp/parquet").show
18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data schema and the partition schema: `id`;
```
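The check this fix adds can be sketched roughly as below. This is an assumed shape for illustration, not the actual code (the real check lives in Spark's schema utilities): reject duplicate column names before any files are written, respecting case sensitivity.

```scala
// Hedged sketch of a duplicate-field check: fail fast before writing output
// files, since Spark cannot read files with duplicate columns back.
def checkDuplicateColumns(names: Seq[String], caseSensitive: Boolean = false): Unit = {
  val normalized = if (caseSensitive) names else names.map(_.toLowerCase)
  val dups = normalized.groupBy(identity).collect { case (n, vs) if vs.size > 1 => n }
  if (dups.nonEmpty) {
    throw new IllegalArgumentException(
      s"Found duplicate column(s): ${dups.map(n => s"`$n`").mkString(", ")}")
  }
}
```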
## How was this patch tested?
Pass the Jenkins with newly added test cases.
Closes #22378 from dongjoon-hyun/SPARK-25389.
Authored-by: Dongjoon Hyun
Signed-off-by: Dongjoon Hyun
(cherry picked from commit 77579aa8c35b0d98bbeac3c828bf68a1d190d13e)
Signed-off-by: Dongjoon Hyun
commit 4414e026097c74aadd252b541c9d3009cd7e9d09
Author: Gera Shegalov
Date: 2018-09-11T16:28:32Z
[SPARK-25221][DEPLOY] Consistent trailing whitespace treatment of conf
values
## What changes were proposed in this pull request?
Stop trimming the values of properties loaded from a file.
## How was this patch tested?
Added unit test demonstrating the issue hit in production.
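For context, the behavior the fix targets can be sketched as below (illustrative, not Spark's loader code): `java.util.Properties` itself already preserves trailing whitespace in values, since only the leading whitespace of a value is skipped, so a loader that wants consistent treatment should not call `.trim` on the loaded values afterwards.

```scala
import java.io.StringReader
import java.util.Properties

// Load properties from text without trimming values, relying on
// java.util.Properties, which keeps trailing whitespace in values intact.
def loadProps(text: String): Map[String, String] = {
  val props = new Properties()
  props.load(new StringReader(text))
  val names = props.stringPropertyNames().toArray(Array.empty[String])
  names.map(k => k -> props.getProperty(k)).toMap
}
```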
Closes #22213 from gerashegalov/gera/SPARK-25221.
Authored-by: Gera Shegalov
Signed-off-by: Marcelo Vanzin
(cherry picked from commit bcb9a8c83f4e6835af5dc51f1be7f964b8fa49a3)
Signed-off-by: Marcelo Vanzin
commit 16127e844f8334e1152b2e3ed3d878ec8de13dfa
Author: Liang-Chi Hsieh
Date: 2018-09-11T17:31:06Z
[SPARK-24889][CORE] Update block info when unpersist rdds
## What changes were proposed in this pull request?
We update the block info coming from executors at certain points, such as when
caching an RDD. However, when RDDs are removed by unpersisting, we don't ask for
the block info to be updated, so it becomes stale.
We can fix this in a few ways:
1. Ask executors to update block info when unpersisting.
This is the simplest option, but it changes driver-executor communication a bit.
2. Update block info when processing the RDD-unpersist event.
We already send a `SparkListenerUnpersistRDD` event when unpersisting an RDD. When
processing this event, we can update the RDD's block info. This only changes
event-processing code.
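Option 2 above can be sketched as follows. The names here are illustrative stand-ins, not Spark's actual listener code: when the unpersist event for an RDD is processed, that RDD's blocks are dropped from the tracked block info so the RDD no longer appears cached.

```scala
import scala.collection.mutable

// Stand-in for Spark's SparkListenerUnpersistRDD event.
case class UnpersistRDDEvent(rddId: Int)

class BlockInfoTracker {
  // block name -> size in bytes; an RDD's blocks are named s"rdd_${id}_${split}"
  val blocks = mutable.Map.empty[String, Long]

  // On an unpersist event, remove every tracked block belonging to that RDD.
  def onUnpersistRDD(event: UnpersistRDDEvent): Unit = {
    val prefix = s"rdd_${event.rddId}_"
    blocks.keys.filter(_.startsWith(prefix)).toList.foreach(blocks.remove)
  }
}
```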