[GitHub] spark pull request #23198: Branch 2.4

2018-12-01 Thread edurekagithub
Github user edurekagithub closed the pull request at:

https://github.com/apache/spark/pull/23198


[GitHub] spark pull request #23198: Branch 2.4

2018-12-01 Thread edurekagithub
GitHub user edurekagithub opened a pull request:

https://github.com/apache/spark/pull/23198

Branch 2.4

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/edurekagithub/spark branch-2.4

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23198.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23198


commit b7efca7ece484ee85091b1b50bbc84ad779f9bfe
Author: Mario Molina 
Date:   2018-09-11T12:47:14Z

[SPARK-17916][SPARK-25241][SQL][FOLLOW-UP] Fix empty string being parsed as null when nullValue is set.

## What changes were proposed in this pull request?

In the PR, I propose a new CSV option, `emptyValue`, and an update to the SQL Migration Guide describing how to revert to the previous behavior, in which empty strings were not written at all. Since Spark 2.4, empty strings are saved as `""` to distinguish them from saved `null`s.

Closes #22234
Closes #22367

## How was this patch tested?

It was tested by `CSVSuite` and by new tests added in PR #22234.

Closes #22389 from MaxGekk/csv-empty-value-master.

Lead-authored-by: Mario Molina 
Co-authored-by: Maxim Gekk 
Signed-off-by: hyukjinkwon 
(cherry picked from commit c9cb393dc414ae98093c1541d09fa3c8663ce276)
Signed-off-by: hyukjinkwon 

commit 0b8bfbe12b8a368836d7ddc8445de18b7ee42cde
Author: Dongjoon Hyun 
Date:   2018-09-11T15:57:42Z

[SPARK-25389][SQL] INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields

## What changes were proposed in this pull request?

Like the `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY STORED AS` should not generate files with duplicate fields, because Spark cannot read those files back.

**INSERT OVERWRITE DIRECTORY USING**
```scala
scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet SELECT 'id', 'id2' id")
... ERROR InsertIntoDataSourceDirCommand: Failed to write to directory ...
org.apache.spark.sql.AnalysisException: Found duplicate column(s) when inserting into file:/tmp/parquet: `id`;
```

**INSERT OVERWRITE DIRECTORY STORED AS**
```scala
scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS parquet SELECT 'id', 'id2' id")
// It generates corrupted files
scala> spark.read.parquet("/tmp/parquet").show
18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data schema and the partition schema: `id`;
```
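For completeness, a hedged sketch of the expected behavior once this patch is in (paraphrased expectation, not captured output): the `STORED AS` variant should fail analysis up front, like the `USING` variant does:

```scala
scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS parquet SELECT 'id', 'id2' id")
// Expected after the fix: no files are written; analysis fails with
// something like:
// org.apache.spark.sql.AnalysisException: Found duplicate column(s) ...
```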

## How was this patch tested?

Passes Jenkins with the newly added test cases.

Closes #22378 from dongjoon-hyun/SPARK-25389.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 77579aa8c35b0d98bbeac3c828bf68a1d190d13e)
Signed-off-by: Dongjoon Hyun 

commit 4414e026097c74aadd252b541c9d3009cd7e9d09
Author: Gera Shegalov 
Date:   2018-09-11T16:28:32Z

[SPARK-25221][DEPLOY] Consistent trailing whitespace treatment of conf values

## What changes were proposed in this pull request?

Stop trimming the values of properties loaded from a file.
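
A small self-contained sketch (my own illustration, not code from the patch; the property key is just an example) of the behavior this relies on: `java.util.Properties` already preserves trailing whitespace in values, so the fix amounts to not re-trimming them after loading:

```scala
import java.io.StringReader
import java.util.Properties

// A conf value whose trailing space is significant, e.g. a prefix that is
// concatenated with more options downstream.
val text = "spark.executor.extraJavaOptions=-Dkey=value \n"

val props = new Properties()
props.load(new StringReader(text))

// Properties keeps the trailing space; the bug was an extra .trim applied
// to the loaded value afterwards, which this change removes.
val raw = props.getProperty("spark.executor.extraJavaOptions")
assert(raw == "-Dkey=value ")
```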

## How was this patch tested?

Added a unit test demonstrating the issue hit in production.

Closes #22213 from gerashegalov/gera/SPARK-25221.

Authored-by: Gera Shegalov 
Signed-off-by: Marcelo Vanzin 
(cherry picked from commit bcb9a8c83f4e6835af5dc51f1be7f964b8fa49a3)
Signed-off-by: Marcelo Vanzin 

commit 16127e844f8334e1152b2e3ed3d878ec8de13dfa
Author: Liang-Chi Hsieh 
Date:   2018-09-11T17:31:06Z

[SPARK-24889][CORE] Update block info when unpersisting RDDs

## What changes were proposed in this pull request?

Block info reported by executors is updated at certain points, such as when an RDD is cached. However, when RDDs are removed via unpersist, we never request a block info update, so the block info becomes stale.

We can fix this in a few ways:

1. Ask for a block info update when unpersisting.

This is the simplest option, but it changes driver-executor communication slightly.

2. Update block info when processing the RDD-unpersist event (see the sketch below).

A `SparkListenerUnpersistRDD` event is already sent when an RDD is unpersisted. When processing this event, we can update the block info for that RDD. This only changes the event-processing code.
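
A hedged sketch of option 2 (the class name and bookkeeping are simplified stand-ins, not the actual patch): handle the existing event in the listener that tracks block status and drop the stale entries there:

```scala
import scala.collection.mutable

import org.apache.spark.scheduler.{SparkListener, SparkListenerUnpersistRDD}

class BlockInfoListener extends SparkListener {
  // Simplified stand-in for the real storage-status bookkeeping:
  // rddId -> names of blocks we currently believe are cached.
  private val cachedBlocks = mutable.Map.empty[Int, Set[String]]

  override def onUnpersistRDD(event: SparkListenerUnpersistRDD): Unit = {
    // Drop the stale block info so status APIs and the UI stop
    // reporting the unpersisted RDD as cached.
    cachedBlocks.remove(event.rddId)
  }
}
```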