GitHub user sohum2002 opened a pull request:
https://github.com/apache/spark/pull/19446
Dataset optimization
The two proposed functions help select all the columns in a Dataset
except for given columns.
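The core idea can be sketched without Spark: filter the full column list against the excluded names, then select what remains. The helper name `columnsExcept` below is hypothetical, not the API proposed in this PR.

```scala
// Sketch of the column-filtering logic behind a "select all except" helper.
// Plain Scala; in Spark this result would drive ds.select(kept.map(col): _*).
def columnsExcept(allColumns: Seq[String], excluded: Set[String]): Seq[String] =
  allColumns.filterNot(excluded.contains)

val kept = columnsExcept(Seq("id", "name", "age", "email"), Set("email"))
// kept keeps the original column order, minus the excluded names
```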
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sohum2002/spark dataset_optimization
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19446.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19446
commit 0e80ecae300f3e2033419b2d98da8bf092c105bb
Author: Wenchen Fan
Date: 2017-07-10T05:53:27Z
[SPARK-21100][SQL][FOLLOWUP] cleanup code and add more comments for
Dataset.summary
## What changes were proposed in this pull request?
Some code cleanup and adding comments to make the code more readable.
Changed the way result rows are generated to make it clearer.
## How was this patch tested?
existing tests
Author: Wenchen Fan
Closes #18570 from cloud-fan/summary.
commit 96d58f285bc98d4c2484150eefe7447db4784a86
Author: Eric Vandenberg
Date: 2017-07-10T06:40:20Z
[SPARK-21219][CORE] Task retry occurs on same executor due to race
condition with blacklisting
## What changes were proposed in this pull request?
There is a race condition in the current TaskSetManager: a failed task
is added for retry (addPendingTask) and can asynchronously be assigned to an
executor *prior* to the blacklist state being updated
(updateBlacklistForFailedTask); as a result, the task might re-execute on the
same executor. This is particularly problematic if the executor is shutting
down, since the retry task immediately becomes a lost task
(ExecutorLostFailure). Another side effect is that the actual failure reason
gets obscured by the retry task, which never actually executed. Sample logs
showing the issue are in
https://issues.apache.org/jira/browse/SPARK-21219
The fix is to change the ordering of the addPendingTask and
updateBlacklistForFailedTask calls in TaskSetManager.handleFailedTask.
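The reordering can be illustrated with a schematic, self-contained model. The names mirror TaskSetManager, but this is not the actual Spark code, only a sketch of why blacklisting before re-queuing closes the race.

```scala
import scala.collection.mutable

// Schematic model of the fix: record the blacklist entry for the failed
// (task, executor) pair BEFORE re-adding the task to the pending queue,
// so a scheduler draining the queue can never hand the retry back to the
// same executor. Not the real Spark implementation.
class MiniTaskSet {
  private val blacklisted = mutable.Set[(Int, String)]()
  private val pending = mutable.Queue[Int]()

  def handleFailedTask(task: Int, executor: String): Unit = {
    updateBlacklistForFailedTask(task, executor) // fix: blacklist first
    addPendingTask(task)                         // then re-queue for retry
  }

  private def updateBlacklistForFailedTask(task: Int, executor: String): Unit =
    blacklisted += ((task, executor))

  private def addPendingTask(task: Int): Unit = pending.enqueue(task)

  // A resource offer: assign a pending task only if this (task, executor)
  // pair is not blacklisted.
  def offer(executor: String): Option[Int] =
    pending.dequeueFirst(t => !blacklisted.contains((t, executor)))
}
```

With the blacklist written first, an offer from the failing executor finds nothing to run, while any other executor can pick up the retry.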
## How was this patch tested?
Implemented a unit test that verifies the task is blacklisted before it is
added to the pending task list. Ran the unit test without the fix and it fails.
Ran the unit test with the fix and it passes.
Author: Eric Vandenberg
Closes #18427 from ericvandenbergfb/blacklistFix.
commit c444d10868c808f4ae43becd5506bf944d9c2e9b
Author: Dongjoon Hyun
Date: 2017-07-10T06:46:47Z
[MINOR][DOC] Remove obsolete `ec2-scripts.md`
## What changes were proposed in this pull request?
Since this document is obsolete, it should be removed for Apache
Spark 2.3.0. The original document was removed via SPARK-12735 in January
2016, and the page is currently just a redirect. The only reference on the
Apache Spark website will go directly to the destination after
https://github.com/apache/spark-website/pull/54.
## How was this patch tested?
N/A. This is a removal of documentation.
Author: Dongjoon Hyun
Closes #18578 from dongjoon-hyun/SPARK-REMOVE-EC2.
commit 647963a26a2d4468ebd9b68111ebe68bee501fde
Author: Takeshi Yamamuro
Date: 2017-07-10T07:58:34Z
[SPARK-20460][SQL] Make it more consistent to handle column name duplication
## What changes were proposed in this pull request?
This PR makes the handling of column name duplication more consistent. In
the current master, error handling differs when column name duplication is
hit:
```
// json
scala> val schema = StructType(StructField("a", IntegerType) ::
StructField("a", IntegerType) :: Nil)
scala> Seq("""{"a":1,
"a":1}""").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
scala> spark.read.format("json").schema(schema).load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could
be: a#12, a#13.;
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
scala> spark.read.format("json").load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Duplicate column(s) : "a" found,
cannot save to JSON format;
at
org.apache.spark.sql.execution.datasources.json.JsonDataSource.checkConstraints(JsonDataSource.scala:81)