[GitHub] spark pull request #19446: Dataset optimization

2017-10-06 Thread sohum2002
Github user sohum2002 closed the pull request at:

https://github.com/apache/spark/pull/19446


---




[GitHub] spark pull request #19446: Dataset optimization

2017-10-06 Thread sohum2002
GitHub user sohum2002 opened a pull request:

https://github.com/apache/spark/pull/19446

Dataset optimization

The two proposed functions help select all the columns in a Dataset except for
the given columns.
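For illustration, a minimal sketch of the kind of helper such a change could
add; the object and method names below are assumptions made for this example,
not the PR's actual API:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: keep every column of a DataFrame except the named ones.
object DatasetColumnUtils {
  def selectExcept(df: DataFrame, excluded: String*): DataFrame = {
    // Columns not listed in `excluded` are kept, in their original order.
    val remaining = df.columns.filterNot(excluded.contains)
    df.select(remaining.map(df.col): _*)
  }
}

// Usage: DatasetColumnUtils.selectExcept(df, "id", "timestamp")
```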


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sohum2002/spark dataset_optimization

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19446.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19446


commit 0e80ecae300f3e2033419b2d98da8bf092c105bb
Author: Wenchen Fan 
Date:   2017-07-10T05:53:27Z

[SPARK-21100][SQL][FOLLOWUP] cleanup code and add more comments for 
Dataset.summary

## What changes were proposed in this pull request?

Some code cleanup and added comments to make the code more readable. The way
result rows are generated was also changed to be clearer.
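
For context, a minimal spark-shell style sketch of the Dataset.summary API this
commit cleans up; the data and column names are illustrative only:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("summary-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, 10.0), (2, 20.0), (3, 30.0)).toDF("id", "value")

// summary() computes count, mean, stddev, min, quartiles and max per column.
df.summary().show()

// A subset of statistics can also be requested explicitly.
df.summary("count", "mean", "max").show()
```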

## How was this patch tested?

existing tests

Author: Wenchen Fan 

Closes #18570 from cloud-fan/summary.

commit 96d58f285bc98d4c2484150eefe7447db4784a86
Author: Eric Vandenberg 
Date:   2017-07-10T06:40:20Z

[SPARK-21219][CORE] Task retry occurs on same executor due to race 
condition with blacklisting

## What changes were proposed in this pull request?

There is a race condition in the current TaskSetManager: a failed task is added
for retry (addPendingTask) and can asynchronously be assigned to an executor
*prior* to the blacklist state being updated (updateBlacklistForFailedTask), so
the task might re-execute on the same executor. This is particularly
problematic if the executor is shutting down, since the retry task immediately
becomes a lost task (ExecutorLostFailure). Another side effect is that the
actual failure reason gets obscured by the retry task, which never actually
executed. Sample logs showing the issue are attached to
https://issues.apache.org/jira/browse/SPARK-21219

The fix is to change the ordering of the addPendingTask and
updateBlacklistForFailedTask calls in TaskSetManager.handleFailedTask, as
sketched below.
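
A minimal, self-contained sketch of the ordering described above; the class
below only loosely mirrors TaskSetManager and is not the actual Spark
implementation:

```scala
import scala.collection.mutable

// Toy model of the scheduling state involved in the race; all names are illustrative.
class ToyTaskSetManager {
  private val pendingTasks = mutable.Queue.empty[Int]
  private val blacklistedExecs = mutable.Map.empty[Int, mutable.Set[String]]

  private def updateBlacklistForFailedTask(execId: String, taskIndex: Int): Unit =
    blacklistedExecs.getOrElseUpdate(taskIndex, mutable.Set.empty) += execId

  private def addPendingTask(taskIndex: Int): Unit =
    pendingTasks.enqueue(taskIndex)

  // Blacklist first, then re-queue: a concurrent resource offer from the same
  // executor can no longer pick up the retry before the blacklist is recorded.
  def handleFailedTask(taskIndex: Int, execId: String): Unit = {
    updateBlacklistForFailedTask(execId, taskIndex)
    addPendingTask(taskIndex)
  }

  def canRunOn(taskIndex: Int, execId: String): Boolean =
    !blacklistedExecs.get(taskIndex).exists(_.contains(execId))
}
```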

## How was this patch tested?

Implemented a unit test that verifies the task is blacklisted before it is
added to the pending task list. Ran the unit test without the fix and it fails;
ran the unit test with the fix and it passes.
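
Against the toy model sketched above (not the actual Spark test), the ordering
can be checked roughly like this:

```scala
// Hypothetical check: after a failure, the same executor is already blacklisted
// for that task by the time it is pending again.
val tsm = new ToyTaskSetManager
tsm.handleFailedTask(taskIndex = 0, execId = "exec-1")
assert(!tsm.canRunOn(0, "exec-1"))
```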

Please review http://spark.apache.org/contributing.html before opening a 
pull request.

Author: Eric Vandenberg 

Closes #18427 from ericvandenbergfb/blacklistFix.

commit c444d10868c808f4ae43becd5506bf944d9c2e9b
Author: Dongjoon Hyun 
Date:   2017-07-10T06:46:47Z

[MINOR][DOC] Remove obsolete `ec2-scripts.md`

## What changes were proposed in this pull request?

Since this document has become obsolete, we had better remove it for Apache
Spark 2.3.0. The original document was removed via SPARK-12735 in January 2016,
and it is currently just a redirection page. The only reference on the Apache
Spark website will go directly to the destination after
https://github.com/apache/spark-website/pull/54.

## How was this patch tested?

N/A. This is a removal of documentation.

Author: Dongjoon Hyun 

Closes #18578 from dongjoon-hyun/SPARK-REMOVE-EC2.

commit 647963a26a2d4468ebd9b68111ebe68bee501fde
Author: Takeshi Yamamuro 
Date:   2017-07-10T07:58:34Z

[SPARK-20460][SQL] Make it more consistent to handle column name duplication

## What changes were proposed in this pull request?
This PR makes the handling of column name duplication more consistent. In the
current master, error handling differs when column name duplication is hit:
```
// json
scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
scala> Seq("""{"a":1, "a":1}""").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
scala> spark.read.format("json").schema(schema).load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#12, a#13.;
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)

scala> spark.read.format("json").load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Duplicate column(s) : "a" found, cannot save to JSON format;
  at org.apache.spark.sql.execution.datasources.json.JsonDataSource.checkConstraints(JsonDataSource.scala:81)