[GitHub] spark pull request #19463: Cleanup comment in RDDSuite test

2017-10-10 Thread sohum2002
Github user sohum2002 closed the pull request at:

https://github.com/apache/spark/pull/19463


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19463: Cleanup comment in RDDSuite test

2017-10-09 Thread sohum2002
Github user sohum2002 commented on the issue:

https://github.com/apache/spark/pull/19463
  
I just added "Removed one comment from RDDSuite." to the PR description. 
Will this suffice?


---




[GitHub] spark pull request #19463: Cleanup comment in RDDSuite test

2017-10-09 Thread sohum2002
GitHub user sohum2002 opened a pull request:

https://github.com/apache/spark/pull/19463

Cleanup comment in RDDSuite test

## What changes were proposed in this pull request?

There were no changes proposed in this pull request.

## How was this patch tested?

There were no tests in this pull request.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sohum2002/spark cleanup-RDDSuite-test

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19463.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19463


commit c83ab1e5c51311ecb293e47e9c9694a9a49cfbaa
Author: Sachathamakul, Patrachai (Agoda) <patrachai.sachathama...@agoda.com>
Date:   2017-10-10T03:14:27Z

Cleanup comment in RDDSuite test




---




[GitHub] spark pull request #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten fun...

2017-10-09 Thread sohum2002
Github user sohum2002 closed the pull request at:

https://github.com/apache/spark/pull/19454


---




[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten functions ...

2017-10-09 Thread sohum2002
Github user sohum2002 commented on the issue:

https://github.com/apache/spark/pull/19454
  
Thank you all for your comments. I hope to improve in my future PRs. Cheers!


---




[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten functions ...

2017-10-09 Thread sohum2002
Github user sohum2002 commented on the issue:

https://github.com/apache/spark/pull/19454
  
@HyukjinKwon - Thank you for your comments and analysis of this PR. I will 
also try to improve the `flatMap(identity)` usage as mentioned by @srowen, and 
I will add a Python implementation.


---




[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...

2017-10-09 Thread sohum2002
Github user sohum2002 commented on the issue:

https://github.com/apache/spark/pull/19454
  
Would appreciate some help with the Python implementation of the `flatten` 
function, as I have never used PySpark. Could someone help me out?


---




[GitHub] spark pull request #19454: Added flatten functions for RDD and Dataset

2017-10-08 Thread sohum2002
GitHub user sohum2002 opened a pull request:

https://github.com/apache/spark/pull/19454

Added flatten functions for RDD and Dataset

## What changes were proposed in this pull request?
This PR creates a _flatten_ function in two places: RDD and Dataset 
classes. This PR resolves the following issues: SPARK-22152 and SPARK-18855.
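
The PR text above doesn't include the code itself; a minimal sketch of what such a `flatten` could look like, built on the existing `flatMap` API (the helper name and signature are illustrative assumptions, not the actual PR code):

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical helper (illustrative, not the actual PR code):
// flatten an RDD of sequences by delegating to the existing flatMap.
def flatten[U: ClassTag](rdd: RDD[Seq[U]]): RDD[U] =
  rdd.flatMap(identity)

// The same equivalence holds for plain Scala collections:
// Seq(Seq(1, 2), Seq(3)).flatMap(identity) == Seq(Seq(1, 2), Seq(3)).flatten
```

As reviewers note later in the thread, `rdd.flatMap(identity)` already achieves this, which is presumably why the proposal was eventually dropped.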

Author: Sohum Sachdev <sohum2...@hotmail.com>

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sohum2002/spark SPARK-18855_SPARK-18855

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19454.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19454


commit 075e7ef3f27af91c5190d039770cf15b08a66c81
Author: Sachathamakul, Patrachai (Agoda) <patrachai.sachathama...@agoda.com>
Date:   2017-10-08T10:24:44Z

Added flatten functions for RDD and Dataset




---




[GitHub] spark pull request #19453: Added selectAllColumns function in Dataset class

2017-10-07 Thread sohum2002
Github user sohum2002 closed the pull request at:

https://github.com/apache/spark/pull/19453


---




[GitHub] spark issue #19453: Added selectAllColumns function in Dataset class

2017-10-07 Thread sohum2002
Github user sohum2002 commented on the issue:

https://github.com/apache/spark/pull/19453
  
@srowen - This is a good point, let me close this PR.


---




[GitHub] spark pull request #19453: Added selectAllColumns function in Dataset class

2017-10-07 Thread sohum2002
GitHub user sohum2002 opened a pull request:

https://github.com/apache/spark/pull/19453

Added selectAllColumns function in Dataset class

The two proposed functions help select all the columns 
in a Dataset except for given columns.
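
A minimal sketch of such a helper (the name and signature are illustrative assumptions, not the actual PR code):

```scala
import org.apache.spark.sql.{DataFrame, functions => F}

// Hypothetical helper (illustrative, not the actual PR code):
// keep every column of df except the named ones.
def selectAllColumnsExcept(df: DataFrame, except: String*): DataFrame =
  df.select(df.columns.filterNot(except.contains).map(F.col): _*)

// Usage: selectAllColumnsExcept(df, "id", "tmp") keeps all other columns.
```

Note that `Dataset` already offers `df.drop("id", "tmp")` for the same effect, which is presumably the point @srowen raises below before the PR is closed.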

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sohum2002/spark dataset-selectAllColumns

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19453.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19453


commit 015ac9f31e6b59df41d5827fc1969570ae1b4af3
Author: Sachathamakul, Patrachai (Agoda) <patrachai.sachathama...@agoda.com>
Date:   2017-10-07T06:09:32Z

added selectAllColumns function in Dataset class




---




[GitHub] spark pull request #19446: Dataset optimization

2017-10-06 Thread sohum2002
Github user sohum2002 closed the pull request at:

https://github.com/apache/spark/pull/19446


---




[GitHub] spark pull request #19446: Dataset optimization

2017-10-06 Thread sohum2002
GitHub user sohum2002 opened a pull request:

https://github.com/apache/spark/pull/19446

Dataset optimization

The two proposed functions help select all the columns 
in a Dataset except for given columns.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sohum2002/spark dataset_optimization

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19446.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19446


commit 0e80ecae300f3e2033419b2d98da8bf092c105bb
Author: Wenchen Fan <wenc...@databricks.com>
Date:   2017-07-10T05:53:27Z

[SPARK-21100][SQL][FOLLOWUP] cleanup code and add more comments for 
Dataset.summary

## What changes were proposed in this pull request?

Some code cleanup and adding comments to make the code more readable. 
Changed the way to generate result rows, to be more clear.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenc...@databricks.com>

Closes #18570 from cloud-fan/summary.

commit 96d58f285bc98d4c2484150eefe7447db4784a86
Author: Eric Vandenberg <ericvandenb...@fb.com>
Date:   2017-07-10T06:40:20Z

[SPARK-21219][CORE] Task retry occurs on same executor due to race 
condition with blacklisting

## What changes were proposed in this pull request?

There's a race condition in the current TaskSetManager where a failed task 
is added for retry (addPendingTask), and can asynchronously be assigned to an 
executor *prior* to the blacklist state (updateBlacklistForFailedTask), the 
result is the task might re-execute on the same executor.  This is particularly 
problematic if the executor is shutting down since the retry task immediately 
becomes a lost task (ExecutorLostFailure).  Another side effect is that the 
actual failure reason gets obscured by the retry task which never actually 
executed.  There are sample logs showing the issue in the 
https://issues.apache.org/jira/browse/SPARK-21219

The fix is to change the ordering of the addPendingTask and 
updateBlacklistForFailedTask calls in TaskSetManager.handleFailedTask.
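
The reordering described above can be sketched with simplified stubs (illustrative only; method names come from the commit message, and the bodies are not the actual TaskSetManager source):

```scala
// Illustrative sketch of the fix's call ordering (not the real Spark code).
object BlacklistOrdering {
  private var blacklisted = Set.empty[(String, Int)] // (executorId, taskIndex)
  private var pending = List.empty[Int]

  def updateBlacklistForFailedTask(execId: String, taskIndex: Int): Unit =
    blacklisted += ((execId, taskIndex))

  def addPendingTask(taskIndex: Int): Unit =
    pending ::= taskIndex

  // Fix: record the blacklist state first, and only then make the task
  // eligible for rescheduling, so the scheduler cannot re-offer the retry
  // to the same executor in the window between the two calls.
  def handleFailedTask(execId: String, taskIndex: Int): Unit = {
    updateBlacklistForFailedTask(execId, taskIndex)
    addPendingTask(taskIndex)
  }
}
```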

## How was this patch tested?

Implemented a unit test that verifies the task is blacklisted before it is 
added to the pending tasks. Ran the unit test without the fix and it fails; 
ran it with the fix and it passes.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.

Author: Eric Vandenberg <ericvandenb...@fb.com>

Closes #18427 from ericvandenbergfb/blacklistFix.

commit c444d10868c808f4ae43becd5506bf944d9c2e9b
Author: Dongjoon Hyun <dongj...@apache.org>
Date:   2017-07-10T06:46:47Z

[MINOR][DOC] Remove obsolete `ec2-scripts.md`

## What changes were proposed in this pull request?

Since this document became obsolete, we had better remove it for Apache 
Spark 2.3.0. The original document was removed via SPARK-12735 in January 2016, 
and currently it's just a redirection page. The only reference on the Apache 
Spark website will go directly to the destination in 
https://github.com/apache/spark-website/pull/54.

## How was this patch tested?

N/A. This is a removal of documentation.

Author: Dongjoon Hyun <dongj...@apache.org>

Closes #18578 from dongjoon-hyun/SPARK-REMOVE-EC2.

commit 647963a26a2d4468ebd9b68111ebe68bee501fde
Author: Takeshi Yamamuro <yamam...@apache.org>
Date:   2017-07-10T07:58:34Z

[SPARK-20460][SQL] Make it more consistent to handle column name duplication

## What changes were proposed in this pull request?
This PR makes handling of column name duplication more consistent. In the 
current master, error handling differs when hitting column name 
duplication:
```
// json
scala> val schema = StructType(StructField("a", IntegerType) :: 
StructField("a", IntegerType) :: Nil)
scala> Seq("""{"a":1, "a":1}""").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
scala> spark.read.format("json").schema(schema).load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could 
be: a#12, a#13.;
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)

scala> spark.read.format("json").load("/tmp/data").show
org.apache.spark

[GitHub] spark pull request #19445: Dataset select all columns

2017-10-06 Thread sohum2002
Github user sohum2002 closed the pull request at:

https://github.com/apache/spark/pull/19445


---




[GitHub] spark pull request #19445: Dataset select all columns

2017-10-06 Thread sohum2002
GitHub user sohum2002 opened a pull request:

https://github.com/apache/spark/pull/19445

Dataset select all columns

The two proposed functions help select all the columns 
in a Dataset except for given columns.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sohum2002/spark dataset_selectAllColumns

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19445.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19445


commit d35a1268d784a268e6137eff54eb8f83c981a289
Author: Burak Yavuz <brk...@gmail.com>
Date:   2017-02-01T00:52:53Z

[SPARK-19378][SS] Ensure continuity of stateOperator and eventTime metrics 
even if there is no new data in trigger

In StructuredStreaming, if a new trigger was skipped because no new data 
arrived, we suddenly report nothing for the metrics `stateOperator`. We could 
however easily report the metrics from `lastExecution` to ensure continuity of 
metrics.

Regression test in `StreamingQueryStatusAndProgressSuite`

Author: Burak Yavuz <brk...@gmail.com>

Closes #16716 from brkyvz/state-agg.

(cherry picked from commit 081b7addaf9560563af0ce25912972e91a78cee6)
Signed-off-by: Tathagata Das <tathagata.das1...@gmail.com>

commit 61cdc8c7cc8cfc57646a30da0e0df874a14e3269
Author: Zheng RuiFeng <ruife...@foxmail.com>
Date:   2017-02-01T13:27:20Z

[SPARK-19410][DOC] Fix broken links in ml-pipeline and ml-tuning

## What changes were proposed in this pull request?
Fix broken links in ml-pipeline and ml-tuning

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruife...@foxmail.com>

Closes #16754 from zhengruifeng/doc_api_fix.

(cherry picked from commit 04ee8cf633e17b6bf95225a8dd77bf2e06980eb3)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit f946464155bb907482dc8d8a1b0964a925d04081
Author: Devaraj K <deva...@apache.org>
Date:   2017-02-01T20:55:11Z

[SPARK-19377][WEBUI][CORE] Killed tasks should have the status as KILLED

## What changes were proposed in this pull request?

Copying of the killed status was missing when building the newTaskInfo 
object, which drops unnecessary details to reduce memory usage. This 
patch adds copying of the killed status to the newTaskInfo object, 
correcting the display from the wrong status to KILLED in the Web UI.

## How was this patch tested?

Current behaviour of displaying tasks in stage UI page,

| Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 143 | 10 | 0 | SUCCESS | NODE_LOCAL | 6 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |
| 156 | 11 | 0 | SUCCESS | NODE_LOCAL | 5 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |

Web UI display after applying the patch,

| Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 143 | 10 | 0 | KILLED | NODE_LOCAL | 6 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |
| 156 | 11 | 0 | KILLED | NODE_LOCAL | 5 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |

Author: Devaraj K <deva...@apache.org>

Closes #16725 from devaraj-kavali/SPARK-19377.

(cherry picked from commit df4a27cc5cae8e251ba2a883bcc5f5ce9282f649)
Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 7c23bd49e826fc2b7f132ffac2e55a71905abe96
Author: Shixiong Zhu <shixi...@databricks.com>
Date:   2017-02-02T05:39:21Z

[SPARK-19432][CORE] Fix an unexpected failure when connecting timeout

## What changes were proposed in this pull request?

When connecting timeout, `ask` may fail with a confusing message:

```
17/02/01 23:15:19 INFO Worker: Connecting to master ...
java.lang.IllegalArgumentException: requirement failed: TransportClient has 
not yet