[GitHub] spark pull request #19463: Cleanup comment in RDDSuite test
Github user sohum2002 closed the pull request at: https://github.com/apache/spark/pull/19463 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19463: Cleanup comment in RDDSuite test
Github user sohum2002 commented on the issue: https://github.com/apache/spark/pull/19463 I just added "Removed one comment from RDDSuite." to the PR description. Will this suffice?
[GitHub] spark pull request #19463: Cleanup comment in RDDSuite test
GitHub user sohum2002 opened a pull request: https://github.com/apache/spark/pull/19463

Cleanup comment in RDDSuite test

## What changes were proposed in this pull request?
No code changes were proposed in this pull request; it only removes one comment from the RDDSuite test.

## How was this patch tested?
There were no new tests in this pull request.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sohum2002/spark cleanup-RDDSuite-test

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19463.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #19463

commit c83ab1e5c51311ecb293e47e9c9694a9a49cfbaa
Author: Sachathamakul, Patrachai (Agoda) <patrachai.sachathama...@agoda.com>
Date: 2017-10-10T03:14:27Z

Cleanup comment in RDDSuite test
[GitHub] spark pull request #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten fun...
Github user sohum2002 closed the pull request at: https://github.com/apache/spark/pull/19454
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten functions ...
Github user sohum2002 commented on the issue: https://github.com/apache/spark/pull/19454 Thank you all for your comments. I hope to improve in my future PRs. Cheers!
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten functions ...
Github user sohum2002 commented on the issue: https://github.com/apache/spark/pull/19454 @HyukjinKwon - Thank you for your comments and analysis of this PR. I will also try the `flatMap(identity)` improvement mentioned by @srowen, and will add a Python implementation.
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten functions...
Github user sohum2002 commented on the issue: https://github.com/apache/spark/pull/19454 Would appreciate some help in the Python implementation of the `flatten` function as I have never used pyspark. Could someone help me out?
[GitHub] spark pull request #19454: Added flatten functions for RDD and Dataset
GitHub user sohum2002 opened a pull request: https://github.com/apache/spark/pull/19454

Added flatten functions for RDD and Dataset

## What changes were proposed in this pull request?
This PR adds a _flatten_ function in two places: the RDD and Dataset classes. It resolves SPARK-22152 and SPARK-18855.

Author: Sohum Sachdev <sohum2...@hotmail.com>

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sohum2002/spark SPARK-18855_SPARK-18855

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19454.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #19454

commit 075e7ef3f27af91c5190d039770cf15b08a66c81
Author: Sachathamakul, Patrachai (Agoda) <patrachai.sachathama...@agoda.com>
Date: 2017-10-08T10:24:44Z

Added flatten functions for RDD and Dataset
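Since the thread later discusses implementing `flatten` in terms of `flatMap(identity)` and asks about a Python version, here is a minimal, dependency-free Python sketch of the semantics. Plain lists stand in for an RDD or Dataset of sequences; with pyspark the equivalent would be `rdd.flatMap(lambda x: x)`.

```python
from itertools import chain

# Plain nested lists stand in for an RDD[Seq[Int]] / Dataset[Seq[Int]].
nested = [[1, 2], [3], [], [4, 5]]

# flatten is flatMap with the identity function:
# each inner sequence is emitted element by element.
via_flat_map = [x for seq in nested for x in seq]
via_chain = list(chain.from_iterable(nested))

assert via_flat_map == via_chain == [1, 2, 3, 4, 5]
```

Empty inner sequences simply contribute nothing, which is why `flatMap(identity)` already covers the proposed behavior.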
[GitHub] spark pull request #19453: Added selectAllColumns function in Dataset class
Github user sohum2002 closed the pull request at: https://github.com/apache/spark/pull/19453
[GitHub] spark issue #19453: Added selectAllColumns function in Dataset class
Github user sohum2002 commented on the issue: https://github.com/apache/spark/pull/19453 @srowen - This is a good point, let me close this PR.
[GitHub] spark pull request #19453: Added selectAllColumns function in Dataset class
GitHub user sohum2002 opened a pull request: https://github.com/apache/spark/pull/19453

Added selectAllColumns function in Dataset class

The two proposed functions help select all the columns in a Dataset except for given columns.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sohum2002/spark dataset-selectAllColumns

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19453.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #19453

commit 015ac9f31e6b59df41d5827fc1969570ae1b4af3
Author: Sachathamakul, Patrachai (Agoda) <patrachai.sachathama...@agoda.com>
Date: 2017-10-07T06:09:32Z

added selectAllColumns function in Dataset class
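The heart of such a helper is just filtering the column list before a `select`. A minimal sketch in plain Python, with a hypothetical `select_all_columns_except` name (on a real Dataset the returned names would be passed to `df.select(...)`):

```python
def select_all_columns_except(columns, excluded):
    """Return the column names to keep, preserving their original order."""
    dropped = set(excluded)
    return [c for c in columns if c not in dropped]

# Keep every column except "email".
kept = select_all_columns_except(["id", "name", "email", "age"], ["email"])
assert kept == ["id", "name", "age"]
```

As @srowen's point in the follow-up suggests, this is thin enough that `Dataset.drop` already covers the use case, which is why the PR was closed.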
[GitHub] spark pull request #19446: Dataset optimization
Github user sohum2002 closed the pull request at: https://github.com/apache/spark/pull/19446
[GitHub] spark pull request #19446: Dataset optimization
GitHub user sohum2002 opened a pull request: https://github.com/apache/spark/pull/19446

Dataset optimization

The two proposed functions help select all the columns in a Dataset except for given columns.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sohum2002/spark dataset_optimization

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19446.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #19446

commit 0e80ecae300f3e2033419b2d98da8bf092c105bb
Author: Wenchen Fan <wenc...@databricks.com>
Date: 2017-07-10T05:53:27Z

[SPARK-21100][SQL][FOLLOWUP] cleanup code and add more comments for Dataset.summary

## What changes were proposed in this pull request?
Some code cleanup and added comments to make the code more readable. Changed the way result rows are generated, to be clearer.

## How was this patch tested?
Existing tests.

Author: Wenchen Fan <wenc...@databricks.com>
Closes #18570 from cloud-fan/summary.

commit 96d58f285bc98d4c2484150eefe7447db4784a86
Author: Eric Vandenberg <ericvandenb...@fb.com>
Date: 2017-07-10T06:40:20Z

[SPARK-21219][CORE] Task retry occurs on same executor due to race condition with blacklisting

## What changes were proposed in this pull request?
There is a race condition in the current TaskSetManager: a failed task is added for retry (addPendingTask) and can asynchronously be assigned to an executor *prior* to the blacklist state update (updateBlacklistForFailedTask), so the task might re-execute on the same executor. This is particularly problematic if the executor is shutting down, since the retried task immediately becomes a lost task (ExecutorLostFailure). Another side effect is that the actual failure reason gets obscured by the retried task, which never actually executed. Sample logs showing the issue are in https://issues.apache.org/jira/browse/SPARK-21219. The fix is to change the ordering of the addPendingTask and updateBlacklistForFailedTask calls in TaskSetManager.handleFailedTask.

## How was this patch tested?
Implemented a unit test that verifies the task is blacklisted before it is added to the pending tasks. The unit test fails without the fix and passes with it.

Author: Eric Vandenberg <ericvandenb...@fb.com>
Closes #18427 from ericvandenbergfb/blacklistFix.

commit c444d10868c808f4ae43becd5506bf944d9c2e9b
Author: Dongjoon Hyun <dongj...@apache.org>
Date: 2017-07-10T06:46:47Z

[MINOR][DOC] Remove obsolete `ec2-scripts.md`

## What changes were proposed in this pull request?
Since this document became obsolete, we had better remove it for Apache Spark 2.3.0. The original document was removed via SPARK-12735 in January 2016, and it is currently just a redirection page. The only reference on the Apache Spark website will go directly to the destination in https://github.com/apache/spark-website/pull/54.

## How was this patch tested?
N/A. This is a removal of documentation.

Author: Dongjoon Hyun <dongj...@apache.org>
Closes #18578 from dongjoon-hyun/SPARK-REMOVE-EC2.

commit 647963a26a2d4468ebd9b68111ebe68bee501fde
Author: Takeshi Yamamuro <yamam...@apache.org>
Date: 2017-07-10T07:58:34Z

[SPARK-20460][SQL] Make it more consistent to handle column name duplication

## What changes were proposed in this pull request?
This PR made it more consistent to handle column name duplication. In the current master, error handling differs when hitting column name duplication:

```
// json
scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
scala> Seq("""{"a":1, "a":1}""").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
scala> spark.read.format("json").schema(schema).load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#12, a#13.;
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
scala> spark.read.format("json").load("/tmp/data").show
org.apache.spark
```
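The consistency fix sketched above boils down to checking a schema for duplicate field names up front, rather than failing differently per data source. A hedged Python sketch of such a check (the function name is hypothetical, not Spark's actual API):

```python
from collections import Counter

def check_column_name_duplication(names, case_sensitive=True):
    """Raise if a schema's field names contain duplicates,
    optionally ignoring case (Spark's analyzer can resolve
    names case-insensitively depending on configuration)."""
    keys = names if case_sensitive else [n.lower() for n in names]
    duplicates = sorted(n for n, count in Counter(keys).items() if count > 1)
    if duplicates:
        raise ValueError("Found duplicate column(s): " + ", ".join(duplicates))

check_column_name_duplication(["a", "b"])  # ok: no duplicates
try:
    check_column_name_duplication(["a", "A"], case_sensitive=False)
except ValueError as err:
    print(err)  # Found duplicate column(s): a
```

Running the check once at read time yields one uniform error instead of the source-dependent `AnalysisException`s shown in the example.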
[GitHub] spark pull request #19445: Dataset select all columns
Github user sohum2002 closed the pull request at: https://github.com/apache/spark/pull/19445
[GitHub] spark pull request #19445: Dataset select all columns
GitHub user sohum2002 opened a pull request: https://github.com/apache/spark/pull/19445

Dataset select all columns

The two proposed functions help select all the columns in a Dataset except for given columns.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sohum2002/spark dataset_selectAllColumns

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19445.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #19445

commit d35a1268d784a268e6137eff54eb8f83c981a289
Author: Burak Yavuz <brk...@gmail.com>
Date: 2017-02-01T00:52:53Z

[SPARK-19378][SS] Ensure continuity of stateOperator and eventTime metrics even if there is no new data in trigger

In Structured Streaming, if a new trigger was skipped because no new data arrived, we suddenly report nothing for the `stateOperator` metrics. We could, however, easily report the metrics from `lastExecution` to ensure continuity of metrics.

Regression test in `StreamingQueryStatusAndProgressSuite`.

Author: Burak Yavuz <brk...@gmail.com>
Closes #16716 from brkyvz/state-agg.

(cherry picked from commit 081b7addaf9560563af0ce25912972e91a78cee6)
Signed-off-by: Tathagata Das <tathagata.das1...@gmail.com>

commit 61cdc8c7cc8cfc57646a30da0e0df874a14e3269
Author: Zheng RuiFeng <ruife...@foxmail.com>
Date: 2017-02-01T13:27:20Z

[SPARK-19410][DOC] Fix brokens links in ml-pipeline and ml-tuning

## What changes were proposed in this pull request?
Fix broken links in ml-pipeline and ml-tuning (`` -> ``).

## How was this patch tested?
Manual tests.

Author: Zheng RuiFeng <ruife...@foxmail.com>
Closes #16754 from zhengruifeng/doc_api_fix.

(cherry picked from commit 04ee8cf633e17b6bf95225a8dd77bf2e06980eb3)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit f946464155bb907482dc8d8a1b0964a925d04081
Author: Devaraj K <deva...@apache.org>
Date: 2017-02-01T20:55:11Z

[SPARK-19377][WEBUI][CORE] Killed tasks should have the status as KILLED

## What changes were proposed in this pull request?
Copying of the killed status was missing when building the newTaskInfo object (which drops unnecessary details to reduce memory usage). This patch adds copying of the killed status to the newTaskInfo object, which corrects the displayed status from a wrong status to KILLED in the Web UI.

## How was this patch tested?
Current behaviour of displaying tasks in the stage UI page:

| Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 143 | 10 | 0 | SUCCESS | NODE_LOCAL | 6 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |
| 156 | 11 | 0 | SUCCESS | NODE_LOCAL | 5 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |

Web UI display after applying the patch:

| Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 143 | 10 | 0 | KILLED | NODE_LOCAL | 6 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |
| 156 | 11 | 0 | KILLED | NODE_LOCAL | 5 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |

Author: Devaraj K <deva...@apache.org>
Closes #16725 from devaraj-kavali/SPARK-19377.

(cherry picked from commit df4a27cc5cae8e251ba2a883bcc5f5ce9282f649)
Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 7c23bd49e826fc2b7f132ffac2e55a71905abe96
Author: Shixiong Zhu <shixi...@databricks.com>
Date: 2017-02-02T05:39:21Z

[SPARK-19432][CORE] Fix an unexpected failure when connecting timeout

## What changes were proposed in this pull request?
When connecting times out, `ask` may fail with a confusing message:

```
17/02/01 23:15:19 INFO Worker: Connecting to master ...
java.lang.IllegalArgumentException: requirement failed: TransportClient has not yet
```