[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18924 **[Test build #82506 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82506/testReport)** for PR 18924 at commit [`a81dae5`](https://github.com/apache/spark/commit/a81dae574f2085ec390effd1b9b1962970f00239). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19082: [SPARK-21870][SQL] Split aggregation code into small fun...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19082 **[Test build #82502 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82502/testReport)** for PR 19082 at commit [`1880dfd`](https://github.com/apache/spark/commit/1880dfdfedbdef11d39cb092202a6bc7db95e374). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19082: [SPARK-21870][SQL] Split aggregation code into small fun...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19082 Merged build finished. Test PASSed.
[GitHub] spark issue #19082: [SPARK-21870][SQL] Split aggregation code into small fun...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19082 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82502/ Test PASSed.
[GitHub] spark issue #19370: [SPARK-18136] Fix setup of SPARK_HOME variable on Window...
Github user jsnowacki commented on the issue: https://github.com/apache/spark/pull/19370 @HyukjinKwon Commits squashed into one, as you requested.
[GitHub] spark issue #19370: [SPARK-18136] Fix setup of SPARK_HOME variable on Window...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19370 **[Test build #82510 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82510/testReport)** for PR 19370 at commit [`aec49a0`](https://github.com/apache/spark/commit/aec49a0f3027a7e2c0c83339232a37926db1d2dc).
[GitHub] spark issue #19370: [SPARK-18136] Fix setup of SPARK_HOME variable on Window...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19370 Yup, it looks like it's triggering fine - https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1822-master - although I wonder why the check mark does not appear. I think this is not specific to this PR but rather to AppVeyor itself, though.
[GitHub] spark issue #19440: [SPARK-21871][SQL] Fix infinite loop when bytecode size ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19440 Thanks! Merged to master.
[GitHub] spark issue #19442: [SPARK-8515][ML][WIP] Improve ML Attribute API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19442 **[Test build #82500 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82500/testReport)** for PR 19442 at commit [`de0aa76`](https://github.com/apache/spark/commit/de0aa76199141255258d9d5b12a0d31b1758c6f1). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19442: [SPARK-8515][ML][WIP] Improve ML Attribute API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19442 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82500/ Test FAILed.
[GitHub] spark issue #19442: [SPARK-8515][ML][WIP] Improve ML Attribute API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19442 Merged build finished. Test FAILed.
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18732 **[Test build #82501 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82501/testReport)** for PR 18732 at commit [`20fb1fe`](https://github.com/apache/spark/commit/20fb1fe9cbf033d73ecf2851f9cb1dc94f41fb3e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18732 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82501/ Test PASSed.
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18732 Merged build finished. Test PASSed.
[GitHub] spark pull request #19444: [SPARK-22214][SQL] Refactor the list hive partiti...
GitHub user jiangxb1987 opened a pull request:

https://github.com/apache/spark/pull/19444

[SPARK-22214][SQL] Refactor the list hive partitions code

## What changes were proposed in this pull request?

In this PR we make a few changes to the list hive partitions code, to make the code more extensible. The following changes are made:
1. In `HiveClientImpl.getPartitions()`, call `client.getPartitions` instead of `shim.getAllPartitions` when `spec` is empty;
2. In `HiveTableScanExec`, previously we always call `listPartitionsByFilter` if the config `metastorePartitionPruning` is enabled, but actually we'd better call `listPartitions` if `partitionPruningPred` is empty;
3. We should use sessionCatalog instead of SharedState.externalCatalog in `HiveTableScanExec`.

## How was this patch tested?

Tested by existing test cases, since this is a code refactor; no regression or behavior change is expected.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jiangxb1987/spark hivePartitions

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19444.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19444

commit 8f50c7c47934a8dca662e8e2d5eacbc0b394eaa5
Author: Xingbo Jiang
Date: 2017-10-06T11:04:29Z

refactor list hive partitions.
[GitHub] spark issue #19444: [SPARK-22214][SQL] Refactor the list hive partitions cod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19444 **[Test build #82509 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82509/testReport)** for PR 19444 at commit [`8f50c7c`](https://github.com/apache/spark/commit/8f50c7c47934a8dca662e8e2d5eacbc0b394eaa5).
[GitHub] spark pull request #19445: Dataset select all columns
GitHub user sohum2002 opened a pull request:

https://github.com/apache/spark/pull/19445

Dataset select all columns

The two proposed additional functions help select all the columns in a Dataset except for the given columns.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sohum2002/spark dataset_selectAllColumns

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19445.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19445

commit d35a1268d784a268e6137eff54eb8f83c981a289
Author: Burak Yavuz
Date: 2017-02-01T00:52:53Z

[SPARK-19378][SS] Ensure continuity of stateOperator and eventTime metrics even if there is no new data in trigger

In StructuredStreaming, if a new trigger was skipped because no new data arrived, we suddenly report nothing for the metrics `stateOperator`. We could however easily report the metrics from `lastExecution` to ensure continuity of metrics. Regression test in `StreamingQueryStatusAndProgressSuite`.

Author: Burak Yavuz
Closes #16716 from brkyvz/state-agg.
(cherry picked from commit 081b7addaf9560563af0ce25912972e91a78cee6)
Signed-off-by: Tathagata Das

commit 61cdc8c7cc8cfc57646a30da0e0df874a14e3269
Author: Zheng RuiFeng
Date: 2017-02-01T13:27:20Z

[SPARK-19410][DOC] Fix brokens links in ml-pipeline and ml-tuning

## What changes were proposed in this pull request?
Fix brokens links in ml-pipeline and ml-tuning `` -> ``

## How was this patch tested?
manual tests

Author: Zheng RuiFeng
Closes #16754 from zhengruifeng/doc_api_fix.
(cherry picked from commit 04ee8cf633e17b6bf95225a8dd77bf2e06980eb3)
Signed-off-by: Sean Owen

commit f946464155bb907482dc8d8a1b0964a925d04081
Author: Devaraj K
Date: 2017-02-01T20:55:11Z

[SPARK-19377][WEBUI][CORE] Killed tasks should have the status as KILLED

## What changes were proposed in this pull request?
Copying of the killed status was missing while getting the newTaskInfo object by dropping the unnecessary details to reduce the memory usage. This patch adds the copying of the killed status to the newTaskInfo object, which corrects the displayed status from the wrong status to KILLED in the Web UI.

## How was this patch tested?
Current behaviour of displaying tasks in the stage UI page:

| Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 143 | 10 | 0 | SUCCESS | NODE_LOCAL | 6 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |
| 156 | 11 | 0 | SUCCESS | NODE_LOCAL | 5 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |

Web UI display after applying the patch:

| Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 143 | 10 | 0 | KILLED | NODE_LOCAL | 6 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |
| 156 | 11 | 0 | KILLED | NODE_LOCAL | 5 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |

Author: Devaraj K
Closes #16725 from devaraj-kavali/SPARK-19377.
(cherry picked from commit df4a27cc5cae8e251ba2a883bcc5f5ce9282f649)
Signed-off-by: Shixiong Zhu

commit 7c23bd49e826fc2b7f132ffac2e55a71905abe96
Author: Shixiong Zhu
Date: 2017-02-02T05:39:21Z

[SPARK-19432][CORE] Fix an unexpected failure when connecting timeout

## What changes were proposed in this pull request?
When connecting times out, `ask` may fail with a confusing message:

```
17/02/01 23:15:19 INFO Worker: Connecting to master ...
java.lang.IllegalArgumentException: requirement failed: TransportClient has not yet been set.
at scala.Predef$.require(Predef.scala:224)
at
```
[GitHub] spark pull request #19445: Dataset select all columns
Github user sohum2002 closed the pull request at: https://github.com/apache/spark/pull/19445
[GitHub] spark pull request #19444: [SPARK-22214][SQL] Refactor the list hive partiti...
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19444#discussion_r143168926

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---

@@ -638,12 +638,14 @@ private[hive] class HiveClientImpl(
       table: CatalogTable,
       spec: Option[TablePartitionSpec]): Seq[CatalogTablePartition] = withHiveState {
     val hiveTable = toHiveTable(table, Some(userName))
-    val parts = spec match {
-      case None => shim.getAllPartitions(client, hiveTable).map(fromHivePartition)

--- End diff --

After this change, `HiveShim.getAllPartitions` is only used to support `HiveShim.getPartitionsByFilter` for Hive 0.12; we may consider completely removing the method in the future.
[GitHub] spark pull request #19446: Dataset optimization
GitHub user sohum2002 opened a pull request:

https://github.com/apache/spark/pull/19446

Dataset optimization

The two proposed additional functions help select all the columns in a Dataset except for the given columns.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sohum2002/spark dataset_optimization

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19446.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19446

commit 0e80ecae300f3e2033419b2d98da8bf092c105bb
Author: Wenchen Fan
Date: 2017-07-10T05:53:27Z

[SPARK-21100][SQL][FOLLOWUP] cleanup code and add more comments for Dataset.summary

## What changes were proposed in this pull request?
Some code cleanup and adding comments to make the code more readable. Changed the way to generate result rows, to be more clear.

## How was this patch tested?
existing tests

Author: Wenchen Fan
Closes #18570 from cloud-fan/summary.

commit 96d58f285bc98d4c2484150eefe7447db4784a86
Author: Eric Vandenberg
Date: 2017-07-10T06:40:20Z

[SPARK-21219][CORE] Task retry occurs on same executor due to race condition with blacklisting

## What changes were proposed in this pull request?
There's a race condition in the current TaskSetManager where a failed task is added for retry (addPendingTask), and can asynchronously be assigned to an executor *prior* to the blacklist state (updateBlacklistForFailedTask); the result is the task might re-execute on the same executor. This is particularly problematic if the executor is shutting down, since the retry task immediately becomes a lost task (ExecutorLostFailure). Another side effect is that the actual failure reason gets obscured by the retry task, which never actually executed. There are sample logs showing the issue in https://issues.apache.org/jira/browse/SPARK-21219. The fix is to change the ordering of the addPendingTask and updateBlacklistForFailedTask calls in TaskSetManager.handleFailedTask.

## How was this patch tested?
Implemented a unit test that verifies the task is blacklisted before it is added to the pending tasks. Ran the unit test without the fix and it fails. Ran the unit test with the fix and it passes.

Author: Eric Vandenberg
Closes #18427 from ericvandenbergfb/blacklistFix.

commit c444d10868c808f4ae43becd5506bf944d9c2e9b
Author: Dongjoon Hyun
Date: 2017-07-10T06:46:47Z

[MINOR][DOC] Remove obsolete `ec2-scripts.md`

## What changes were proposed in this pull request?
Since this document became obsolete, we had better remove this for Apache Spark 2.3.0. The original document was removed via SPARK-12735 in January 2016, and currently it's just a redirection page. The only reference on the Apache Spark website will go directly to the destination in https://github.com/apache/spark-website/pull/54.

## How was this patch tested?
N/A. This is a removal of documentation.

Author: Dongjoon Hyun
Closes #18578 from dongjoon-hyun/SPARK-REMOVE-EC2.

commit 647963a26a2d4468ebd9b68111ebe68bee501fde
Author: Takeshi Yamamuro
Date: 2017-07-10T07:58:34Z

[SPARK-20460][SQL] Make it more consistent to handle column name duplication

## What changes were proposed in this pull request?
This pr made it more consistent to handle column name duplication. In the current master, error handling is different when hitting column name duplication:

```
// json
scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
scala> Seq("""{"a":1, "a":1}""").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
scala> spark.read.format("json").schema(schema).load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#12, a#13.;
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)

scala> spark.read.format("json").load("/tmp/data").show
org.apache.spark.sql.AnalysisException: Duplicate column(s) : "a" found, cannot save to JSON format;
at org.apache.spark.sql.execution.datasources.json.JsonDataSource.checkConstraints(JsonDataSource.scala:81)
```
[GitHub] spark pull request #19446: Dataset optimization
Github user sohum2002 closed the pull request at: https://github.com/apache/spark/pull/19446
[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18924 Merged build finished. Test PASSed.
[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18924 **[Test build #82505 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82505/testReport)** for PR 18924 at commit [`f181496`](https://github.com/apache/spark/commit/f1814965885e0c82a71287f5e5912e11b126b8a4). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18924 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82505/ Test PASSed.
[GitHub] spark issue #19370: [SPARK-18136] Fix setup of SPARK_HOME variable on Window...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19370 @jsnowacki, would you mind squashing those commits into a single one, so that we can check whether the squashed commit, which has the changes in `appveyor.yml` and `*.cmd`, actually triggers the AppVeyor test?
[GitHub] spark issue #19370: [SPARK-18136] Fix setup of SPARK_HOME variable on Window...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19370 Otherwise, looks good to me.
[GitHub] spark pull request #19399: [SPARK-22175][WEB-UI] Add status column to histor...
Github user caneGuy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19399#discussion_r143114423

--- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---

@@ -487,8 +487,10 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
   protected def mergeApplicationListing(fileStatus: FileStatus): Unit = {
     val eventsFilter: ReplayEventsFilter = { eventString =>
       eventString.startsWith(APPL_START_EVENT_PREFIX) ||
-        eventString.startsWith(APPL_END_EVENT_PREFIX) ||
-        eventString.startsWith(LOG_START_EVENT_PREFIX)
+        eventString.startsWith(APPL_END_EVENT_PREFIX) ||
+        eventString.startsWith(LOG_START_EVENT_PREFIX) ||
+        eventString.startsWith(JOB_START_EVENT_PREFIX) ||
+        eventString.startsWith(JOB_END_EVENT_PREFIX)

--- End diff --

Actually I have not done any benchmark test for this modification, but it has been tested on our production cluster.
[GitHub] spark issue #19399: [SPARK-22175][WEB-UI] Add status column to history page
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/19399 OK, I will wait for SPARK-18085 and think about the log status more carefully. @squito @ajbozarth Thanks.
[GitHub] spark issue #19442: [SPARK-8515][ML][WIP] Improve ML Attribute API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19442 **[Test build #82495 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82495/testReport)** for PR 19442 at commit [`77ced95`](https://github.com/apache/spark/commit/77ced957e7be2169ac0c59c76f60ab9d4fcac3ef). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19442: [SPARK-8515][ML][WIP] Improve ML Attribute API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19442 Merged build finished. Test PASSed.
[GitHub] spark issue #19440: [SPARK-21871][SQL] Fix infinite loop when bytecode size ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19440 **[Test build #82494 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82494/testReport)** for PR 19440 at commit [`b8eb6a0`](https://github.com/apache/spark/commit/b8eb6a0e45ceb9592fbbf32a236aa17cd3e5dac0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19082: [SPARK-21870][SQL] Split aggregation code into small fun...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/19082 sure, I will look into this.
[GitHub] spark issue #19442: [SPARK-8515][ML][WIP] Improve ML Attribute API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19442 **[Test build #82500 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82500/testReport)** for PR 19442 at commit [`de0aa76`](https://github.com/apache/spark/commit/de0aa76199141255258d9d5b12a0d31b1758c6f1).
[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143122396

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---

@@ -1015,6 +1020,10 @@ object DateTimeUtils {
     guess
   }

+  def convertTz(ts: SQLTimestamp, fromZone: String, toZone: String): SQLTimestamp = {
+    convertTz(ts, getTimeZone(fromZone), getTimeZone(toZone))

--- End diff --

performance is going to suck here
[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143122317

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala ---

@@ -1213,6 +1213,71 @@ case class ToUTCTimestamp(left: Expression, right: Expression)
 }

 /**
+ * This modifies a timestamp to show how the display time changes going from one timezone to
+ * another, for the same instant in time.
+ *
+ * We intentionally do not provide an ExpressionDescription as this is not meant to be exposed to
+ * users, its only used for internal conversions.
+ */
+private[spark] case class TimestampTimezoneCorrection(

--- End diff --

do we need a whole expression for this? can't we just reuse existing expressions? It's just simple arithmetic, isn't it?
[GitHub] spark issue #19340: [SPARK-22119][ML] Add cosine distance to KMeans
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19340 I'm kind of neutral given the complexity of adding this, but maybe it's the least complexity you can get away with. @hhbyyh was adding something related: https://issues.apache.org/jira/browse/SPARK-22195
[GitHub] spark issue #18460: [SPARK-21247][SQL] Type comparision should respect case-...
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/18460

Thank you for review, @gatorsmile. The following is a result from Hive 1.2.2.

```sql
hive> CREATE TABLE T AS SELECT named_struct('a',1);
hive> CREATE TABLE S AS SELECT named_struct('A',1);
hive> SELECT * FROM T UNION ALL SELECT * FROM S;
{"a":1}
{"a":1}
```
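For readers following along in PySpark, here is a rough analogue of the Hive example above (a sketch only, assuming a running `spark` session); whether the union resolves or fails is exactly the case-sensitivity question SPARK-21247 is about:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Two relations whose struct columns differ only in the case of the field name.
df_t = spark.range(1).select(F.struct(F.lit(1).alias("a")).alias("c"))
df_s = spark.range(1).select(F.struct(F.lit(1).alias("A")).alias("c"))

# Hive 1.2.2 resolves the equivalent UNION ALL; whether Spark accepts this
# depends on how its type comparison treats field-name case.
df_t.union(df_s).show()
```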
[GitHub] spark issue #19442: [SPARK-8515][ML][WIP] Improve ML Attribute API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19442 Merged build finished. Test PASSed.
[GitHub] spark issue #19442: [SPARK-8515][ML][WIP] Improve ML Attribute API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19442 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82503/ Test PASSed.
[GitHub] spark issue #18460: [SPARK-21247][SQL] Type comparision should respect case-...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/18460 @gatorsmile I updated the previous comment with more examples.
[GitHub] spark issue #19442: [SPARK-8515][ML][WIP] Improve ML Attribute API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19442 **[Test build #82503 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82503/testReport)** for PR 19442 at commit [`de0aa76`](https://github.com/apache/spark/commit/de0aa76199141255258d9d5b12a0d31b1758c6f1). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18924 **[Test build #82505 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82505/testReport)** for PR 18924 at commit [`f181496`](https://github.com/apache/spark/commit/f1814965885e0c82a71287f5e5912e11b126b8a4).
[GitHub] spark issue #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in Python...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19443 **[Test build #82507 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82507/testReport)** for PR 19443 at commit [`9e52c63`](https://github.com/apache/spark/commit/9e52c6380ae8787d20e3442cfaf42cfb70caf4dc).
[GitHub] spark issue #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in Python...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19443 **[Test build #82507 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82507/testReport)** for PR 19443 at commit [`9e52c63`](https://github.com/apache/spark/commit/9e52c6380ae8787d20e3442cfaf42cfb70caf4dc). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in Python...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19443 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82507/ Test FAILed.
[GitHub] spark issue #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in Python...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19443 Merged build finished. Test FAILed.
[GitHub] spark issue #19370: [SPARK-18136] Fix setup of SPARK_HOME variable on Window...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19370 **[Test build #82508 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82508/testReport)** for PR 19370 at commit [`5f52c79`](https://github.com/apache/spark/commit/5f52c791cda81323ac985ce18796ea4131c30923).
[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...
Github user akopich commented on a diff in the pull request:

https://github.com/apache/spark/pull/18924#discussion_r143159334

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---

@@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
     val expElogbetaBc = batch.sparkContext.broadcast(expElogbeta)
     val alpha = this.alpha.asBreeze
     val gammaShape = this.gammaShape
-
-    val stats: RDD[(BDM[Double], List[BDV[Double]])] = batch.mapPartitions { docs =>
+    val optimizeDocConcentration = this.optimizeDocConcentration
+    // If and only if optimizeDocConcentration is set true,
+    // we calculate logphat in the same pass as other statistics.
+    // No calculation of loghat happens otherwise.
+    val logphatPartOptionBase = () => if (optimizeDocConcentration) {
+        Some(BDV.zeros[Double](k))
+      } else {
+        None
+      }
+
+    val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitions { docs =>
       val nonEmptyDocs = docs.filter(_._2.numNonzeros > 0)
       val stat = BDM.zeros[Double](k, vocabSize)
-      var gammaPart = List[BDV[Double]]()
+      val logphatPartOption = logphatPartOptionBase()
+      var nonEmptyDocCount : Long = 0L
       nonEmptyDocs.foreach { case (_, termCounts: Vector) =>
+        nonEmptyDocCount += 1
        val (gammad, sstats, ids) = OnlineLDAOptimizer.variationalTopicInference(
          termCounts, expElogbetaBc.value, alpha, gammaShape, k)
-        stat(::, ids) := stat(::, ids).toDenseMatrix + sstats
-        gammaPart = gammad :: gammaPart
+        stat(::, ids) := stat(::, ids) + sstats
+        logphatPartOption.foreach(_ += LDAUtils.dirichletExpectation(gammad))
       }
-      Iterator((stat, gammaPart))
-    }.persist(StorageLevel.MEMORY_AND_DISK)
-    val statsSum: BDM[Double] = stats.map(_._1).treeAggregate(BDM.zeros[Double](k, vocabSize))(
-      _ += _, _ += _)
-    val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(
-      stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): _*)
-    stats.unpersist()
+      Iterator((stat, logphatPartOption, nonEmptyDocCount))
+    }
+
+    val elementWiseSum = (u : (BDM[Double], Option[BDV[Double]], Long),
+        v : (BDM[Double], Option[BDV[Double]], Long)) => {

--- End diff --

I see now. Thank you. But it seems like the style guide suggests moving both of the parameters to a new line.
[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 So, shall we ping @jkbradley?
[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18924 **[Test build #82506 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82506/testReport)** for PR 18924 at commit [`a81dae5`](https://github.com/apache/spark/commit/a81dae574f2085ec390effd1b9b1962970f00239).
[GitHub] spark pull request #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in...
GitHub user jsnowacki opened a pull request:

https://github.com/apache/spark/pull/19443

[SPARK-22212][SQL][PySpark] Some SQL functions in Python fail with string column name

## What changes were proposed in this pull request?

The issue in JIRA: [SPARK-22212](https://issues.apache.org/jira/browse/SPARK-22212)

Most of the functions in `pyspark.sql.functions` allow usage of both a column name string and a `Column` object. But there are some functions, like `trim`, that require passing only a `Column`. See the code below for an explanation.

```
>>> import pyspark.sql.functions as func
>>> df = spark.createDataFrame([tuple(l) for l in "abcde"], ["text"])
>>> df.select(func.trim(df["text"])).show()
+--+
|trim(text)|
+--+
| a|
| b|
| c|
| d|
| e|
+--+
>>> df.select(func.trim("text")).show()
[...]
Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.trim. Trace:
py4j.Py4JException: Method trim([class java.lang.String]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
```

This is because most of the Python function calls map a column name to a `Column` in the Python function mapping, but functions created via `_create_function` pass arguments as is if they are not a `Column`. On the other hand, the few functions that require the column name have been moved to `functions_by_column_name` and are created by `_create_function_by_column_name`. Note that this is only a Python-side fix; some Scala functions still do not have a method to call them by string column name.

## How was this patch tested?

Additional Python tests were written to accommodate this. It was tested via `UnitTest` in the IDE and the overall `python\run_tests` script.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jsnowacki/spark-1 fix_func_str_to_col

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19443.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19443

commit c5dbd50361a37e9833708dc8985345fbf537e8d9
Author: Jakub Nowacki
Date: 2017-10-03T07:50:50Z

[SPARK-22212] Fixing string to column mapping in Python functions

commit 9e52c6380ae8787d20e3442cfaf42cfb70caf4dc
Author: Jakub Nowacki
Date: 2017-10-06T09:07:26Z

[SPARK-22212] Calling functions by string column name fixed and tested
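As a side note, below is a minimal sketch of the kind of name-to-Column coercion the PR description refers to; `_to_col` and `trim_compat` are hypothetical names used for illustration only, not the actual patch:

```python
from pyspark.sql import functions as F
from pyspark.sql.column import Column

def _to_col(name_or_col):
    # Coerce a column-name string into a Column, mirroring what most
    # entries in pyspark.sql.functions already do internally.
    return name_or_col if isinstance(name_or_col, Column) else F.col(name_or_col)

def trim_compat(name_or_col):
    # Hypothetical wrapper: with the coercion in place, both
    # trim_compat("text") and trim_compat(df["text"]) behave the same.
    return F.trim(_to_col(name_or_col))
```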
[GitHub] spark issue #19370: [SPARK-18136] Fix setup of SPARK_HOME variable on Window...
Github user jsnowacki commented on the issue: https://github.com/apache/spark/pull/19370 I've added `- bin/*.cmd` to the AppVeyor file. Please let me know if this is sufficient.
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r143313284

--- Diff: python/pyspark/sql/group.py ---

@@ -192,7 +193,69 @@ def pivot(self, pivot_col, values=None):
             jgd = self._jgd.pivot(pivot_col)
         else:
             jgd = self._jgd.pivot(pivot_col, values)
-        return GroupedData(jgd, self.sql_ctx)
+        return GroupedData(jgd, self._df)
+
+    @since(2.3)
+    def apply(self, udf):
+        """
+        Maps each group of the current :class:`DataFrame` using a pandas udf and returns the result
+        as a :class:`DataFrame`.
+
+        The user-defined function should take a `pandas.DataFrame` and return another
+        `pandas.DataFrame`. For each group, all columns are passed together as a `pandas.DataFrame`
+        to the user-function and the returned `pandas.DataFrame` are combined as a
+        :class:`DataFrame`. The returned `pandas.DataFrame` can be arbitrary length and its schema
+        must match the returnType of the pandas udf.
+
+        :param udf: A wrapped udf function returned by :meth:`pyspark.sql.functions.pandas_udf`
+
+        >>> from pyspark.sql.functions import pandas_udf
+        >>> df = spark.createDataFrame(
+        ...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
+        ...     ("id", "v"))
+        >>> @pandas_udf(returnType=df.schema)
+        ... def normalize(pdf):
+        ...     v = pdf.v
+        ...     return pdf.assign(v=(v - v.mean()) / v.std())
+        >>> df.groupby('id').apply(normalize).show()  # doctest: +SKIP
+        +---+---+
+        | id|  v|
+        +---+---+
+        |  1|-0.7071067811865475|
+        |  1| 0.7071067811865475|
+        |  2|-0.8320502943378437|
+        |  2|-0.2773500981126146|
+        |  2| 1.1094003924504583|
+        +---+---+
+
+        .. seealso:: :meth:`pyspark.sql.functions.pandas_udf`
+
+        """
+        from pyspark.sql.functions import pandas_udf
+
+        # Columns are special because hasattr always return True
+        if isinstance(udf, Column) or not hasattr(udf, 'func') or not udf.vectorized:
+            raise ValueError("The argument to apply must be a pandas_udf")
+        if not isinstance(udf.returnType, StructType):
+            raise ValueError("The returnType of the pandas_udf must be a StructType")
+
+        df = self._df
+        func = udf.func
+        returnType = udf.returnType

--- End diff --

is it necessary to make all these copies? I could understand maybe copying `func` and `columns` because they are in the wrapped function, but not sure if `df` and `returnType` need to be copied
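As general background for the question about the local copies (this is not the PR's code, and whether `df` and `returnType` really need copies is exactly what is being asked here): assigning attributes to locals before defining a nested function controls what the closure captures when it is later pickled and shipped to workers.

```python
class GroupedDataLike:
    """Toy stand-in for GroupedData; `_jgd` represents a JVM-backed handle
    that cannot be pickled."""

    def __init__(self, jgd, df):
        self._jgd = jgd
        self._df = df

    def apply_capturing_object(self, udf):
        def wrapped(pdf):
            # Looking up udf.func here keeps the whole udf object (and
            # anything reachable from it) inside the serialized closure.
            return udf.func(pdf)
        return wrapped

    def apply_with_locals(self, udf):
        func = udf.func  # copy only the plain callable into a local
        def wrapped(pdf):
            # Closes over a plain function object only.
            return func(pdf)
        return wrapped
```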
[GitHub] spark issue #19082: [SPARK-21870][SQL] Split aggregation code into small fun...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19082 Sure, I totally agree. We need to know the advantages and possible impacts, if any, of merging this PR and #18931. It would be good if @kiszk and @rednaxelafx can help review this PR and #18931.
[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18931 **[Test build #82531 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82531/testReport)** for PR 18931 at commit [`601c225`](https://github.com/apache/spark/commit/601c2251c397b30f2ea9a42f6a23e3636129d5bc).
[GitHub] spark issue #19394: [SPARK-22170][SQL] Reduce memory consumption in broadcas...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19394 **[Test build #82532 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82532/testReport)** for PR 19394 at commit [`56089f5`](https://github.com/apache/spark/commit/56089f5ba65f1d7d9e11b76673bcde3df37cd240).
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18732 retest this please
[GitHub] spark issue #19082: [SPARK-21870][SQL] Split aggregation code into small fun...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19082 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82499/ Test FAILed.
[GitHub] spark issue #19082: [SPARK-21870][SQL] Split aggregation code into small fun...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19082 Merged build finished. Test FAILed.
[GitHub] spark issue #19082: [SPARK-21870][SQL] Split aggregation code into small fun...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19082 **[Test build #82502 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82502/testReport)** for PR 19082 at commit [`1880dfd`](https://github.com/apache/spark/commit/1880dfdfedbdef11d39cb092202a6bc7db95e374).
[GitHub] spark issue #19452: [SPARK-22136][SS] Evaluate one-sided conditions early in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19452 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82533/ Test PASSed.
[GitHub] spark issue #19452: [SPARK-22136][SS] Evaluate one-sided conditions early in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19452 **[Test build #82533 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82533/testReport)** for PR 19452 at commit [`8c2a39f`](https://github.com/apache/spark/commit/8c2a39fcb3e425a91d25505ae9d29ba8ac670e0e). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` case class JoinConditionSplitPredicates(`
[GitHub] spark issue #19452: [SPARK-22136][SS] Evaluate one-sided conditions early in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19452 Merged build finished. Test PASSed.
[GitHub] spark issue #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in Python...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19443 This might look okay on the Python side, because the fix looks minimal and does not actually increase complexity much; however, I think we should focus on API consistency with the other languages in general. In this sense, we tend to avoid adding overloads with string parameters on the Scala side; please see https://github.com/apache/spark/pull/18144#issuecomment-304960488, https://github.com/apache/spark/pull/18144#issuecomment-304926567 and https://github.com/apache/spark/pull/18144#issuecomment-304955155. I am -0 on this because the workaround is simple anyway.
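For reference, the simple workaround mentioned here is to pass a `Column` instead of a name string (a sketch, reusing the `df` with a `text` column from the PR description):

```python
import pyspark.sql.functions as func

# Fails today on functions that only accept a Column:
# df.select(func.trim("text")).show()

# Works today: wrap the name with col(), or index the DataFrame.
df.select(func.trim(func.col("text"))).show()
df.select(func.trim(df["text"])).show()
```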
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19294 **[Test build #82504 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82504/testReport)** for PR 19294 at commit [`e41abc6`](https://github.com/apache/spark/commit/e41abc65c3ffeaec8c03c0d093a5c5efcd30c17e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19294 Merged build finished. Test PASSed.
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19294 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82504/ Test PASSed.
[GitHub] spark issue #19294: [SPARK-21549][CORE] Respect OutputFormats with no output...
Github user szhem commented on the issue: https://github.com/apache/spark/pull/19294 @mridulm sql-related tests were removed.
[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/18664

Thanks @gatorsmile for the constructive feedback! I don't want to make this more complicated, but I also want to make sure we are aware that there are also differences between the Arrow and non-Arrow versions when handling array and struct types:

Array:
```
non-Arrow:
In [47]: type(df2.toPandas().array[0])
Out[47]: list

Arrow:
In [45]: type(df2.toPandas().array[0])
Out[45]: numpy.ndarray
```

Struct:
```
Arrow:
In [35]: type(df.toPandas().struct[0])
Out[35]: pyspark.sql.types.Row

non-Arrow:
In [37]: type(df.toPandas().struct[0])
Out[37]: dict
```

I think there should be a high-level doc capturing all differences between the Arrow and non-Arrow versions. Unfortunately I cannot commit much time until Nov, but I am happy to help with review and discussion.
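Until such a doc exists, one defensive option is to normalize the converted values so downstream code behaves the same with or without Arrow. A sketch only, assuming a DataFrame `df` with columns named `array` and `struct` as in the examples above:

```python
import numpy as np
from pyspark.sql import Row

pdf = df.toPandas()

# Array columns: numpy.ndarray in one mode, plain list in the other. Normalize to lists.
pdf["array"] = pdf["array"].apply(
    lambda v: v.tolist() if isinstance(v, np.ndarray) else v)

# Struct columns: Row in one mode, dict in the other. Normalize to dicts.
pdf["struct"] = pdf["struct"].apply(
    lambda v: v.asDict() if isinstance(v, Row) else v)
```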
[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/18664 cc @ueshin