[GitHub] spark issue #18810: [SPARK-21603][sql]The wholestage codegen will be much sl...

2017-08-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18810
  
**[Test build #80325 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80325/testReport)** for PR 18810 at commit [`7e84753`](https://github.com/apache/spark/commit/7e84753ca9befc8f3cea872250b2145e132ac837).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17357: [SPARK-20025][CORE] Ignore SPARK_LOCAL* env, whil...

2017-08-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17357#discussion_r131582810
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala ---
@@ -139,7 +139,9 @@ private[rest] class StandaloneSubmitRequestServlet(
     val driverExtraLibraryPath = sparkProperties.get("spark.driver.extraLibraryPath")
     val superviseDriver = sparkProperties.get("spark.driver.supervise")
     val appArgs = request.appArgs
-    val environmentVariables = request.environmentVariables
+    // Filter SPARK_LOCAL environment variables from being set on the remote system.
+    val environmentVariables =
+      request.environmentVariables.filterNot(_._1.startsWith("SPARK_LOCAL"))
--- End diff --

I guess the driver might not use `SPARK_LOCAL_DIRS`. But yes, we may only need to filter out `SPARK_LOCAL_IP` and `SPARK_LOCAL_HOSTNAME`.
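The narrower filter suggested above could be sketched as follows. The variable names come from the review thread; `filterEnv` and the sample map are illustrative, not Spark's actual code:

```scala
// Sketch: drop only the variables that must not leak to the remote driver,
// rather than everything prefixed with SPARK_LOCAL.
object FilterEnvExample {
  private val excluded = Set("SPARK_LOCAL_IP", "SPARK_LOCAL_HOSTNAME")

  def filterEnv(env: Map[String, String]): Map[String, String] =
    env.filterNot { case (name, _) => excluded.contains(name) }

  def main(args: Array[String]): Unit = {
    val env = Map(
      "SPARK_LOCAL_IP" -> "192.168.0.5",
      "SPARK_LOCAL_DIRS" -> "/tmp/spark",   // kept: the driver may still need it
      "SPARK_USER" -> "alice")
    // Prints the surviving variable names: SPARK_LOCAL_DIRS,SPARK_USER
    println(filterEnv(env).keys.toList.sorted.mkString(","))
  }
}
```

With this version, `SPARK_LOCAL_DIRS` still reaches the remote system, which is the behavioral difference under discussion.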





[GitHub] spark pull request #18866: [SPARK-21649][SQL] Support writing data into hive...

2017-08-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18866#discussion_r131582440
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala ---
@@ -262,7 +262,12 @@ case class HashPartitioning(expressions: Seq[Expression], numPartitions: Int)
    * Returns an expression that will produce a valid partition ID (i.e. non-negative and is less
    * than numPartitions) based on hashing expressions.
    */
-  def partitionIdExpression: Expression = Pmod(new Murmur3Hash(expressions), Literal(numPartitions))
+  def partitionIdExpression(useHiveHash: Boolean = false): Expression =
+    if (useHiveHash) {
+      Pmod(new HiveHash(expressions), Literal(numPartitions))
--- End diff --

I saw that `HiveHash simulates Hive's hashing function from Hive v1.2.1...`. Is there any compatibility issue with Hive versions before 1.2.1?
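For context, `Pmod(hash, Literal(numPartitions))` in the diff above maps any hash value, negative included, to a valid partition ID. A minimal sketch of that arithmetic, with made-up hash values and a hypothetical `partitionId` helper standing in for the expression tree:

```scala
// Sketch of what Pmod(hash, Literal(numPartitions)) computes: a positive
// modulus, so the result is always in [0, numPartitions) even for negative
// hashes. Spark would plug in Murmur3Hash or HiveHash for `hash`.
object PartitionIdExample {
  def partitionId(hash: Int, numPartitions: Int): Int =
    ((hash % numPartitions) + numPartitions) % numPartitions

  def main(args: Array[String]): Unit = {
    // Plain % would give -3 here; pmod keeps the ID non-negative.
    println(partitionId(-7, 4)) // 1
    println(partitionId(10, 4)) // 2
  }
}
```

The compatibility question above is then about whether `HiveHash` itself matches older Hive releases, since the pmod step is hash-agnostic.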





[GitHub] spark issue #18801: SPARK-10878 Fix race condition when multiple clients res...

2017-08-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18801
  
**[Test build #80324 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80324/testReport)** for PR 18801 at commit [`1ace5cc`](https://github.com/apache/spark/commit/1ace5cc8232536bcc336042aec686fed1204f799).





[GitHub] spark issue #18801: SPARK-10878 Fix race condition when multiple clients res...

2017-08-06 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18801
  
ok to test





[GitHub] spark issue #18846: [SPARK-21642][CORE] Use FQDN for DRIVER_HOST_ADDRESS ins...

2017-08-06 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/18846
  
Should we also apply this change to `RpcEnv` ? @zsxwing 





[GitHub] spark issue #12147: [SPARK-14361][SQL]Window function exclude clause

2017-08-06 Thread xwu0226
Github user xwu0226 commented on the issue:

https://github.com/apache/spark/pull/12147
  
@HyukjinKwon My rebased branch has broken most of the window exclude test 
cases. Trying to fix. 





[GitHub] spark issue #18846: [SPARK-21642][CORE] Use FQDN for DRIVER_HOST_ADDRESS ins...

2017-08-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18846
  
**[Test build #80323 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80323/testReport)** for PR 18846 at commit [`afc07ee`](https://github.com/apache/spark/commit/afc07ee14974a38c3b6912dfd2943084d25eeccf).





[GitHub] spark issue #18846: [SPARK-21642][CORE] Use FQDN for DRIVER_HOST_ADDRESS ins...

2017-08-06 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18846
  
ok to test





[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...

2017-08-06 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/18865
  
cc @gatorsmile @cloud-fan Can you help trigger Jenkins for this? Thanks.





[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...

2017-08-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18865
  
Can one of the admins verify this patch?





[GitHub] spark issue #18866: [SPARK-21649][SQL] Support writing data into hive bucket...

2017-08-06 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/18866
  
I added the unit test referring to https://github.com/apache/hive/blob/branch-1/ql/src/java/org/apache/hadoop/hive/ql/optimizer/AbstractBucketJoinProc.java#L393. Hive sorts bucket files by file name when doing an SMB join.





[GitHub] spark issue #18866: [SPARK-21649][SQL] Support writing data into hive bucket...

2017-08-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18866
  
**[Test build #80322 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80322/testReport)** for PR 18866 at commit [`51d2c11`](https://github.com/apache/spark/commit/51d2c110d01b8a4ef1d53d144c443e0e9b43817b).





[GitHub] spark pull request #18865: [SPARK-21610][SQL] Corrupt records are not handle...

2017-08-06 Thread jmchung
GitHub user jmchung reopened a pull request:

https://github.com/apache/spark/pull/18865

[SPARK-21610][SQL] Corrupt records are not handled properly when creating a 
dataframe from a file

## What changes were proposed in this pull request?
```
echo '{"field": 1}
{"field": 2}
{"field": "3"}' >/tmp/sample.json
```

```scala
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("field", ByteType)
  .add("_corrupt_record", StringType)

val file = "/tmp/sample.json"

val dfFromFile = spark.read.schema(schema).json(file)

scala> dfFromFile.show(false)
+-----+---------------+
|field|_corrupt_record|
+-----+---------------+
|1    |null           |
|2    |null           |
|null |{"field": "3"} |
+-----+---------------+

scala> dfFromFile.filter($"_corrupt_record".isNotNull).count()
res1: Long = 0

scala> dfFromFile.filter($"_corrupt_record".isNull).count()
res2: Long = 3
```
When the `requiredSchema` contains only `_corrupt_record`, the derived `actualSchema` is empty and `_corrupt_record` is null for all rows. When users require only `_corrupt_record`, we assume that the corrupt records should be produced against all JSON fields.
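The schema derivation described above can be sketched with plain collections; `actualSchema` here is a simplified stand-in for the real `StructType` logic, showing why nothing is ever marked corrupt when `_corrupt_record` is the only required field:

```scala
// Sketch of the bug's mechanism: actualSchema = requiredSchema minus the
// corrupt-record column. When _corrupt_record is the only required field,
// the parser is handed an empty schema, so every row trivially "parses"
// and _corrupt_record stays null.
object CorruptRecordSchemaExample {
  val corruptField = "_corrupt_record"

  def actualSchema(requiredSchema: Seq[String]): Seq[String] =
    requiredSchema.filterNot(_ == corruptField)

  def main(args: Array[String]): Unit = {
    println(actualSchema(Seq("field", corruptField))) // List(field)
    println(actualSchema(Seq(corruptField)))          // List() -> nothing to parse
  }
}
```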

## How was this patch tested?

Added test case.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jmchung/spark SPARK-21610

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18865.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18865


commit 09aa76cc228162edba7ece45063592cd17ae4a27
Author: Jen-Ming Chung 
Date:   2017-08-07T03:52:45Z

[SPARK-21610][SQL] Corrupt records are not handled properly when creating a 
dataframe from a file

commit f73c3874a9e6a35344a3dc8f6ec8cfb17a1be2f8
Author: Jen-Ming Chung 
Date:   2017-08-07T04:39:36Z

add explanation to schema change and minor refactor in test case

commit 7a595984f16f6c998883f271bf63e2e84af5f046
Author: Jen-Ming Chung 
Date:   2017-08-07T04:59:07Z

move test case from DataFrameReaderWriterSuite to JsonSuite

commit 97290f0f891f4261bf173c5ff596d0bb33168d57
Author: Jen-Ming Chung 
Date:   2017-08-07T05:41:15Z

filter not _corrupt_record in dataSchema

commit f5eec40d51bec8ed0f79f52c5a408ba98f26ca1a
Author: Jen-Ming Chung 
Date:   2017-08-07T06:17:48Z

code refactor







[GitHub] spark pull request #18866: [SPARK-21649][SQL] Support writing data into hive...

2017-08-06 Thread jinxing64
GitHub user jinxing64 opened a pull request:

https://github.com/apache/spark/pull/18866

[SPARK-21649][SQL] Support writing data into hive bucket table.

## What changes were proposed in this pull request?

Support writing to Hive bucket tables. Spark internally uses Murmur3Hash for partitioning; we can use Hive hash for compatibility when writing to a bucket table.

## How was this patch tested?

Unit test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jinxing64/spark SPARK-21649

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18866.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18866


commit 51d2c110d01b8a4ef1d53d144c443e0e9b43817b
Author: jinxing 
Date:   2017-08-07T04:12:56Z

Support writing data into hive bucket table.







[GitHub] spark pull request #18865: [SPARK-21610][SQL] Corrupt records are not handle...

2017-08-06 Thread jmchung
Github user jmchung closed the pull request at:

https://github.com/apache/spark/pull/18865





[GitHub] spark pull request #18865: [SPARK-21610][SQL] Corrupt records are not handle...

2017-08-06 Thread jmchung
GitHub user jmchung opened a pull request:

https://github.com/apache/spark/pull/18865

[SPARK-21610][SQL] Corrupt records are not handled properly when creating a 
dataframe from a file

## What changes were proposed in this pull request?
```
echo '{"field": 1}
{"field": 2}
{"field": "3"}' >/tmp/sample.json
```

```scala
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("field", ByteType)
  .add("_corrupt_record", StringType)

val file = "/tmp/sample.json"

val dfFromFile = spark.read.schema(schema).json(file)

scala> dfFromFile.show(false)
+-----+---------------+
|field|_corrupt_record|
+-----+---------------+
|1    |null           |
|2    |null           |
|null |{"field": "3"} |
+-----+---------------+

scala> dfFromFile.filter($"_corrupt_record".isNotNull).count()
res1: Long = 0

scala> dfFromFile.filter($"_corrupt_record".isNull).count()
res2: Long = 3
```
When the `requiredSchema` contains only `_corrupt_record`, the derived `actualSchema` is empty and `_corrupt_record` is null for all rows. When users require only `_corrupt_record`, we assume that the corrupt records should be produced against all JSON fields.

## How was this patch tested?

Added test case.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jmchung/spark SPARK-21610

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18865.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18865


commit 09aa76cc228162edba7ece45063592cd17ae4a27
Author: Jen-Ming Chung 
Date:   2017-08-07T03:52:45Z

[SPARK-21610][SQL] Corrupt records are not handled properly when creating a 
dataframe from a file

commit f73c3874a9e6a35344a3dc8f6ec8cfb17a1be2f8
Author: Jen-Ming Chung 
Date:   2017-08-07T04:39:36Z

add explanation to schema change and minor refactor in test case

commit 7a595984f16f6c998883f271bf63e2e84af5f046
Author: Jen-Ming Chung 
Date:   2017-08-07T04:59:07Z

move test case from DataFrameReaderWriterSuite to JsonSuite

commit 97290f0f891f4261bf173c5ff596d0bb33168d57
Author: Jen-Ming Chung 
Date:   2017-08-07T05:41:15Z

filter not _corrupt_record in dataSchema

commit f5eec40d51bec8ed0f79f52c5a408ba98f26ca1a
Author: Jen-Ming Chung 
Date:   2017-08-07T06:17:48Z

code refactor







[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...

2017-08-06 Thread kevinyu98
Github user kevinyu98 commented on a diff in the pull request:

https://github.com/apache/spark/pull/12646#discussion_r131578995
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala ---
@@ -1121,6 +1125,30 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
   }

   /**
+   * Create a function name LTRIM for TRIM(Leading), RTRIM for TRIM(Trailing), TRIM for TRIM(BOTH),
+   * otherwise, return the original function identifier.
+   */
+  private def replaceTrimFunction(funcID: FunctionIdentifier, ctx: FunctionCallContext)
+    : FunctionIdentifier = {
--- End diff --

ok.





[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...

2017-08-06 Thread kevinyu98
Github user kevinyu98 commented on a diff in the pull request:

https://github.com/apache/spark/pull/12646#discussion_r131579031
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala ---
@@ -1121,6 +1125,30 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
   }

   /**
+   * Create a function name LTRIM for TRIM(Leading), RTRIM for TRIM(Trailing), TRIM for TRIM(BOTH),
+   * otherwise, return the original function identifier.
+   */
+  private def replaceTrimFunction(funcID: FunctionIdentifier, ctx: FunctionCallContext)
+    : FunctionIdentifier = {
+    val opt = ctx.trimOption
+    if (opt != null) {
+      if (ctx.qualifiedName.getText.toLowerCase != "trim") {
+        throw new ParseException(s"The specified function ${ctx.qualifiedName.getText} " +
+          s"doesn't support with option ${opt.getText}.", ctx)
+      }
+      opt.getType match {
+        case SqlBaseParser.BOTH => funcID
+        case SqlBaseParser.LEADING => funcID.copy(funcName = "ltrim")
+        case SqlBaseParser.TRAILING => funcID.copy(funcName = "rtrim")
+        case _ => throw new ParseException(s"Function trim doesn't support with" +
--- End diff --

ok





[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...

2017-08-06 Thread kevinyu98
Github user kevinyu98 commented on a diff in the pull request:

https://github.com/apache/spark/pull/12646#discussion_r131578980
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -2304,7 +2304,15 @@ object functions {
* @group string_funcs
* @since 1.5.0
*/
-  def ltrim(e: Column): Column = withExpr {StringTrimLeft(e.expr) }
+  def ltrim(e: Column): Column = withExpr {StringTrimLeft(e.expr)}
--- End diff --

sure, I will change.






[GitHub] spark pull request #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to r...

2017-08-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18820#discussion_r131578382
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1393,6 +1393,16 @@ def replace(self, to_replace, value=None, 
subset=None):
 |null|  null| null|
 ++--+-+
 
+>>> df4.na.replace('Alice', None).show()
--- End diff --

OK. I'm fine with this.





[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18861
  
**[Test build #80321 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80321/testReport)** for PR 18861 at commit [`c0306d3`](https://github.com/apache/spark/commit/c0306d346e336a3bae6335e27f676c3254d915cb).





[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...

2017-08-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18810#discussion_r131576044
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -356,6 +356,16 @@ class CodegenContext {
   private val placeHolderToComments = new mutable.HashMap[String, String]
 
   /**
+   * Returns the length of codegen function  is too long or not
+   */
+  def existTooLongFunction(): Boolean = {
--- End diff --

How about adding the checking logic here, instead of returning `Boolean`?





[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...

2017-08-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18810#discussion_r131575786
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala ---
@@ -370,6 +370,12 @@ case class WholeStageCodegenExec(child: SparkPlan) extends UnaryExecNode with Co

   override def doExecute(): RDD[InternalRow] = {
     val (ctx, cleanedSource) = doCodeGen()
+    val existLongFunction = ctx.existTooLongFunction
+    if (existLongFunction) {
+      logWarning(s"Function is too long, Whole-stage codegen disabled for this plan:\n "
+        + s"$treeString")
--- End diff --

This could be very big. Please follow what was done in https://github.com/apache/spark/pull/18658.





[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...

2017-08-06 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18764
  
@SparkQA Take a test, please.





[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18861
  
**[Test build #80320 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80320/testReport)** for PR 18861 at commit [`413b0eb`](https://github.com/apache/spark/commit/413b0eb55659d31cd21fbc1c858d3da1603d2248).





[GitHub] spark issue #18864: [SPARK-21648] [SQL] Fix confusing assert failure in JDBC...

2017-08-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18864
  
**[Test build #80319 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80319/testReport)** for PR 18864 at commit [`e4aac50`](https://github.com/apache/spark/commit/e4aac502d58972063a1ab25f17a1c217abe97b97).





[GitHub] spark issue #18864: [SPARK-21648] [SQL] Fix confusing assert failure in JDBC...

2017-08-06 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18864
  
cc @zsxwing @cloud-fan 





[GitHub] spark pull request #18864: [SPARK-21648] [SQL] Fix confusing assert failure ...

2017-08-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18864#discussion_r131574704
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala ---
@@ -29,17 +29,22 @@ class JdbcRelationProvider extends CreatableRelationProvider
   override def createRelation(
       sqlContext: SQLContext,
       parameters: Map[String, String]): BaseRelation = {
+    import JDBCOptions._
+
     val jdbcOptions = new JDBCOptions(parameters)
     val partitionColumn = jdbcOptions.partitionColumn
     val lowerBound = jdbcOptions.lowerBound
     val upperBound = jdbcOptions.upperBound
     val numPartitions = jdbcOptions.numPartitions

     val partitionInfo = if (partitionColumn.isEmpty) {
-      assert(lowerBound.isEmpty && upperBound.isEmpty)
--- End diff --

cc @dongjoon-hyun 





[GitHub] spark pull request #18864: [SPARK-21648] [SQL] Fix confusing assert failure ...

2017-08-06 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/18864

[SPARK-21648] [SQL] Fix confusing assert failure in JDBC source when 
parallel fetching parameters are not properly provided.

### What changes were proposed in this pull request?
```SQL
CREATE TABLE mytesttable1 
USING org.apache.spark.sql.jdbc 
  OPTIONS ( 
  url 
'jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}?user=${jdbcUsername}&password=${jdbcPassword}',
 
  dbtable 'mytesttable1', 
  paritionColumn 'state_id', 
  lowerBound '0', 
  upperBound '52', 
  numPartitions '53', 
  fetchSize '1' 
)
```

The above option name `paritionColumn` is misspelled. That means users did not
actually provide a value for `partitionColumn`. In such a case, users hit a
confusing error.

```
AssertionError: assertion failed
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:39)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:312)
```
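
As a hedged illustration (plain Python with hypothetical names, not Spark's actual implementation), option validation along these lines could raise a descriptive error instead of a bare assertion:

```python
def validate_partition_options(options):
    """Fail fast with a descriptive message when partition-bound options are
    given without 'partitionColumn' (e.g. because it was misspelled).
    A hypothetical sketch, not Spark's JdbcRelationProvider code."""
    bound_keys = [k for k in ("lowerBound", "upperBound", "numPartitions")
                  if k in options]
    if "partitionColumn" not in options and bound_keys:
        raise ValueError(
            "When 'partitionColumn' is not specified, "
            + ", ".join(repr(k) for k in bound_keys)
            + " should not be specified either; check the option name for typos.")
    return options
```

With the misspelled `paritionColumn` option above, this would point users at the real problem rather than `assertion failed`.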


### How was this patch tested?
Added a test case

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark jdbcPartCol

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18864.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18864


commit e4aac502d58972063a1ab25f17a1c217abe97b97
Author: gatorsmile 
Date:   2017-08-05T05:38:15Z

improve message.







[GitHub] spark issue #18855: [SPARK-3151][Block Manager] DiskStore.getBytes fails for...

2017-08-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18855
  
Yes, please refer to 
http://apache-spark-developers-list.1001551.n3.nabble.com/Some-PRs-not-automatically-linked-to-JIRAs-td22067.html
 It looks like there are some problems related to it.





[GitHub] spark issue #18830: [SPARK-21621][Core] Reset numRecordsWritten after DiskBl...

2017-08-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18830
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18830: [SPARK-21621][Core] Reset numRecordsWritten after DiskBl...

2017-08-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18830
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80316/
Test PASSed.





[GitHub] spark issue #18810: [SPARK-21603][sql]The wholestage codegen will be much sl...

2017-08-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18810
  
Merged build finished. Test FAILed.





[GitHub] spark issue #18810: [SPARK-21603][sql]The wholestage codegen will be much sl...

2017-08-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18810
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80318/
Test FAILed.





[GitHub] spark issue #18810: [SPARK-21603][sql]The wholestage codegen will be much sl...

2017-08-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18810
  
**[Test build #80318 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80318/testReport)**
 for PR 18810 at commit 
[`1b0ac5e`](https://github.com/apache/spark/commit/1b0ac5ed896136df3579a61d7ef93980c0647e97).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18830: [SPARK-21621][Core] Reset numRecordsWritten after DiskBl...

2017-08-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18830
  
**[Test build #80316 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80316/testReport)**
 for PR 18830 at commit 
[`d82401d`](https://github.com/apache/spark/commit/d82401d1771009e02a81152b70b4fa48ce077593).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18810: [SPARK-21603][sql]The wholestage codegen will be much sl...

2017-08-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18810
  
**[Test build #80318 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80318/testReport)**
 for PR 18810 at commit 
[`1b0ac5e`](https://github.com/apache/spark/commit/1b0ac5ed896136df3579a61d7ef93980c0647e97).





[GitHub] spark issue #18810: [SPARK-21603][sql]The wholestage codegen will be much sl...

2017-08-06 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18810
  
ok to test





[GitHub] spark pull request #18576: [SPARK-21351][SQL] Update nullability based on ch...

2017-08-06 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/18576#discussion_r131573104
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala
 ---
@@ -94,27 +94,14 @@ case class FilterExec(condition: Expression, child: 
SparkPlan)
 case _ => false
   }
 
-  // If one expression and its children are null intolerant, it is null 
intolerant.
-  private def isNullIntolerant(expr: Expression): Boolean = expr match {
-case e: NullIntolerant => e.children.forall(isNullIntolerant)
-case _ => false
-  }
-
-  // The columns that will filtered out by `IsNotNull` could be considered 
as not nullable.
-  private val notNullAttributes = 
notNullPreds.flatMap(_.references).distinct.map(_.exprId)
-
   // Mark this as empty. We'll evaluate the input during doConsume(). We 
don't want to evaluate
   // all the variables at the beginning to take advantage of short 
circuiting.
   override def usedInputs: AttributeSet = AttributeSet.empty
 
+  // Since some plan rewrite rules (e.g., python.ExtractPythonUDFs) 
possibly change child's output
+  // from optimized logical plans, we need to adjust the filter's output 
here.
   override def output: Seq[Attribute] = {
-child.output.map { a =>
-  if (a.nullable && notNullAttributes.contains(a.exprId)) {
-a.withNullability(false)
-  } else {
-a
-  }
-}
+child.output.map { attr => outputAttrs.find(_.exprId == 
attr.exprId).getOrElse(attr) }
   }
--- End diff --

I tried simply dropping the nullability update and reusing the output 
attributes `outputAttrs` from the optimized logical plan here, but some Python 
tests failed (all the Scala tests passed). I investigated and found that, in 
the Python planner path, there are cases where an operator's output changes 
between the optimized logical plan and the physical plan.
For example:
```
sql("""SELECT strlen(a) FROM test WHERE strlen(a) > 1""")

// pyspark
>>> spark.sql("SELECT strlen(a) FROM test WHERE strlen(a) > 
1").explain(True)
...
== Optimized Logical Plan ==
Project [strlen(a#0) AS strlen(a)#30]
+- Filter (strlen(a#0) > 1)
   +- LogicalRDD [a#0]

== Physical Plan ==
*Project [pythonUDF0#34 AS strlen(a)#30]
+- BatchEvalPython [strlen(a#0)], [a#0, pythonUDF0#34]
   +- *Filter (pythonUDF0#33 > 1), [a#0]
  +- BatchEvalPython [strlen(a#0)], [a#0, pythonUDF0#33]
 +- Scan ExistingRDD[a#0]
```
So, I added code to check for differences between `outputAttrs` and 
`child.output`.
Could you give me some insight on this? @gatorsmile 
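
The matching logic being discussed can be sketched in plain Python (attributes modeled as `(expr_id, name, nullable)` tuples; a simplified model of the `FilterExec.output` code in the diff above, not Spark itself):

```python
def adjust_output(child_output, output_attrs):
    """For each child attribute, prefer the attribute with the same exprId
    from the optimized logical plan's output; fall back to the child's own
    attribute when no match exists (e.g. extra pythonUDF columns injected
    by plan rewrite rules). A sketch, not Spark's actual implementation."""
    by_expr_id = {attr[0]: attr for attr in output_attrs}
    return [by_expr_id.get(attr[0], attr) for attr in child_output]
```

This keeps the non-null information from the optimized plan while tolerating columns that only appear in the physical plan.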





[GitHub] spark issue #18862: [SPARK-21640][FOLLOW-UP] added errorifexists on IllegalA...

2017-08-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18862
  
**[Test build #80317 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80317/testReport)**
 for PR 18862 at commit 
[`592ab60`](https://github.com/apache/spark/commit/592ab60742497e5c8157b19bb03a0315e90fb039).





[GitHub] spark issue #18862: [SPARK-21640][FOLLOW-UP] added errorifexists on IllegalA...

2017-08-06 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18862
  
ok to test





[GitHub] spark issue #17583: [SPARK-20271]Add FuncTransformer to simplify custom tran...

2017-08-06 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/17583
  
A gentle ping since I think this is quite helpful. 
@jkbradley @MLnick @yanboliang @srowen @holdenk 





[GitHub] spark pull request #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to r...

2017-08-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18820#discussion_r131572498
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1393,6 +1393,16 @@ def replace(self, to_replace, value=None, 
subset=None):
 |null|  null| null|
 ++--+-+
 
+>>> df4.na.replace('Alice', None).show()
--- End diff --

I assume we removed the `dataframe.replace` doctest to promote the use of 
`dataframe.na.replace`? The doc says they are aliases anyway. I'm not sure, 
but I tend to agree with paring down the doc tests, and this looks like it was 
removed in 
https://github.com/apache/spark/commit/ff26767c03cc76e7e86b238300367fa0d9b3e6a4.

Let's leave this as is for now. I don't want to make this PR complicated.






[GitHub] spark pull request #18733: [SPARK-21535][ML]Reduce memory requirement for Cr...

2017-08-06 Thread hhbyyh
Github user hhbyyh closed the pull request at:

https://github.com/apache/spark/pull/18733





[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-06 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/18861#discussion_r131571620
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -753,6 +753,16 @@ case class Repartition(numPartitions: Int, shuffle: 
Boolean, child: LogicalPlan)
 }
 
 /**
+ * Returns a new RDD that has at most `numPartitions` partitions. This 
behavior can be modified by
+ * supplying a `PartitionCoalescer` to control the behavior of the 
partitioning.
+ */
+case class PartitionCoalesce(numPartitions: Int, coalescer: 
PartitionCoalescer, child: LogicalPlan)
+  extends UnaryNode {
--- End diff --

Yes, I think so. I'll try; please give me a few days to do so.





[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-06 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/18861#discussion_r131571547
  
--- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
@@ -1185,23 +1194,21 @@ class SizeBasedCoalescer(val maxSize: Int) extends 
PartitionCoalescer with Seria
   totalSum += splitSize
 }
 
-while (index < partitions.size) {
+while (index < partitions.length) {
   val partition = partitions(index)
-  val fileSplit =
-
partition.asInstanceOf[HadoopPartition].inputSplit.value.asInstanceOf[FileSplit]
-  val splitSize = fileSplit.getLength
+  val splitSize = getPartitionSize(partition)
   if (currentSum + splitSize < maxSize) {
 addPartition(partition, splitSize)
 index += 1
-if (index == partitions.size) {
-  updateGroups
+if (index == partitions.length) {
+  updateGroups()
 }
   } else {
-if (currentGroup.partitions.size == 0) {
+if (currentGroup.partitions.isEmpty) {
   addPartition(partition, splitSize)
   index += 1
 } else {
-  updateGroups
+  updateGroups()
--- End diff --

OK, I'll drop these from this PR.
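
For context, the size-based grouping loop in the `SizeBasedCoalescer` diff above can be sketched in plain Python (indices in, lists of indices out; a simplified model of the test helper, not the actual Scala code):

```python
def size_based_groups(sizes, max_size):
    """Group consecutive partition sizes so each group's total stays under
    max_size; a partition at or above max_size gets a group of its own.
    A simplified sketch of the SizeBasedCoalescer grouping logic."""
    groups, current, current_sum = [], [], 0
    for i, size in enumerate(sizes):
        if current_sum + size < max_size:
            current.append(i)
            current_sum += size
        elif not current:
            # Oversized partition: emit it alone.
            groups.append([i])
        else:
            # Close the current group and retry this partition in a new one.
            groups.append(current)
            current, current_sum = [i], size
    if current:
        groups.append(current)
    return groups
```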





[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18861#discussion_r131571449
  
--- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
@@ -1185,23 +1194,21 @@ class SizeBasedCoalescer(val maxSize: Int) extends 
PartitionCoalescer with Seria
   totalSum += splitSize
 }
 
-while (index < partitions.size) {
+while (index < partitions.length) {
   val partition = partitions(index)
-  val fileSplit =
-
partition.asInstanceOf[HadoopPartition].inputSplit.value.asInstanceOf[FileSplit]
-  val splitSize = fileSplit.getLength
+  val splitSize = getPartitionSize(partition)
   if (currentSum + splitSize < maxSize) {
 addPartition(partition, splitSize)
 index += 1
-if (index == partitions.size) {
-  updateGroups
+if (index == partitions.length) {
+  updateGroups()
 }
   } else {
-if (currentGroup.partitions.size == 0) {
+if (currentGroup.partitions.isEmpty) {
   addPartition(partition, splitSize)
   index += 1
 } else {
-  updateGroups
+  updateGroups()
--- End diff --

I am fine with this, but it might confuse others. Maybe just remove these 
changes from this PR? You can submit a separate PR later.





[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-06 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/18861#discussion_r131571248
  
--- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
@@ -1185,23 +1194,21 @@ class SizeBasedCoalescer(val maxSize: Int) extends 
PartitionCoalescer with Seria
   totalSum += splitSize
 }
 
-while (index < partitions.size) {
+while (index < partitions.length) {
   val partition = partitions(index)
-  val fileSplit =
-
partition.asInstanceOf[HadoopPartition].inputSplit.value.asInstanceOf[FileSplit]
-  val splitSize = fileSplit.getLength
+  val splitSize = getPartitionSize(partition)
   if (currentSum + splitSize < maxSize) {
 addPartition(partition, splitSize)
 index += 1
-if (index == partitions.size) {
-  updateGroups
+if (index == partitions.length) {
+  updateGroups()
 }
   } else {
-if (currentGroup.partitions.size == 0) {
+if (currentGroup.partitions.isEmpty) {
   addPartition(partition, splitSize)
   index += 1
 } else {
-  updateGroups
+  updateGroups()
--- End diff --

Yes, I just kept the original author's changes (probably refactoring?)... 
would it be better to remove them?





[GitHub] spark pull request #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to r...

2017-08-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18820#discussion_r131571005
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1393,6 +1393,16 @@ def replace(self, to_replace, value=None, 
subset=None):
 |null|  null| null|
 ++--+-+
 
+>>> df4.na.replace('Alice', None).show()
--- End diff --

I hadn't noticed that. Why do we test `dataframe.na.replace` in the doc test 
of `dataframe.replace`? We should test `dataframe.replace` here.





[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18861#discussion_r131570851
  
--- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
@@ -1185,23 +1194,21 @@ class SizeBasedCoalescer(val maxSize: Int) extends 
PartitionCoalescer with Seria
   totalSum += splitSize
 }
 
-while (index < partitions.size) {
+while (index < partitions.length) {
   val partition = partitions(index)
-  val fileSplit =
-
partition.asInstanceOf[HadoopPartition].inputSplit.value.asInstanceOf[FileSplit]
-  val splitSize = fileSplit.getLength
+  val splitSize = getPartitionSize(partition)
   if (currentSum + splitSize < maxSize) {
 addPartition(partition, splitSize)
 index += 1
-if (index == partitions.size) {
-  updateGroups
+if (index == partitions.length) {
+  updateGroups()
 }
   } else {
-if (currentGroup.partitions.size == 0) {
+if (currentGroup.partitions.isEmpty) {
   addPartition(partition, splitSize)
   index += 1
 } else {
-  updateGroups
+  updateGroups()
--- End diff --

All the above changes are not related to this PR, right?





[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18861#discussion_r131570879
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -753,6 +753,16 @@ case class Repartition(numPartitions: Int, shuffle: 
Boolean, child: LogicalPlan)
 }
 
 /**
+ * Returns a new RDD that has at most `numPartitions` partitions. This 
behavior can be modified by
+ * supplying a `PartitionCoalescer` to control the behavior of the 
partitioning.
+ */
+case class PartitionCoalesce(numPartitions: Int, coalescer: 
PartitionCoalescer, child: LogicalPlan)
+  extends UnaryNode {
--- End diff --

Adding new logical nodes also requires updates in multiple components (e.g., 
the Optimizer).

Is it possible to reuse the existing node `Repartition`?
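
One way to reuse the existing node, sketched in plain Python (hypothetical field names, not Spark's actual Scala plan classes), is to add an optional coalescer field with a default so existing call sites stay valid:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Repartition:
    """Sketch of extending the existing Repartition node with an optional
    coalescer instead of introducing a new PartitionCoalesce node.
    A hypothetical model, not Spark's actual plan classes."""
    num_partitions: int
    shuffle: bool
    child: Any
    coalescer: Optional[Any] = None  # None preserves the current behavior
```

Because the new field defaults to `None`, optimizer rules that pattern-match on the existing fields need not change.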





[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-06 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/18861#discussion_r131570565
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala
 ---
@@ -571,7 +570,8 @@ case class UnionExec(children: Seq[SparkPlan]) extends 
SparkPlan {
  * current upstream partitions will be executed in parallel (per whatever
  * the current partitioning is).
  */
-case class CoalesceExec(numPartitions: Int, child: SparkPlan) extends 
UnaryExecNode {
+case class CoalesceExec(numPartitions: Int, child: SparkPlan, coalescer: 
Option[PartitionCoalescer])
--- End diff --

ok!





[GitHub] spark issue #18576: [SPARK-21351][SQL] Update nullability based on children'...

2017-08-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18576
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80315/
Test PASSed.





[GitHub] spark issue #18576: [SPARK-21351][SQL] Update nullability based on children'...

2017-08-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18576
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18576: [SPARK-21351][SQL] Update nullability based on children'...

2017-08-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18576
  
**[Test build #80315 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80315/testReport)**
 for PR 18576 at commit 
[`5d2fd6d`](https://github.com/apache/spark/commit/5d2fd6db8dc4130a948e5bb4d09fe0f776d16145).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class FilterExec(condition: Expression, child: SparkPlan, 
outputAttrs: Seq[Attribute])`





[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18861#discussion_r131570472
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala
 ---
@@ -571,7 +570,8 @@ case class UnionExec(children: Seq[SparkPlan]) extends 
SparkPlan {
  * current upstream partitions will be executed in parallel (per whatever
  * the current partitioning is).
  */
-case class CoalesceExec(numPartitions: Int, child: SparkPlan) extends 
UnaryExecNode {
+case class CoalesceExec(numPartitions: Int, child: SparkPlan, coalescer: 
Option[PartitionCoalescer])
--- End diff --

Could you add the param description of `coalescer`? Also, could you update the 
function descriptions? Thanks!





[GitHub] spark pull request #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to r...

2017-08-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18820#discussion_r131570273
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameNaFunctionsSuite.scala ---
@@ -261,5 +261,18 @@ class DataFrameNaFunctionsSuite extends QueryTest with 
SharedSQLContext {
 assert(out1(3).get(2).asInstanceOf[Double].isNaN)
 assert(out1(4) === Row("Amy", null, null))
 assert(out1(5) === Row(null, null, null))
+
+// Replace with null
+val out2 = input.na.replace("name", Map(
+  "Bob" -> "Bravo",
+  "Alice" -> null
--- End diff --

Agree. Please try to improve the test case coverage. Thanks!





[GitHub] spark issue #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to replace ...

2017-08-06 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18820
  
cc @ueshin Could you also take a look the code changes in the Python side? 
Thanks!





[GitHub] spark issue #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to replace ...

2017-08-06 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18820
  
Could you also add a test case to cover the end-to-end use case the JIRA 
mentioned? Also put it in the PR description, which will be part of the PR 
commit. Thanks!





[GitHub] spark pull request #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to r...

2017-08-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18820#discussion_r131570031
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala ---
@@ -314,6 +316,7 @@ final class DataFrameNaFunctions private[sql](df: 
DataFrame) {
* (Scala-specific) Replaces values matching keys in `replacement` map.
* Key and value of `replacement` map must have the same type, and
* can only be doubles, strings or booleans.
+   * `replacement` map value can have null.
--- End diff --

Do not put it here. It should go in `@param`.





[GitHub] spark pull request #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to r...

2017-08-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18820#discussion_r131569954
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala ---
@@ -366,11 +370,15 @@ final class DataFrameNaFunctions private[sql](df: 
DataFrame) {
   return df
 }
 
-// replacementMap is either Map[String, String] or Map[Double, Double] 
or Map[Boolean,Boolean]
-val replacementMap: Map[_, _] = replacement.head._2 match {
-  case v: String => replacement
-  case v: Boolean => replacement
-  case _ => replacement.map { case (k, v) => (convertToDouble(k), 
convertToDouble(v)) }
+// replacementMap is either Map[String, String], Map[Double, Double], 
Map[Boolean,Boolean]
+// while value can have null
--- End diff --

If the types are not these three types, what are the behaviors? Could you 
explain them here? Also, please add negative examples too. Thanks~
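For illustration, the dispatch the comment above asks about can be sketched in plain Python. This is a hypothetical helper, not Spark's actual `DataFrameNaFunctions` code; the function name and error messages are invented:

```python
def normalize_replacement(replacement):
    """Hypothetical sketch (not Spark's code) of the dispatch above.

    Mirrors `replacement.head._2 match`: the first non-None value decides
    the map's type. Strings and booleans pass through; numerics are widened
    to double (None values are kept); anything else is rejected up front.
    """
    values = [v for v in replacement.values() if v is not None]
    sample = values[0] if values else next(iter(replacement))
    if isinstance(sample, (bool, str)):
        return dict(replacement)
    if isinstance(sample, (int, float)):
        return {float(k): (None if v is None else float(v))
                for k, v in replacement.items()}
    raise ValueError("unsupported replacement type: %s" % type(sample).__name__)
```

One negative example: a map of tuples falls through both branches and is rejected, which is the kind of behavior the review asks to document.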





[GitHub] spark pull request #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to r...

2017-08-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18820#discussion_r131569819
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala ---
@@ -145,8 +145,8 @@ class DataTypeSuite extends SparkFunSuite {
 val message = intercept[SparkException] {
   left.merge(right)
 }.getMessage
-assert(message.equals("Failed to merge fields 'b' and 'b'. " +
-  "Failed to merge incompatible data types FloatType and LongType"))
+assert(message === "Failed to merge fields 'b' and 'b'. " +
--- End diff --

Nit: not related to this PR. Please revert it back. 





[GitHub] spark issue #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to replace ...

2017-08-06 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18820
  
@bravo-zhang Could you update the PR description to explain what this PR is 
trying to achieve? So far, it is not clear enough to explain what you did in 
this PR. Thanks!





[GitHub] spark pull request #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to r...

2017-08-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18820#discussion_r131569456
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1393,6 +1393,16 @@ def replace(self, to_replace, value=None, 
subset=None):
 |null|  null| null|
 ++--+-+
 
+>>> df4.na.replace('Alice', None).show()
--- End diff --

I guess it is `.na.replace` vs `.replace`. I think both should be the same 
though. I just built against this PR and double checked as below:

```python
>>> df = spark.createDataFrame([('Alice', 10, 80.0)])
```
```python
>>> df.replace("Alice").first()
```
```
Row(_1=None, _2=10, _3=80.0)
```
```python
>>> df.na.replace("Alice").first()
```

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: replace() takes at least 3 arguments (2 given)
```
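The asymmetry in the traceback can be reproduced with two plain-Python stubs whose signatures approximate the PySpark 2.2-era methods. The names and bodies below are illustrative stand-ins, not the real implementations:

```python
def df_replace(to_replace, value=None, subset=None):
    # Stand-in for DataFrame.replace: `value` defaults to None,
    # so a single positional argument means "replace with null".
    return ("DataFrame.replace", to_replace, value)

def na_replace(to_replace, value, subset=None):
    # Stand-in for DataFrameNaFunctions.replace: `value` is required,
    # so the one-argument call in the traceback above fails.
    return ("na.replace", to_replace, value)

print(df_replace("Alice"))      # accepted: value defaults to None
try:
    na_replace("Alice")
except TypeError as exc:
    print("na.replace rejected:", exc)
```

Making both behave the same, as suggested above, amounts to giving `value` the same `None` default in both signatures.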






[GitHub] spark pull request #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to r...

2017-08-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18820#discussion_r131569039
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1393,6 +1393,16 @@ def replace(self, to_replace, value=None, 
subset=None):
 |null|  null| null|
 ++--+-+
 
+>>> df4.na.replace('Alice', None).show()
--- End diff --

This change allows us to do `df4.na.replace('Alice')`. I think SPARK-19454 
doesn't?





[GitHub] spark pull request #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to r...

2017-08-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18820#discussion_r131568706
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1393,6 +1393,16 @@ def replace(self, to_replace, value=None, 
subset=None):
 |null|  null| null|
 ++--+-+
 
+>>> df4.na.replace('Alice', None).show()
--- End diff --

And I was thinking of not doing this here, as strictly it should be a follow-up 
for SPARK-19454. I am fine with doing it here too while we are at it.





[GitHub] spark issue #18468: [SPARK-20783][SQL] Create CachedBatchColumnVector to abs...

2017-08-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18468
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18468: [SPARK-20783][SQL] Create CachedBatchColumnVector to abs...

2017-08-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18468
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80314/
Test PASSed.





[GitHub] spark issue #18468: [SPARK-20783][SQL] Create CachedBatchColumnVector to abs...

2017-08-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18468
  
**[Test build #80314 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80314/testReport)**
 for PR 18468 at commit 
[`a26dc15`](https://github.com/apache/spark/commit/a26dc150f6b95cc42558561cd2548de04a89f041).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to r...

2017-08-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18820#discussion_r131568383
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1393,6 +1393,16 @@ def replace(self, to_replace, value=None, 
subset=None):
 |null|  null| null|
 ++--+-+
 
+>>> df4.na.replace('Alice', None).show()
--- End diff --

Actually, I think this should be something to be fixed in 
`DataFrameNaFunctions.replace` in this file ...





[GitHub] spark issue #18769: [SPARK-21574][SQL] Point out user to set hive config bef...

2017-08-06 Thread wangyum
Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/18769
  
@gatorsmile The docs syntax issues were fixed by 
https://github.com/apache/spark/pull/18793.





[GitHub] spark pull request #18833: [SPARK-21625][SQL] sqrt(negative number) should b...

2017-08-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18833#discussion_r131568132
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/MathExpressionsSuite.scala
 ---
@@ -403,11 +403,13 @@ class MathExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 
   test("sqrt") {
 testUnary(Sqrt, math.sqrt, (0 to 20).map(_ * 0.1))
-testUnary(Sqrt, math.sqrt, (-5 to -1).map(_ * 1.0), expectNaN = true)
+testUnary(Sqrt, math.sqrt, (-5 to -1).map(_ * 1.0), expectNull = true)
--- End diff --

We have `IsNaN`. So users might already use it to check those invalid 
values.





[GitHub] spark pull request #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to r...

2017-08-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18820#discussion_r131567901
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1393,6 +1393,16 @@ def replace(self, to_replace, value=None, 
subset=None):
 |null|  null| null|
 ++--+-+
 
+>>> df4.na.replace('Alice', None).show()
--- End diff --

Looks like now we allow something like `df4.na.replace('Alice').show()`. 
We'd better add it here.






[GitHub] spark pull request #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to r...

2017-08-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18820#discussion_r131567720
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -1964,6 +1964,16 @@ def test_replace(self):
.replace(False, True).first())
 self.assertTupleEqual(row, (True, True))
 
+# replace with None
+row = self.spark.createDataFrame(
+[(u'Alice', 10, 80.0)], schema).replace(u'Alice', None).first()
+self.assertTupleEqual(row, (None, 10, 80.0))
+
+# replace with numerics and None
+row = self.spark.createDataFrame(
+[(u'Alice', 10, 80.0)], schema).replace([10, 80], [20, 
None]).first()
+self.assertTupleEqual(row, (u'Alice', 20, None))
--- End diff --

Can you add a test where `to_replace` is a list and `value` is not given (so it 
takes the default value `None`)? Previously this would raise a `ValueError`, but now 
it is a valid usage. We should add an explicit test for it.
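The behavior such a test would pin down can be sketched in plain Python (a hypothetical normalization helper, not PySpark's actual code):

```python
def normalize_args(to_replace, value=None):
    """Sketch of the argument handling under discussion: a list
    `to_replace` with `value` omitted now means 'replace each with
    null' instead of raising ValueError."""
    if isinstance(to_replace, (list, tuple)):
        if value is None:
            # New behavior: default every replacement value to null.
            value = [None] * len(to_replace)
        if len(to_replace) != len(value):
            raise ValueError("to_replace and value lists must have equal length")
        return dict(zip(to_replace, value))
    return {to_replace: value}
```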





[GitHub] spark issue #18863: [SPARK-21647] [SQL] Fix SortMergeJoin when using CROSS

2017-08-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18863
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80313/
Test PASSed.





[GitHub] spark issue #18863: [SPARK-21647] [SQL] Fix SortMergeJoin when using CROSS

2017-08-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18863
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18863: [SPARK-21647] [SQL] Fix SortMergeJoin when using CROSS

2017-08-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18863
  
**[Test build #80313 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80313/testReport)**
 for PR 18863 at commit 
[`f351fb1`](https://github.com/apache/spark/commit/f351fb1cbda8104f4f7e6ffa0be07f26b290683e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #18833: [SPARK-21625][SQL] sqrt(negative number) should b...

2017-08-06 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/18833#discussion_r131566961
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/MathExpressionsSuite.scala
 ---
@@ -403,11 +403,13 @@ class MathExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 
   test("sqrt") {
 testUnary(Sqrt, math.sqrt, (0 to 20).map(_ * 0.1))
-testUnary(Sqrt, math.sqrt, (-5 to -1).map(_ * 1.0), expectNaN = true)
+testUnary(Sqrt, math.sqrt, (-5 to -1).map(_ * 1.0), expectNull = true)
--- End diff --

Users commonly use `is not null` to filter out invalid values, but Spark SQL breaks that expectation:
```
 > create table spark_21625 as select 10 as c1, sqrt(-10) as c2;
spark-sql> select * from spark_21625;
10  NaN 

spark-sql> select * from spark_21625 where c2 is not null;
10  NaN
```
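The underlying distinction — NaN is a legitimate, non-null double value, so an `IS NOT NULL` predicate cannot filter it — shows up the same way in plain Python:

```python
import math

nan = float("nan")
rows = [10.0, nan, 3.0]

# NaN is not null: an IS-NOT-NULL-style filter keeps it.
assert nan is not None
# NaN is also unequal to everything, including itself, so ordinary
# equality predicates cannot catch it either.
assert nan != nan

not_null = [v for v in rows if v is not None]     # keeps the NaN
no_nan = [v for v in rows if not math.isnan(v)]   # needs an explicit IsNaN-style check

print(len(not_null), len(no_nan))  # → 3 2
```

This is why only an explicit `IsNaN`-style check (or returning null in the first place, as this PR proposes) lets such filters behave as users expect.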





[GitHub] spark pull request #18833: [SPARK-21625][SQL] sqrt(negative number) should b...

2017-08-06 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/18833#discussion_r131566102
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/MathExpressionsSuite.scala
 ---
@@ -403,11 +403,13 @@ class MathExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 
   test("sqrt") {
 testUnary(Sqrt, math.sqrt, (0 to 20).map(_ * 0.1))
-testUnary(Sqrt, math.sqrt, (-5 to -1).map(_ * 1.0), expectNaN = true)
+testUnary(Sqrt, math.sqrt, (-5 to -1).map(_ * 1.0), expectNull = true)
--- End diff --

Yes, we migrated Hive and MySQL SQL workloads to Spark and found some 
inconsistencies. `NaN` is unfamiliar to MySQL and Oracle users.





[GitHub] spark pull request #18833: [SPARK-21625][SQL] sqrt(negative number) should b...

2017-08-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18833#discussion_r131566009
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/MathExpressionsSuite.scala
 ---
@@ -403,11 +403,13 @@ class MathExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 
   test("sqrt") {
 testUnary(Sqrt, math.sqrt, (0 to 20).map(_ * 0.1))
-testUnary(Sqrt, math.sqrt, (-5 to -1).map(_ * 1.0), expectNaN = true)
+testUnary(Sqrt, math.sqrt, (-5 to -1).map(_ * 1.0), expectNull = true)
--- End diff --

Yea, I was writing this comment. Since `NaN` makes sense in a way, I was 
thinking we couldn't consider this case a bug that should be fixed.





[GitHub] spark pull request #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to r...

2017-08-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18820#discussion_r131566016
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameNaFunctionsSuite.scala ---
@@ -261,5 +261,18 @@ class DataFrameNaFunctionsSuite extends QueryTest with 
SharedSQLContext {
 assert(out1(3).get(2).asInstanceOf[Double].isNaN)
 assert(out1(4) === Row("Amy", null, null))
 assert(out1(5) === Row(null, null, null))
+
+// Replace with null
+val out2 = input.na.replace("name", Map(
+  "Bob" -> "Bravo",
+  "Alice" -> null
--- End diff --

I saw you allow a replacement like `(k: Double, null)`. Can you also add a 
test for such a replacement? Thanks.





[GitHub] spark pull request #18833: [SPARK-21625][SQL] sqrt(negative number) should b...

2017-08-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18833#discussion_r131565248
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/MathExpressionsSuite.scala
 ---
@@ -403,11 +403,13 @@ class MathExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 
   test("sqrt") {
 testUnary(Sqrt, math.sqrt, (0 to 20).map(_ * 0.1))
-testUnary(Sqrt, math.sqrt, (-5 to -1).map(_ * 1.0), expectNaN = true)
+testUnary(Sqrt, math.sqrt, (-5 to -1).map(_ * 1.0), expectNull = true)
--- End diff --

Looks like you're changing the NaN results of many math expressions to null. 
I'm not sure we can make a change like this that breaks compatibility.





[GitHub] spark issue #18813: [SPARK-21567][SQL] Dataset should work with type alias

2017-08-06 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/18813
  
ping @cloud-fan @hvanhovell Can you help to review this change? Thanks.





[GitHub] spark issue #18641: [SPARK-21413][SQL] Fix 64KB JVM bytecode limit problem i...

2017-08-06 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/18641
  
ping @cloud-fan





[GitHub] spark issue #18853: [SPARK-21646][SQL] BinaryComparison shouldn't auto cast ...

2017-08-06 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/18853
  
How about casting the `int` values to `string` ones in the case you 
described in the description, and then comparing them in lexicographical 
order?





[GitHub] spark issue #18474: [SPARK-21235][TESTS] UTest should clear temp results whe...

2017-08-06 Thread wangjiaochun
Github user wangjiaochun commented on the issue:

https://github.com/apache/spark/pull/18474
  
Yes, running this on Windows 7.





[GitHub] spark issue #18710: [SPARK][Docs] Added note on meaning of position to subst...

2017-08-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18710
  
gentle ping @maclockard.





[GitHub] spark pull request #18111: [SPARK-20886][CORE] HadoopMapReduceCommitProtocol...

2017-08-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18111#discussion_r131562253
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
 ---
@@ -73,7 +73,10 @@ class HadoopMapReduceCommitProtocol(jobId: String, path: 
String)
 
 val stagingDir: String = committer match {
   // For FileOutputCommitter it has its own staging path called "work 
path".
-  case f: FileOutputCommitter => 
Option(f.getWorkPath.toString).getOrElse(path)
+  case f: FileOutputCommitter =>
+val workPath = f.getWorkPath
+require(workPath != null, s"Committer has no workpath $f")
+Option(workPath.toString).getOrElse(path)
--- End diff --

I wonder about the answer to this question ^ actually. Wouldn't 
`Option(...).getOrElse(path)` be unnecessary?
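For what it's worth, here is the redundancy sketched in plain Python (hypothetical helper names, not the actual Spark code): once the null check has passed, the `getOrElse`-style fallback can never be taken.

```python
def staging_dir(work_path, path):
    # Mirrors the reviewed Scala shape: require(workPath != null, ...)
    # followed by Option(workPath.toString).getOrElse(path). After the
    # check below, str(work_path) is always non-None, so the fallback
    # to `path` is dead code.
    if work_path is None:
        raise ValueError("Committer has no workpath")
    result = str(work_path)
    return result if result is not None else path  # fallback is unreachable
```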





[GitHub] spark issue #18830: [SPARK-21621][Core] Reset numRecordsWritten after DiskBl...

2017-08-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18830
  
**[Test build #80316 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80316/testReport)**
 for PR 18830 at commit 
[`d82401d`](https://github.com/apache/spark/commit/d82401d1771009e02a81152b70b4fa48ce077593).





[GitHub] spark issue #18474: [SPARK-21235][TESTS] UTest should clear temp results whe...

2017-08-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18474
  
@wangjiaochun Are you running this on Windows?





[GitHub] spark issue #18830: [SPARK-21621][Core] Reset numRecordsWritten after DiskBl...

2017-08-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18830
  
ok to test





[GitHub] spark issue #18791: [SPARK-21571][Scheduler] Spark history server leaves inc...

2017-08-06 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/18791
  
Yeah, I'm just wondering whether we can find an approach solid enough that 
we can confidently turn it on by default.





[GitHub] spark issue #18576: [SPARK-21351][SQL] Update nullability based on children'...

2017-08-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18576
  
**[Test build #80315 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80315/testReport)**
 for PR 18576 at commit 
[`5d2fd6d`](https://github.com/apache/spark/commit/5d2fd6db8dc4130a948e5bb4d09fe0f776d16145).





[GitHub] spark issue #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to replace ...

2017-08-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18820
  
Other than the few comments above, LGTM. Any other comments?





[GitHub] spark pull request #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to r...

2017-08-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18820#discussion_r131559076
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1446,7 +1457,7 @@ def all_of_(xs):
 if isinstance(to_replace, (float, int, long, basestring)):
 to_replace = [to_replace]
 
-if isinstance(value, (float, int, long, basestring)):
+if isinstance(value, (float, int, long, basestring)) or value is 
None:
--- End diff --

This looks like it causes the warning to always fire:

```python
>>> df = sc.parallelize([("Alice", 1, 3.0)]).toDF()
>>> df.replace({"Alice": "Bob"}).show()
```
```
.../spark/python/pyspark/sql/dataframe.py:1466: UserWarning: to_replace is 
a dict and value is not None. value will be ignored.
  warnings.warn("to_replace is a dict and value is not None. value will be 
ignored.")
...
```

Could we move this line to 1468L (under `else`)? 
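To sketch the suggestion (a hypothetical condensed version of the logic, not the actual pyspark source): the warning should only fire when a dict `to_replace` is combined with an explicit non-None `value`, so a plain dict replacement stays silent.

```python
import warnings

def check_replace_args(to_replace, value):
    # Hypothetical sketch of the review suggestion: warn only when the
    # caller passed a dict AND a value that will be ignored; when value
    # is None (the default), no warning is emitted.
    if isinstance(to_replace, dict):
        if value is not None:
            warnings.warn("to_replace is a dict and value is not None. "
                          "value will be ignored.")
        return None  # value is ignored for dict replacements
    return value
```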





[GitHub] spark pull request #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to r...

2017-08-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18820#discussion_r131559178
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1460,7 +1471,8 @@ def all_of_(xs):
 subset = [subset]
 
 # Verify we were not passed in mixed type generics."
--- End diff --

While we are here, let's remove this `"` at the end, which looks like a typo.





[GitHub] spark issue #18468: [SPARK-20783][SQL] Create CachedBatchColumnVector to abs...

2017-08-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18468
  
**[Test build #80314 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80314/testReport)**
 for PR 18468 at commit 
[`a26dc15`](https://github.com/apache/spark/commit/a26dc150f6b95cc42558561cd2548de04a89f041).





[GitHub] spark issue #18863: [SPARK-21647] [SQL] Fix SortMergeJoin when using CROSS

2017-08-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18863
  
**[Test build #80313 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80313/testReport)**
 for PR 18863 at commit 
[`f351fb1`](https://github.com/apache/spark/commit/f351fb1cbda8104f4f7e6ffa0be07f26b290683e).





[GitHub] spark issue #18863: [SPARK-21647] [SQL] Fix SortMergeJoin when using CROSS

2017-08-06 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18863
  
cc @cloud-fan @BoleynSu @hvanhovell 




