[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-14 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/19497
  
Thanks for taking a deeper look @HyukjinKwon, much appreciated!
I will wait for @jiangxb1987 to also opine before committing. I want to 
make sure we are not adding incorrect behavior, given that this is a follow-up 
to an earlier PR (some excellent work by @szhem, btw).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19501
  
**[Test build #82768 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82768/testReport)**
 for PR 19501 at commit 
[`c845627`](https://github.com/apache/spark/commit/c84562763034e3fc6a7ddba785131cb4a1c36eb4).


---




[GitHub] spark pull request #19501: [SPARK-22223][SQL] ObjectHashAggregate should not...

2017-10-14 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/19501

[SPARK-22223][SQL] ObjectHashAggregate should not introduce unnecessary shuffle

## What changes were proposed in this pull request?

`ObjectHashAggregateExec` should override `outputPartitioning` in order to 
avoid unnecessary shuffle.

## How was this patch tested?

Added Jenkins test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 SPARK-3

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19501.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19501


commit c84562763034e3fc6a7ddba785131cb4a1c36eb4
Author: Liang-Chi Hsieh 
Date:   2017-10-15T06:02:59Z

ObjectHashAggregate should not introduce unnecessary shuffle.




---




[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19497
  
@mridulm, I just checked through the related changes and confirmed that the 
tests pass on branch-2.1.

It seems this PR will also allow the cases below:

```scala
.saveAsNewAPIHadoopFile[...]("")
.saveAsNewAPIHadoopFile[...]("::invalid:::")
```

Currently both fail with the errors below, but it seems this PR would allow them:

```
Can not create a Path from an empty string
java.lang.IllegalArgumentException: Can not create a Path from an empty string
    at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
    at org.apache.hadoop.fs.Path.<init>(Path.java:135)
    at org.apache.hadoop.fs.Path.<init>(Path.java:89)
    at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:61)
    ...
```

```
java.net.URISyntaxException: Relative path in absolute URI: ::invalid:::
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: ::invalid:::
    at org.apache.hadoop.fs.Path.initialize(Path.java:206)
    at org.apache.hadoop.fs.Path.<init>(Path.java:172)
    at org.apache.hadoop.fs.Path.<init>(Path.java:89)
    at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:61)
    ...
```

I think we should keep guarding against these cases like this.

For the same cases with the old API:

```scala
.saveAsHadoopFile[...]("")
.saveAsHadoopFile[...]("::invalid:::")
```

these look like they fail fast (whether or not that was originally intended), 
and I guess this PR does not affect them:

```
Can not create a Path from an empty string
java.lang.IllegalArgumentException: Can not create a Path from an empty string
    at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
    at org.apache.hadoop.fs.Path.<init>(Path.java:135)
    at org.apache.spark.internal.io.SparkHadoopWriterUtils$.createPathFromString(SparkHadoopWriterUtils.scala:54)
```

```
java.net.URISyntaxException: Relative path in absolute URI: ::invalid:::
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: ::invalid:::
    at org.apache.hadoop.fs.Path.initialize(Path.java:206)
    at org.apache.hadoop.fs.Path.<init>(Path.java:172)
    at org.apache.spark.internal.io.SparkHadoopWriterUtils$.createPathFromString(SparkHadoopWriterUtils.scala:54)
```


---




[GitHub] spark issue #19500: [SPARK-22280][SQL][TEST] Improve StatisticsSuite to test...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19500
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19500: [SPARK-22280][SQL][TEST] Improve StatisticsSuite to test...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19500
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82767/
Test PASSed.


---




[GitHub] spark issue #19500: [SPARK-22280][SQL][TEST] Improve StatisticsSuite to test...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19500
  
**[Test build #82767 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82767/testReport)**
 for PR 19500 at commit 
[`2a0a3f1`](https://github.com/apache/spark/commit/2a0a3f1b3f029c2454a471b33fed7766694fa518).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17862
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17862
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82766/
Test PASSed.


---




[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17862
  
**[Test build #82766 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82766/testReport)**
 for PR 17862 at commit 
[`0bb5afe`](https://github.com/apache/spark/commit/0bb5afe54a9a53054d2076ac28b09234a7380bbf).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19471: [SPARK-22245][SQL] partitioned data set should always pu...

2017-10-14 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19471
  
We may need to document this change in the `Migration Guide` section of the 
SQL programming guide.


---




[GitHub] spark issue #19500: [SPARK-22280][SQL][TEST] Improve StatisticsSuite to test...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19500
  
**[Test build #82767 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82767/testReport)**
 for PR 19500 at commit 
[`2a0a3f1`](https://github.com/apache/spark/commit/2a0a3f1b3f029c2454a471b33fed7766694fa518).


---




[GitHub] spark pull request #19499: [SPARK-22279][SQL][WIP] Turn on spark.sql.hive.co...

2017-10-14 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19499#discussion_r144708907
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -937,26 +937,22 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto
   }
 
   test("test statistics of LogicalRelation converted from Hive serde tables") {
--- End diff --

This should be handled in a separate PR, #19500 .
After #19500, I will remove this change on test code from this PR.


---




[GitHub] spark pull request #19500: [SPARK-22280][SQL][TEST] Improve StatisticsSuite ...

2017-10-14 Thread dongjoon-hyun
GitHub user dongjoon-hyun opened a pull request:

https://github.com/apache/spark/pull/19500

[SPARK-22280][SQL][TEST] Improve StatisticsSuite to test `convertMetastore` 
properly

## What changes were proposed in this pull request?

This PR aims to improve **StatisticsSuite** to test the `convertMetastore` 
configuration properly. Currently, some test logic in `test statistics of 
LogicalRelation converted from Hive serde tables` depends on the default 
configuration. The new test case is shorter and covers both cases 
(true/false) explicitly.

## How was this patch tested?

Pass the Jenkins with the improved test case.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjoon-hyun/spark SPARK-22280

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19500.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19500


commit 2a0a3f1b3f029c2454a471b33fed7766694fa518
Author: Dongjoon Hyun 
Date:   2017-10-15T03:38:22Z

[SPARK-22280][SQL][TEST] Improve StatisticsSuite to test `convertMetastore` 
properly




---




[GitHub] spark issue #19499: [SPARK-22279][SQL][WIP] Turn on spark.sql.hive.convertMe...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19499
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82765/
Test PASSed.


---




[GitHub] spark issue #19499: [SPARK-22279][SQL][WIP] Turn on spark.sql.hive.convertMe...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19499
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19499: [SPARK-22279][SQL][WIP] Turn on spark.sql.hive.convertMe...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19499
  
**[Test build #82765 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82765/testReport)**
 for PR 19499 at commit 
[`83cde8b`](https://github.com/apache/spark/commit/83cde8b2fcf1fb12567cd0bf7eef702186234a23).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17862
  
**[Test build #82766 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82766/testReport)**
 for PR 17862 at commit 
[`0bb5afe`](https://github.com/apache/spark/commit/0bb5afe54a9a53054d2076ac28b09234a7380bbf).


---




[GitHub] spark pull request #19452: [SPARK-22136][SS] Evaluate one-sided conditions e...

2017-10-14 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/19452#discussion_r144708446
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala ---
@@ -349,14 +350,35 @@ case class StreamingSymmetricHashJoinExec(
   /**
    * Internal helper class to consume input rows, generate join output rows using other sides
    * buffered state rows, and finally clean up this sides buffered state rows
+   *
+   * @param joinSide The JoinSide - either left or right.
+   * @param inputAttributes The input attributes for this side of the join.
+   * @param joinKeys The join keys.
+   * @param inputIter The iterator of input rows on this side to be joined.
+   * @param preJoinFilterExpr A filter over rows on this side. This filter rejects rows that could
+   *                          never pass the overall join condition no matter what other side row
+   *                          they're joined with.
+   * @param postJoinFilterExpr A filter over joined rows. This filter completes the application of
+   *                           the overall join condition, assuming that preJoinFilter on both sides
+   *                           of the join has already been passed.
+   * @param stateWatermarkPredicate The state watermark predicate. See
+   *                                [[StreamingSymmetricHashJoinExec]] for further description of
+   *                                state watermarks.
    */
   private class OneSideHashJoiner(
       joinSide: JoinSide,
       inputAttributes: Seq[Attribute],
       joinKeys: Seq[Expression],
       inputIter: Iterator[InternalRow],
+      preJoinFilterExpr: Option[Expression],
+      postJoinFilterExpr: Option[Expression],
       stateWatermarkPredicate: Option[JoinStateWatermarkPredicate]) {
 
+    // Filter the joined rows based on the given condition.
+    val preJoinFilter =
+      newPredicate(preJoinFilterExpr.getOrElse(Literal(true)), inputAttributes).eval _
+    val postJoinFilter = newPredicate(postJoinFilterExpr.getOrElse(Literal(true)), output).eval _
--- End diff --

This is incorrect. The schema of the rows on which this filter will be 
applied is `left.output ++ right.output`. You need to apply another projection 
to put the JoinedRow in an UnsafeRow of the schema `output`.


---




[GitHub] spark issue #19499: [SPARK-22279][SQL][WIP] Turn on spark.sql.hive.convertMe...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19499
  
**[Test build #82765 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82765/testReport)**
 for PR 19499 at commit 
[`83cde8b`](https://github.com/apache/spark/commit/83cde8b2fcf1fb12567cd0bf7eef702186234a23).


---




[GitHub] spark pull request #19467: [SPARK-22238] Fix plan resolution bug caused by E...

2017-10-14 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19467


---




[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...

2017-10-14 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19459
  
LGTM with a few minor comments.


---




[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19459
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19459
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82764/
Test PASSed.


---




[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...

2017-10-14 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19459#discussion_r144706910
  
--- Diff: python/pyspark/sql/session.py ---
@@ -414,6 +415,43 @@ def _createFromLocal(self, data, schema):
         data = [schema.toInternal(row) for row in data]
         return self._sc.parallelize(data), schema
 
+    def _createFromPandasWithArrow(self, df, schema):
--- End diff --

nit: df -> pdf.


---




[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19459
  
**[Test build #82764 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82764/testReport)**
 for PR 19459 at commit 
[`f42e351`](https://github.com/apache/spark/commit/f42e35175969d8d7363e008a586a6f6982290447).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...

2017-10-14 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19459#discussion_r144706853
  
--- Diff: python/pyspark/sql/session.py ---
@@ -414,6 +415,43 @@ def _createFromLocal(self, data, schema):
         data = [schema.toInternal(row) for row in data]
         return self._sc.parallelize(data), schema
 
+    def _createFromPandasWithArrow(self, df, schema):
+        """
+        Create a DataFrame from a given pandas.DataFrame by slicing the into partitions, converting
+        to Arrow data, then reading into the JVM to parallelsize. If a schema is passed in, the
+        data types will be used to coerce the data in Pandas to Arrow conversion.
+        """
+        import os
+        from tempfile import NamedTemporaryFile
+        from pyspark.serializers import ArrowSerializer
+        from pyspark.sql.types import from_arrow_schema, to_arrow_schema
+        import pyarrow as pa
+
+        # Slice the DataFrame into batches
+        step = -(-len(df) // self.sparkContext.defaultParallelism)  # round int up
+        df_slices = (df[start:start + step] for start in xrange(0, len(df), step))
+        arrow_schema = to_arrow_schema(schema) if schema is not None else None
+        batches = [pa.RecordBatch.from_pandas(df_slice, schema=arrow_schema, preserve_index=False)
+                   for df_slice in df_slices]
+
+        # write batches to temp file, read by JVM (borrowed from context.parallelize)
+        tempFile = NamedTemporaryFile(delete=False, dir=self._sc._temp_dir)
--- End diff --

This looks like it largely duplicates the main logic of `context.parallelize`. 
Maybe we can extract a common function from it.
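
For readers following the batching step in the diff above, here is a minimal sketch of the same ceiling-division slicing on a plain Python list. The name `slice_into_batches` and the guards for empty input and a non-positive partition count are assumptions made for illustration; they are not part of the PR.

```python
def slice_into_batches(rows, num_partitions):
    """Split `rows` into at most `num_partitions` contiguous slices.

    Hypothetical helper, not taken from the PR; it mirrors the
    `step = -(-len(df) // n)` idiom quoted in the diff.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if len(rows) == 0:
        # Guard added in this sketch: a step of 0 would make range() raise.
        return []
    # -(-a // b) rounds a / b up using integer arithmetic only.
    step = -(-len(rows) // num_partitions)
    return [rows[start:start + step] for start in range(0, len(rows), step)]


if __name__ == "__main__":
    # Ten rows over four partitions gives a step of 3.
    print(slice_into_batches(list(range(10)), 4))
```

Because the step is rounded up, the last slice may be shorter than the others, and the number of slices never exceeds the requested partition count.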


---




[GitHub] spark issue #19467: [SPARK-22238] Fix plan resolution bug caused by EnsureSt...

2017-10-14 Thread tdas
Github user tdas commented on the issue:

https://github.com/apache/spark/pull/19467
  
Merging to master


---




[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...

2017-10-14 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19459#discussion_r144706672
  
--- Diff: python/pyspark/sql/session.py ---
@@ -414,6 +415,43 @@ def _createFromLocal(self, data, schema):
         data = [schema.toInternal(row) for row in data]
         return self._sc.parallelize(data), schema
 
+    def _createFromPandasWithArrow(self, df, schema):
+        """
+        Create a DataFrame from a given pandas.DataFrame by slicing the into partitions, converting
--- End diff --

typo: slicing the.


---




[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...

2017-10-14 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/19459
  
Thanks for the reviews @ueshin and @HyukjinKwon!  I added `to_arrow_schema` 
conversion for when a schema is passed into `createDataFrame` and added some 
new tests to verify it. Please take another look when you can, thanks!


---




[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19459
  
**[Test build #82764 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82764/testReport)**
 for PR 19459 at commit 
[`f42e351`](https://github.com/apache/spark/commit/f42e35175969d8d7363e008a586a6f6982290447).


---




[GitHub] spark issue #19499: [SPARK-22279][SQL][WIP] Turn on spark.sql.hive.convertMe...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19499
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82763/
Test FAILed.


---




[GitHub] spark issue #19499: [SPARK-22279][SQL][WIP] Turn on spark.sql.hive.convertMe...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19499
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19499: [SPARK-22279][SQL][WIP] Turn on spark.sql.hive.convertMe...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19499
  
**[Test build #82763 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82763/testReport)**
 for PR 19499 at commit 
[`b9c4954`](https://github.com/apache/spark/commit/b9c495490ca5b3ce07b413f9c4cc7b2f2e1d713b).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19496: [SPARK-22271][SQL]mean overflows and returns null for so...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19496
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19496: [SPARK-22271][SQL]mean overflows and returns null for so...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19496
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82762/
Test PASSed.


---




[GitHub] spark issue #19496: [SPARK-22271][SQL]mean overflows and returns null for so...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19496
  
**[Test build #82762 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82762/testReport)**
 for PR 19496 at commit 
[`de2aa69`](https://github.com/apache/spark/commit/de2aa6975c31f4c095e07a34b66b24ee39f83b01).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19499: [SPARK-22279][SQL][WIP] Turn on spark.sql.hive.convertMe...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19499
  
**[Test build #82763 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82763/testReport)**
 for PR 19499 at commit 
[`b9c4954`](https://github.com/apache/spark/commit/b9c495490ca5b3ce07b413f9c4cc7b2f2e1d713b).


---




[GitHub] spark pull request #19499: [SPARK-22279][SQL][WIP] Turn on spark.sql.hive.co...

2017-10-14 Thread dongjoon-hyun
GitHub user dongjoon-hyun opened a pull request:

https://github.com/apache/spark/pull/19499

[SPARK-22279][SQL][WIP] Turn on spark.sql.hive.convertMetastoreOrc by 
default

## What changes were proposed in this pull request?

Like Parquet, this PR aims to turn on `spark.sql.hive.convertMetastoreOrc` 
by default.

## How was this patch tested?

Pass all the existing test cases.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjoon-hyun/spark SPARK-22279

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19499.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19499


commit b9c495490ca5b3ce07b413f9c4cc7b2f2e1d713b
Author: Dongjoon Hyun 
Date:   2017-10-14T18:49:27Z

[SPARK-22279][SQL] Turn on spark.sql.hive.convertMetastoreOrc by default




---




[GitHub] spark issue #19496: [SPARK-22271][SQL]mean overflows and returns null for so...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19496
  
**[Test build #82762 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82762/testReport)**
 for PR 19496 at commit 
[`de2aa69`](https://github.com/apache/spark/commit/de2aa6975c31f4c095e07a34b66b24ee39f83b01).


---




[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

2017-10-14 Thread jomach
Github user jomach commented on the issue:

https://github.com/apache/spark/pull/19485
  
OK, so I will:
  - Create a new section for csv-datasets.
  - Add more example options to the code in JavaSQLDataSourceExample.java 
(and the .scala, .py and .r versions).
  - Link to the options in the API docs.

The effect is that the .md page will not list all the options, and people 
will need to jump to the API docs. Do you agree with this?

It would be cool if, from jekyllrb, we could create something like an iframe 
and pull the options from the Scala API. Any ideas?

Please let me know if it is OK to proceed this way.


---




[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

2017-10-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19485
  
Thanks for taking a look at this one. Actually, I thought we should add a 
chapter like 
http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets

and add links to, for example, 
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv
 for Python, 
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader@csv(paths:String*):org.apache.spark.sql.DataFrame
 for Scala and 
http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html#csv-scala.collection.Seq-
 for Java to refer to the options, rather than duplicating the option list (which 
we would then have to update in two places whenever we fix or add options).

We should probably add some links to the JSON ones too.


---




[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19498
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82761/
Test PASSed.


---




[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19498
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19498
  
**[Test build #82761 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82761/testReport)**
 for PR 19498 at commit 
[`f5a2a88`](https://github.com/apache/spark/commit/f5a2a884d860e9c8b3f98fc4ae5f10eaf3c1a0a4).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19498
  
**[Test build #82761 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82761/testReport)**
 for PR 19498 at commit 
[`f5a2a88`](https://github.com/apache/spark/commit/f5a2a884d860e9c8b3f98fc4ae5f10eaf3c1a0a4).


---




[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

2017-10-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19498
  
Hi @zsxwing, I happened to look into this one. Could you take a look and 
see if it makes sense please?

cc @zero323 (reporter) too.


---




[GitHub] spark pull request #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to av...

2017-10-14 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/19498

[SPARK-17756][PYTHON][STREAMING] Workaround to avoid return type mismatch 
in PythonTransformFunction

## What changes were proposed in this pull request?

This PR proposes to wrap the transformed RDD within `TransformFunction`. 
`PythonTransformFunction` appears to require that `_jrdd` be a `JavaRDD`.


https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/python/pyspark/streaming/util.py#L67


https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/streaming/src/main/scala/org/apache/spark/streaming/api/python/PythonDStream.scala#L43

However, some APIs can produce a `JavaPairRDD` instead, for example `zip` in 
PySpark's RDD API.
`_jrdd` can be checked as below:

```python
>>> rdd.zip(rdd)._jrdd.getClass().toString()
u'class org.apache.spark.api.java.JavaPairRDD'
```

So here I wrapped it with `map` to ensure that a `JavaRDD` is returned.

```python
>>> rdd.zip(rdd).map(lambda x: x)._jrdd.getClass().toString()
u'class org.apache.spark.api.java.JavaRDD'
```

I tried to enumerate some failure cases below:

```python
from pyspark.streaming import StreamingContext
ssc = StreamingContext(spark.sparkContext, 10)
ssc.queueStream([sc.range(10)]) \
    .transform(lambda rdd: rdd.cartesian(rdd)) \
    .pprint()
ssc.start()
```

```python
from pyspark.streaming import StreamingContext
ssc = StreamingContext(spark.sparkContext, 10)
ssc.queueStream([sc.range(10)]).foreachRDD(lambda rdd: rdd.cartesian(rdd))
ssc.start()
```

```python
from pyspark.streaming import StreamingContext
ssc = StreamingContext(spark.sparkContext, 10)
ssc.queueStream([sc.range(10)]).foreachRDD(lambda rdd: rdd.zip(rdd))
ssc.start()
```

```python
from pyspark.streaming import StreamingContext
ssc = StreamingContext(spark.sparkContext, 10)
ssc.queueStream([sc.range(10)]).foreachRDD(
    lambda rdd: rdd.zip(rdd).union(rdd.zip(rdd)))
ssc.start()
```

```python
from pyspark.streaming import StreamingContext
ssc = StreamingContext(spark.sparkContext, 10)
ssc.queueStream([sc.range(10)]).foreachRDD(
    lambda rdd: rdd.zip(rdd).coalesce(1))
ssc.start()
```

## How was this patch tested?

Unit tests were added in `python/pyspark/streaming/tests.py`, and the change 
was also tested manually.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-17756

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19498.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19498


commit f5a2a884d860e9c8b3f98fc4ae5f10eaf3c1a0a4
Author: hyukjinkwon 
Date:   2017-10-14T13:50:49Z

Workaround to avoid return type mispatch in PythonTransformFunction




---




[GitHub] spark issue #19442: [SPARK-8515][ML][WIP] Improve ML Attribute API

2017-10-14 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19442
  
@MLnick Can you take a look and give me some suggestions? Thanks.


---




[GitHub] spark issue #19442: [SPARK-8515][ML][WIP] Improve ML Attribute API

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19442
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82760/
Test PASSed.


---




[GitHub] spark issue #19442: [SPARK-8515][ML][WIP] Improve ML Attribute API

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19442
  
**[Test build #82760 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82760/testReport)**
 for PR 19442 at commit 
[`51440b4`](https://github.com/apache/spark/commit/51440b4eeefece4f899a21ef1a0a63399a9f95ac).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19442: [SPARK-8515][ML][WIP] Improve ML Attribute API

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19442
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19442: [SPARK-8515][ML][WIP] Improve ML Attribute API

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19442
  
**[Test build #82760 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82760/testReport)**
 for PR 19442 at commit 
[`51440b4`](https://github.com/apache/spark/commit/51440b4eeefece4f899a21ef1a0a63399a9f95ac).


---




[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19480
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82758/
Test PASSed.


---




[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19480
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19480
  
**[Test build #82758 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82758/testReport)**
 for PR 19480 at commit 
[`37506dc`](https://github.com/apache/spark/commit/37506dcc380cf5c14ea929b33f9e8e26efdbcb8d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19419
  
**[Test build #3947 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3947/testReport)**
 for PR 19419 at commit 
[`5c76b91`](https://github.com/apache/spark/commit/5c76b914ecbd7fd82276496151f7ed89fe519025).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19442: [SPARK-8515][ML][WIP] Improve ML Attribute API

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19442
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19442: [SPARK-8515][ML][WIP] Improve ML Attribute API

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19442
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82759/
Test PASSed.


---




[GitHub] spark issue #19442: [SPARK-8515][ML][WIP] Improve ML Attribute API

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19442
  
**[Test build #82759 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82759/testReport)**
 for PR 19442 at commit 
[`2b94dd5`](https://github.com/apache/spark/commit/2b94dd5c192b1d9302e24c0392fc9a5aaaedb596).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19222
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82757/
Test PASSed.


---




[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19222
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19222
  
**[Test build #82757 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82757/testReport)**
 for PR 19222 at commit 
[`6e8d5b8`](https://github.com/apache/spark/commit/6e8d5b820c83517d0340d748959b855229e664a7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19442: [SPARK-8515][ML][WIP] Improve ML Attribute API

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19442
  
**[Test build #82759 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82759/testReport)**
 for PR 19442 at commit 
[`2b94dd5`](https://github.com/apache/spark/commit/2b94dd5c192b1d9302e24c0392fc9a5aaaedb596).


---




[GitHub] spark pull request #19442: [SPARK-8515][ML][WIP] Improve ML Attribute API

2017-10-14 Thread viirya
GitHub user viirya reopened a pull request:

https://github.com/apache/spark/pull/19442

[SPARK-8515][ML][WIP] Improve ML Attribute API

## What changes were proposed in this pull request?

The current ML attribute API has issues such as inefficiency and being hard to 
use. This work tries to improve the API with the following main changes:

* Support Spark vector-typed attributes.
* Simplify vector-typed attribute serialization.
* Keep minimum APIs to support ML attributes.

** THIS WORK is not ready and is a work in progress.

## How was this patch tested?

Added tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 SPARK-8515

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19442.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19442


commit 77d657d8bc8102081e4b0d7b5d42a256e64514d4
Author: Liang-Chi Hsieh 
Date:   2017-10-02T15:03:54Z

Init design of ml attribute.

commit 7837778e7cbbf83851b1a2b5047f4e6a8039f809
Author: Liang-Chi Hsieh 
Date:   2017-10-03T15:03:31Z

revise.

commit 97f6848f0cbb1a76b4434930ce8938da50eaafbe
Author: Liang-Chi Hsieh 
Date:   2017-10-03T15:14:02Z

revise.

commit 2e3a3541fc7a59ac63b2118228de8015c238de40
Author: Liang-Chi Hsieh 
Date:   2017-10-04T05:15:58Z

revise.

commit 0d76eac84f5837aefebc763687fa9c5c7e1aeb4d
Author: Liang-Chi Hsieh 
Date:   2017-10-04T15:07:57Z

revise.

commit 81cca5cccfa2556ff0bba5a73764d3f503040b13
Author: Liang-Chi Hsieh 
Date:   2017-10-05T04:30:48Z

revise.

commit 4813fe8a4bd19a02b7b6bff138f04e7e50f7cdd7
Author: Liang-Chi Hsieh 
Date:   2017-10-05T06:15:53Z

revise.

commit 7951f59027418962ad95465e439bff41876ecfa8
Author: Liang-Chi Hsieh 
Date:   2017-10-05T07:51:50Z

revise.

commit a381af3edf52132086af64360789cb3a7d20d61e
Author: Liang-Chi Hsieh 
Date:   2017-10-05T09:00:02Z

Add builder and test.

commit f25c89dbded0eb9dce25d8da63a1a1aa49ad459f
Author: Liang-Chi Hsieh 
Date:   2017-10-05T15:10:11Z

revise test.

commit 7e237f38088f2375f40f9a4c97aee2e6acd54328
Author: Liang-Chi Hsieh 
Date:   2017-10-06T02:46:07Z

Add new test.

commit 77ced957e7be2169ac0c59c76f60ab9d4fcac3ef
Author: Liang-Chi Hsieh 
Date:   2017-10-06T03:57:12Z

Add more tests.

commit de0aa76199141255258d9d5b12a0d31b1758c6f1
Author: Liang-Chi Hsieh 
Date:   2017-10-06T06:17:29Z

revise.

commit d828cf3d3b13a2b2b1990bdff9593b49e53f6cf9
Author: Liang-Chi Hsieh 
Date:   2017-10-06T13:55:41Z

Add java-friendly APIs for attribute types.

commit 5844fbaef5d5825eafadb7c53196fb2132937e4e
Author: Liang-Chi Hsieh 
Date:   2017-10-09T03:24:26Z

Revise APIs.

commit da0fcef7d3370ebca97d200f01e9f2814a9ed755
Author: Liang-Chi Hsieh 
Date:   2017-10-09T03:26:15Z

revise.

commit 66be26cd7f25614137cfb9722f859f36d9f80c0c
Author: Liang-Chi Hsieh 
Date:   2017-10-09T03:47:43Z

Add default constructors to attribute types.

commit ce80ed5b693745fa4a650e508c6cd9e24350c52e
Author: Liang-Chi Hsieh 
Date:   2017-10-10T12:52:22Z

Use Array instead of Seq in APIs.

commit 2b94dd5c192b1d9302e24c0392fc9a5aaaedb596
Author: Liang-Chi Hsieh 
Date:   2017-10-14T00:21:04Z

Add more compatibility tests.




---




[GitHub] spark issue #19472: [WIP][SPARK-22246][SQL] Improve performance of UnsafeRow...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19472
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82756/
Test PASSed.


---




[GitHub] spark issue #19472: [WIP][SPARK-22246][SQL] Improve performance of UnsafeRow...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19472
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19472: [WIP][SPARK-22246][SQL] Improve performance of UnsafeRow...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19472
  
**[Test build #82756 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82756/testReport)**
 for PR 19472 at commit 
[`150e0a3`](https://github.com/apache/spark/commit/150e0a30ac4ed11c783d62d47c4404c854b03dd9).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19497
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19497
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82754/
Test PASSed.


---




[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19497
  
**[Test build #82754 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82754/testReport)**
 for PR 19497 at commit 
[`a319df3`](https://github.com/apache/spark/commit/a319df36db5bd202a14b44a09e9d1887f1633aec).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19496: [SPARK-22271][SQL]mean overflows and returns null for so...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19496
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19496: [SPARK-22271][SQL]mean overflows and returns null for so...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19496
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82755/
Test PASSed.


---




[GitHub] spark issue #19496: [SPARK-22271][SQL]mean overflows and returns null for so...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19496
  
**[Test build #82755 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82755/testReport)**
 for PR 19496 at commit 
[`a3437ee`](https://github.com/apache/spark/commit/a3437ee4a87d1f51b362adeb20d4fcc264085ba7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19494: [SPARK-22249][SQL] isin with empty list throws ex...

2017-10-14 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19494#discussion_r144690815
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
 ---
@@ -104,7 +104,8 @@ case class InMemoryTableScanExec(
 
 case In(a: AttributeReference, list: Seq[Expression]) if 
list.forall(_.isInstanceOf[Literal]) =>
   list.map(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] &&
-l.asInstanceOf[Literal] <= statsFor(a).upperBound).reduce(_ || _)
+l.asInstanceOf[Literal] <= statsFor(a).upperBound)
--- End diff --

It was a mistake, sorry. It always returned `false`.
I see what you mean, but in this piece of code we are only building the 
`Expression` and we are not evaluating it. Thus it is not possible to 
short-circuit, because the `Expression` must be built entirely.
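
As an illustration of the point above, here is a minimal Python sketch, not Spark's Catalyst API: `Lit`, `Or`, and `build_in_filter` are hypothetical names invented for this example. It shows that a `reduce` without an initial value fails on an empty list, while seeding it with a false literal gives a well-defined expression. The whole tree is built either way, because nothing is evaluated at build time.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Lit:
    """Hypothetical literal boolean expression node."""
    value: bool

    def eval(self) -> bool:
        return self.value


@dataclass(frozen=True)
class Or:
    """Hypothetical OR node; evaluation is deferred until eval() is called."""
    left: object
    right: object

    def eval(self) -> bool:
        return self.left.eval() or self.right.eval()


def build_in_filter(predicates):
    """Combine predicate expressions with OR.

    Seeding the fold with Lit(False) means an empty list produces
    Lit(False) instead of raising, as reduce() without a seed would.
    """
    return reduce(Or, predicates, Lit(False))
```

Short-circuiting can only happen later, when `eval()` runs on the finished tree; the fold itself always visits every element.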


---




[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19480
  
**[Test build #82758 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82758/testReport)**
 for PR 19480 at commit 
[`37506dc`](https://github.com/apache/spark/commit/37506dcc380cf5c14ea929b33f9e8e26efdbcb8d).


---




[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19419
  
**[Test build #3947 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3947/testReport)**
 for PR 19419 at commit 
[`5c76b91`](https://github.com/apache/spark/commit/5c76b914ecbd7fd82276496151f7ed89fe519025).


---




[GitHub] spark pull request #19419: [SPARK-22188] [CORE] Adding security headers for ...

2017-10-14 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19419#discussion_r144689872
  
--- Diff: docs/configuration.md ---
@@ -2013,7 +2013,6 @@ Apart from these, the following properties are also 
available, and may be useful
 
 
 
-
--- End diff --

If you have to change the pull request again, I'd revert this, but no need 
to change it only for this


---




[GitHub] spark pull request #19419: [SPARK-22188] [CORE] Adding security headers for ...

2017-10-14 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19419#discussion_r144689866
  
--- Diff: docs/security.md ---
@@ -186,7 +186,52 @@ configure those ports.
   
 
 
+### HTTP Security Headers
+
+Apache Spark can be configured to include HTTP Headers which aids in 
preventing Cross 
+Site Scripting (XSS), Cross-Frame Scripting (XFS), MIME-Sniffing and also 
enforces HTTP 
+Strict Transport Security.
+
+
+Property NameDefaultMeaning
+
+spark.ui.xXssProtection
+None
+
+Value for HTTP X-XSS-Protection response header. You can 
choose appropriate value 
+from below:
+
--- End diff --

Why not just leave this as a bulleted list? Not a big deal, I guess; it's just 
less conventional for HTML.


---




[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

2017-10-14 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19464


---




[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

2017-10-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19464
  
Merged to master.


---




[GitHub] spark pull request #19496: [SPARK-22271][SQL]mean overflows and returns null...

2017-10-14 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19496#discussion_r144689475
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala
 ---
@@ -80,7 +80,8 @@ case class Average(child: Expression) extends 
DeclarativeAggregate with Implicit
 case DecimalType.Fixed(p, s) =>
   // increase the precision and scale to prevent precision loss
   val dt = DecimalType.bounded(p + 14, s + 4)
-  Cast(Cast(sum, dt) / Cast(count, dt), resultType)
+  Cast(Cast(sum, dt) / Cast(count, DecimalType.bounded 
(DecimalType.MAX_PRECISION, 0)),
--- End diff --

No need to add space after `bounded`.


---




[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

2017-10-14 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19464#discussion_r144687101
  
--- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala ---
@@ -510,4 +510,83 @@ class FileSuite extends SparkFunSuite with 
LocalSparkContext {
 }
   }
 
+  test("spark.files.ignoreEmptySplits work correctly (old Hadoop API)") {
+val conf = new SparkConf()
+conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, 
true)
+sc = new SparkContext(conf)
+
+def testIgnoreEmptySplits(
+data: Array[Tuple2[String, String]],
+actualPartitionNum: Int,
+expectedPartitionNum: Int): Unit = {
+  val output = new File(tempDir, "output")
+  sc.parallelize(data, actualPartitionNum)
+.saveAsHadoopFile[TextOutputFormat[String, String]](output.getPath)
+  for (i <- 0 until actualPartitionNum) {
+assert(new File(output, s"part-$i").exists() === true)
+  }
+  val hadoopRDD = sc.textFile(new File(output, "part-*").getPath)
+  assert(hadoopRDD.partitions.length === expectedPartitionNum)
+  Utils.deleteRecursively(output)
+}
+
+// Ensure that if all of the splits are empty, we remove the splits 
correctly
+testIgnoreEmptySplits(
+  data = Array.empty[Tuple2[String, String]],
+  actualPartitionNum = 1,
+  expectedPartitionNum = 0)
+
+// Ensure that if no split is empty, we don't lose any splits
+testIgnoreEmptySplits(
+  data = Array(("key1", "a"), ("key2", "a"), ("key3", "b")),
+  actualPartitionNum = 2,
+  expectedPartitionNum = 2)
+
+// Ensure that if part of the splits are empty, we remove the splits 
correctly
+testIgnoreEmptySplits(
+  data = Array(("key1", "a"), ("key2", "a")),
+  actualPartitionNum = 5,
+  expectedPartitionNum = 2)
+  }
+
+  test("spark.files.ignoreEmptySplits work correctly (new Hadoop API)") {
+val conf = new SparkConf()
+conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, 
true)
+sc = new SparkContext(conf)
+
+def testIgnoreEmptySplits(
+data: Array[Tuple2[String, String]],
+actualPartitionNum: Int,
+expectedPartitionNum: Int): Unit = {
+  val output = new File(tempDir, "output")
+  sc.parallelize(data, actualPartitionNum)
+.saveAsNewAPIHadoopFile[NewTextOutputFormat[String, 
String]](output.getPath)
+  for (i <- 0 until actualPartitionNum) {
+assert(new File(output, s"part-r-$i").exists() === true)
+  }
+  val hadoopRDD = sc.newAPIHadoopFile(new File(output, 
"part-r-*").getPath,
+classOf[NewTextInputFormat], classOf[LongWritable], classOf[Text])
+.asInstanceOf[NewHadoopRDD[_, _]]
--- End diff --

nit:

```scala
val hadoopRDD = sc.newAPIHadoopFile(
  new File(output, "part-r-*").getPath,
  classOf[NewTextInputFormat],
  classOf[LongWritable],
  classOf[Text]).asInstanceOf[NewHadoopRDD[_, _]]
```


---




[GitHub] spark pull request #19494: [SPARK-22249][SQL] isin with empty list throws ex...

2017-10-14 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19494#discussion_r144689325
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
 ---
@@ -104,7 +104,8 @@ case class InMemoryTableScanExec(
 
 case In(a: AttributeReference, list: Seq[Expression]) if 
list.forall(_.isInstanceOf[Literal]) =>
   list.map(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] &&
-l.asInstanceOf[Literal] <= statsFor(a).upperBound).reduce(_ || _)
+l.asInstanceOf[Literal] <= statsFor(a).upperBound)
--- End diff --

I see. How does `.contains(true)` work then? Or did that not work?
I suppose all I mean is that we should write something that works on an 
empty list (returns false?) and also short-circuits (stops when anything is 
true). Is that possible?
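
For reference, a minimal Python sketch of the two properties asked for here, assuming the checks are plain booleans that are evaluated immediately rather than Spark `Expression` objects. The function `any_in_bounds` and its counter are illustrative only and do not come from the PR.

```python
def any_in_bounds(values, lower, upper):
    """Return (result, checked): whether any value lies in [lower, upper],
    and how many values were inspected before the answer was known."""
    checked = 0

    def in_bounds(v):
        nonlocal checked
        checked += 1
        return lower <= v <= upper

    # any() over a generator stops at the first True and is False when empty.
    result = any(in_bounds(v) for v in values)
    return result, checked
```

With an empty list this returns `False` without inspecting anything, and with a match in the first position it stops after one check.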


---




[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19222
  
**[Test build #82757 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82757/testReport)**
 for PR 19222 at commit 
[`6e8d5b8`](https://github.com/apache/spark/commit/6e8d5b820c83517d0340d748959b855229e664a7).


---




[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

2017-10-14 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19480#discussion_r144688592
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala 
---
@@ -2103,4 +2103,35 @@ class DataFrameSuite extends QueryTest with 
SharedSQLContext {
   testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
   Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 
3, 1), Row(7, 3, 2)))
   }
+
+  test("SPARK-22226: splitExpressions should not generate codes beyond 64KB") {
+val colNumber = 1
+val input = spark.range(2).rdd.map(_ => Row(1 to colNumber: _*))
+val df = sqlContext.createDataFrame(input, StructType(
+  (1 to colNumber).map(colIndex => StructField(s"_$colIndex", IntegerType, false))))
+val newCols = (1 to colNumber).flatMap { colIndex =>
+  Seq(expr(s"if(1000 < _$colIndex, 1000, _$colIndex)"),
+expr(s"sqrt(_$colIndex)"))
+}
+df.select(newCols: _*).collect()
+  }
+
+  test("SPARK-22226: too many splitted expressions should not exceed constant pool limit") {
--- End diff --

Btw, since this test doesn't test what we want to test, we should remove it.


---




[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

2017-10-14 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19480#discussion_r144688579
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala 
---
@@ -2103,4 +2103,35 @@ class DataFrameSuite extends QueryTest with 
SharedSQLContext {
   testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
   Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 
3, 1), Row(7, 3, 2)))
   }
+
+  test("SPARK-22226: splitExpressions should not generate codes beyond 64KB") {
+val colNumber = 1
+val input = spark.range(2).rdd.map(_ => Row(1 to colNumber: _*))
+val df = sqlContext.createDataFrame(input, StructType(
+  (1 to colNumber).map(colIndex => StructField(s"_$colIndex", IntegerType, false))))
+val newCols = (1 to colNumber).flatMap { colIndex =>
+  Seq(expr(s"if(1000 < _$colIndex, 1000, _$colIndex)"),
+expr(s"sqrt(_$colIndex)"))
+}
+df.select(newCols: _*).collect()
+  }
+
+  test("SPARK-22226: too many splitted expressions should not exceed constant pool limit") {
--- End diff --

The unit test added to `CodeGenerationSuite` looks sufficient for 
identifying this particular issue: exceeding the constant pool limit in the 
outer class due to too many method calls.

So far it has been hard to contrive an end-to-end test purely for reproducing 
this particular issue. At least, I failed to contrive one after several tries.

So let's wait and see whether anyone has the chance or the insight to create one.

If not, I think the unit test in `CodeGenerationSuite` should be good enough.
 


---




[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...

2017-10-14 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19222
  
@tejasapatil I updated the performance results for the operations that are used more often.


---




[GitHub] spark issue #19472: [WIP][SPARK-22246][SQL] Improve performance of UnsafeRow...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19472
  
**[Test build #82756 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82756/testReport)**
 for PR 19472 at commit 
[`150e0a3`](https://github.com/apache/spark/commit/150e0a30ac4ed11c783d62d47c4404c854b03dd9).


---




[GitHub] spark issue #19496: [SPARK-22271][SQL]mean overflows and returns null for so...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19496
  
**[Test build #82755 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82755/testReport)**
 for PR 19496 at commit 
[`a3437ee`](https://github.com/apache/spark/commit/a3437ee4a87d1f51b362adeb20d4fcc264085ba7).


---




[GitHub] spark issue #19496: [SPARK-22271][SQL]mean overflows and returns null for so...

2017-10-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19496
  
ok to test


---




[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19497
  
Let me take a look with a few tests and get back to you. I also think I 
should cc @jiangxb1987.


---




[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19497
  
**[Test build #82754 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82754/testReport)**
 for PR 19497 at commit 
[`a319df3`](https://github.com/apache/spark/commit/a319df36db5bd202a14b44a09e9d1887f1633aec).


---




[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

2017-10-14 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19480#discussion_r144688322
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala 
---
@@ -2103,4 +2103,35 @@ class DataFrameSuite extends QueryTest with 
SharedSQLContext {
   testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
   Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 
3, 1), Row(7, 3, 2)))
   }
+
+  test("SPARK-22226: splitExpressions should not generate codes beyond 64KB") {
+val colNumber = 1
+val input = spark.range(2).rdd.map(_ => Row(1 to colNumber: _*))
+val df = sqlContext.createDataFrame(input, StructType(
+  (1 to colNumber).map(colIndex => StructField(s"_$colIndex", IntegerType, false))))
+val newCols = (1 to colNumber).flatMap { colIndex =>
+  Seq(expr(s"if(1000 < _$colIndex, 1000, _$colIndex)"),
+expr(s"sqrt(_$colIndex)"))
+}
+df.select(newCols: _*).collect()
+  }
+
+  test("SPARK-22226: too many splitted expressions should not exceed constant pool limit") {
--- End diff --

You are right, @viirya. Sorry, I didn't notice. Yes, the problem is that most 
of the time we hit both of these issues at the moment, so solving one is not 
enough. It turns out that there are some corner cases in which this fix is 
enough, like the real case I am working on, but they are not easy to reproduce 
in a simple way. That use case has a lot of complex projections, a 
`dropDuplicate`, and some joins after that, but the queries are made of 
thousands of lines of SQL code.
The only way I have been able to reproduce it easily is in this test case: 
https://github.com/apache/spark/pull/19480/files#r144302922.


---




[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19497
  
retest this please


---




[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19497
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19497
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82753/
Test FAILed.


---




[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19497
  
**[Test build #82753 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82753/testReport)**
 for PR 19497 at commit 
[`a319df3`](https://github.com/apache/spark/commit/a319df36db5bd202a14b44a09e9d1887f1633aec).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
