[GitHub] spark pull request #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in...

2018-08-15 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/22104#discussion_r210391237
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -3367,6 +3367,35 @@ def test_ignore_column_of_all_nulls(self):
 finally:
 shutil.rmtree(path)
 
+# SPARK-24721
+def test_datasource_with_udf_filter_lit_input(self):
+import pandas as pd
+import numpy as np
+from pyspark.sql.functions import udf, pandas_udf, lit, col
+
+path = tempfile.mkdtemp()
+shutil.rmtree(path)
+try:
+self.spark.range(1).write.mode("overwrite").format('csv').save(path)
+filesource_df = self.spark.read.csv(path)
--- End diff --

@gatorsmile Added tests for the file source, data source and data source v2. I 
might need to move the pandas_udf tests into another test class because of the 
Arrow requirement :(


---




[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-15 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21698
  
I tried a prototype to fix the handling of fetch failures; it seems not that 
hard: https://github.com/apache/spark/pull/22112


---




[GitHub] spark pull request #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in...

2018-08-15 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/22104#discussion_r210390770
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala ---
@@ -133,6 +134,9 @@ object ExtractPythonUDFs extends Rule[SparkPlan] with PredicateHelper {
   }
 
   def apply(plan: SparkPlan): SparkPlan = plan transformUp {
+// SPARK-24721: Ignore Python UDFs in DataSourceScan and DataSourceV2Scan
+case plan: DataSourceScanExec => plan
--- End diff --

I got rid of the logic previously in `FileSourceStrategy` that excluded 
PythonUDF from the filters, in favor of this fix - I think this fix is cleaner. 


---




[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22112
  
**[Test build #94821 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94821/testReport)**
 for PR 22112 at commit 
[`1f9f6e5`](https://github.com/apache/spark/commit/1f9f6e5b020038be1e7c11b9923010465da385aa).


---




[GitHub] spark issue #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in FileSo...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22104
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() da...

2018-08-15 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/22112

[WIP][SPARK-23243][Core] Fix RDD.repartition() data correctness issue

## What changes were proposed in this pull request?

An alternative fix for https://github.com/apache/spark/pull/21698

An RDD can take an arbitrary user function, but we have an assumption: the 
function should produce the same data set for the same input, though the order 
can change.

The Spark scheduler must take care of this assumption when a fetch failure 
happens, otherwise we may hit the correctness issue the JIRA ticket describes.

Generally speaking, when a map stage gets retried because of a fetch failure, 
and this map stage is not idempotent (it produces the same data set but in a 
different order each time), and the shuffle partitioner is sensitive to the 
input data order (like a round-robin partitioner), we should retry all the 
reduce tasks.

TODO: document and test
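(Illustrative sketch, not code from this PR: why a round-robin partitioner is order-sensitive. The reducer a row lands in depends only on the row's position in the iterator, so a retried map task that emits the same rows in a different order sends them to different reducers.)

```scala
// Same data set, different iteration order -- what a non-idempotent
// map stage can produce on retry.
val firstRun = Seq("a", "b", "c", "d")
val retryRun = Seq("b", "a", "d", "c")

// Round-robin assignment: partition = position % numPartitions.
def roundRobin(rows: Seq[String], numPartitions: Int): Map[Int, Seq[String]] =
  rows.zipWithIndex
    .groupBy { case (_, i) => i % numPartitions }
    .map { case (p, pairs) => p -> pairs.map(_._1) }

// Partition 0 differs between the two runs, so reducers that already
// fetched the first run's output see inconsistent data unless all
// reduce tasks are retried.
println(roundRobin(firstRun, 2)(0)) // List(a, c)
println(roundRobin(retryRun, 2)(0)) // List(b, d)
```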

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark repartition

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22112.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22112


commit 1f9f6e5b020038be1e7c11b9923010465da385aa
Author: Wenchen Fan 
Date:   2018-08-15T18:38:24Z

fix repartition+shuffle bug




---




[GitHub] spark issue #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in FileSo...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22104
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2223/
Test PASSed.


---




[GitHub] spark pull request #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in...

2018-08-15 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/22104#discussion_r210390399
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvalPythonExec.scala ---
@@ -117,15 +117,16 @@ abstract class EvalPythonExec(udfs: Seq[PythonUDF], output: Seq[Attribute], chil
   }
 }.toArray
   }.toArray
-  val projection = newMutableProjection(allInputs, child.output)
+  val projection = UnsafeProjection.create(allInputs, child.output)
   val schema = StructType(dataTypes.zipWithIndex.map { case (dt, i) =>
 StructField(s"_$i", dt)
   })
 
   // Add rows to queue to join later with the result.
   val projectedRowIter = iter.map { inputRow =>
-queue.add(inputRow.asInstanceOf[UnsafeRow])
-projection(inputRow)
+val unsafeRow = projection(inputRow)
+queue.add(unsafeRow.asInstanceOf[UnsafeRow])
--- End diff --

This is probably another bug I found while testing this - if the input node to 
EvalPythonExec doesn't produce UnsafeRow, the cast here will fail. 

I found this when I passed in a test data source scan node, which produces 
GenericInternalRow and therefore throws an exception here.

I am happy to submit this as a separate patch if people think it's 
necessary 
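(Illustrative sketch against Spark's internal catalyst APIs, not code from this PR: casting an arbitrary `InternalRow` to `UnsafeRow` throws, while running the row through an `UnsafeProjection` first always yields an `UnsafeRow`.)

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, UnsafeProjection, UnsafeRow}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType(Seq(StructField("i", IntegerType)))
// A row from a scan that doesn't produce UnsafeRow, e.g. a test data source:
val row: InternalRow = new GenericInternalRow(Array[Any](1))

// row.asInstanceOf[UnsafeRow]            // would throw ClassCastException
val toUnsafe = UnsafeProjection.create(schema)
val unsafeRow: UnsafeRow = toUnsafe(row)  // safe for any InternalRow input
```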


---




[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21977
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21889
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21977
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94804/
Test FAILed.


---




[GitHub] spark issue #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in FileSo...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22104
  
**[Test build #94820 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94820/testReport)**
 for PR 22104 at commit 
[`38f3dbb`](https://github.com/apache/spark/commit/38f3dbbbd7d77b59b8441daf14f3a94ead1401b9).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21889
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94805/
Test PASSed.


---




[GitHub] spark issue #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in FileSo...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22104
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94820/
Test FAILed.


---




[GitHub] spark issue #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in FileSo...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22104
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21977
  
**[Test build #94804 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94804/testReport)**
 for PR 21977 at commit 
[`6810fc7`](https://github.com/apache/spark/commit/6810fc7f30131750a1a472467b7984a2c9150a9b).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21889
  
**[Test build #94805 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94805/testReport)**
 for PR 21889 at commit 
[`8d822ee`](https://github.com/apache/spark/commit/8d822eea805e1b2dc40b866ca8ac4893e53ad51b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in FileSo...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22104
  
**[Test build #94820 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94820/testReport)**
 for PR 22104 at commit 
[`38f3dbb`](https://github.com/apache/spark/commit/38f3dbbbd7d77b59b8441daf14f3a94ead1401b9).


---




[GitHub] spark issue #22013: [SPARK-23939][SQL] Add transform_keys function

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22013
  
**[Test build #94819 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94819/testReport)**
 for PR 22013 at commit 
[`2f4943f`](https://github.com/apache/spark/commit/2f4943f3cec0705c296b2988c415ac3372b7ea86).


---




[GitHub] spark pull request #22106: [SPARK-25116][TESTS]Fix the Kafka cluster leak an...

2018-08-15 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/22106#discussion_r210387003
  
--- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSinkSuite.scala ---
@@ -40,12 +40,7 @@ class KafkaContinuousSinkSuite extends KafkaContinuousTest {
 
   override val streamingTimeout = 30.seconds
 
-  override def beforeAll(): Unit = {
-super.beforeAll()
-testUtils = new KafkaTestUtils(
-  withBrokerProps = Map("auto.create.topics.enable" -> "false"))
-testUtils.setup()
-  }
+  override val brokerProps = Map("auto.create.topics.enable" -> "false")
--- End diff --

This is the fix for the Kafka cluster leak.
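(A minimal, self-contained sketch of the leak pattern being removed, with `KafkaTestUtils` stubbed since it is a Spark-internal test helper: the overridden `beforeAll` used to create a second cluster on top of the one the parent trait manages, and the first cluster was never torn down. Making the broker properties an overridable member lets the parent own a single cluster's lifecycle.)

```scala
// Stub standing in for Spark's internal KafkaTestUtils.
class KafkaTestUtilsStub(props: Map[String, String]) {
  def setup(): Unit = println(s"cluster up: $props")
  def teardown(): Unit = println("cluster down")
}

trait KafkaTestBase {
  def brokerProps: Map[String, String] = Map.empty
  var testUtils: KafkaTestUtilsStub = _
  def beforeAll(): Unit = {
    // The parent creates and owns the only cluster.
    testUtils = new KafkaTestUtilsStub(brokerProps)
    testUtils.setup()
  }
  def afterAll(): Unit = testUtils.teardown()
}

// The suite no longer re-implements beforeAll (which orphaned the parent's
// cluster); it only customizes the broker properties.
class SinkSuite extends KafkaTestBase {
  override val brokerProps = Map("auto.create.topics.enable" -> "false")
}
```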


---




[GitHub] spark issue #22106: [SPARK-25116][TESTS]Fix the Kafka cluster leak and clean...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22106
  
**[Test build #4262 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4262/testReport)**
 for PR 22106 at commit 
[`63cc11d`](https://github.com/apache/spark/commit/63cc11dfa575ac25ee3751a93a2cb5a6b9886218).


---




[GitHub] spark issue #22106: [SPARK-25116][TESTS]Fix the Kafka cluster leak and clean...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22106
  
**[Test build #4261 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4261/testReport)**
 for PR 22106 at commit 
[`63cc11d`](https://github.com/apache/spark/commit/63cc11dfa575ac25ee3751a93a2cb5a6b9886218).


---




[GitHub] spark issue #22106: [SPARK-25116][TESTS]Fix the Kafka cluster leak and clean...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22106
  
**[Test build #4260 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4260/testReport)**
 for PR 22106 at commit 
[`63cc11d`](https://github.com/apache/spark/commit/63cc11dfa575ac25ee3751a93a2cb5a6b9886218).


---




[GitHub] spark issue #22106: [SPARK-25116][TESTS]Fix the Kafka cluster leak and clean...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22106
  
**[Test build #4259 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4259/testReport)**
 for PR 22106 at commit 
[`63cc11d`](https://github.com/apache/spark/commit/63cc11dfa575ac25ee3751a93a2cb5a6b9886218).


---




[GitHub] spark issue #21308: SPARK-24253: Add DeleteSupport mix-in for DataSourceV2.

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21308
  
**[Test build #94818 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94818/testReport)**
 for PR 21308 at commit 
[`e32e6c4`](https://github.com/apache/spark/commit/e32e6c4a4c3df527a5fddb8b694b0ed303e16fc4).


---




[GitHub] spark issue #22013: [SPARK-23939][SQL] Add transform_keys function

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22013
  
**[Test build #94817 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94817/testReport)**
 for PR 22013 at commit 
[`58b60b2`](https://github.com/apache/spark/commit/58b60b2f851fb1464743257fe1cca075a1e77ba9).


---




[GitHub] spark issue #21308: SPARK-24253: Add DeleteSupport mix-in for DataSourceV2.

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21308
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21308: SPARK-24253: Add DeleteSupport mix-in for DataSourceV2.

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21308
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified//
Test PASSed.


---




[GitHub] spark issue #22106: [SPARK-25116][TESTS]Fix the Kafka cluster leak and clean...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22106
  
**[Test build #94816 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94816/testReport)**
 for PR 22106 at commit 
[`63cc11d`](https://github.com/apache/spark/commit/63cc11dfa575ac25ee3751a93a2cb5a6b9886218).


---




[GitHub] spark issue #22106: [SPARK-25116][TESTS]Fix the Kafka cluster leak and clean...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22106
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22106: [SPARK-25116][TESTS]Fix the Kafka cluster leak and clean...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22106
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2221/
Test PASSed.


---




[GitHub] spark issue #22106: [SPARK-25116][TESTS]Fix the Kafka cluster leak and clean...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22106
  
**[Test build #94815 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94815/testReport)**
 for PR 22106 at commit 
[`621099e`](https://github.com/apache/spark/commit/621099e39bd1faa8235f2a2a6e7db68939b68f81).


---




[GitHub] spark pull request #22106: [SPARK-25116][TESTS]Fix the Kafka cluster leak an...

2018-08-15 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/22106#discussion_r210383608
  
--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala ---
@@ -216,7 +216,7 @@ class KafkaContinuousInputPartitionReader(
   } catch {
 // We didn't read within the timeout. We're supposed to block indefinitely for new data, so
 // swallow and ignore this.
-case _: TimeoutException =>
+case _: TimeoutException | _: org.apache.kafka.common.errors.TimeoutException =>
--- End diff --

This is to fix 
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4254/

`org.apache.kafka.common.errors.TimeoutException: Timeout of 3000ms expired 
before the position for partition failOnDataLoss-2-0 could be determined` 
triggered a task retry, but since continuous processing doesn't support task 
retries, it failed with 
`org.apache.spark.sql.execution.streaming.continuous.ContinuousTaskRetryException:
 Continuous execution does not support task retry`.


---




[GitHub] spark issue #22106: [SPARK-25116][TESTS]Fix the Kafka cluster leak and clean...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22106
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2219/
Test PASSed.


---




[GitHub] spark issue #21308: SPARK-24253: Add DeleteSupport mix-in for DataSourceV2.

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21308
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2220/
Test PASSed.


---




[GitHub] spark issue #21308: SPARK-24253: Add DeleteSupport mix-in for DataSourceV2.

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21308
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22106: [SPARK-25116][TESTS]Fix the Kafka cluster leak and clean...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22106
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21308: SPARK-24253: Add DeleteSupport mix-in for DataSourceV2.

2018-08-15 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21308
  
@rxin, I've updated this API to use `Filter` instead of `Expression`. I'd 
ideally like to get it in soon if you guys have a chance to review it. It's 
pretty small.

cc @cloud-fan 
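(Hypothetical sketch of the shape a `Filter`-based delete API takes; the trait and method names here are assumed for illustration, not taken from this PR. The point of the change is that sources receive stable `org.apache.spark.sql.sources.Filter` values, ANDed together, instead of catalyst `Expression`s.)

```scala
import org.apache.spark.sql.sources.{EqualTo, Filter}

// Hypothetical mix-in: delete every row matching the conjunction (AND)
// of the given public Filter values.
trait DeleteSupportSketch {
  def deleteWhere(filters: Array[Filter]): Unit
}

// A partitioned source might accept a partition-column filter such as
//   deleteWhere(Array(EqualTo("day", "2018-08-15")))
// and throw IllegalArgumentException for filters that would force
// rewriting files.
```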


---




[GitHub] spark pull request #21308: SPARK-24253: Add DeleteSupport mix-in for DataSou...

2018-08-15 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21308#discussion_r210382412
  
--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/DeleteSupport.java ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2;
+
+import org.apache.spark.sql.catalyst.expressions.Expression;
+
+/**
+ * A mix-in interface for {@link DataSourceV2} delete support. Data 
sources can implement this
+ * interface to provide the ability to delete data from tables that 
matches filter expressions.
+ * 
+ * Data sources must implement this interface to support logical 
operations that combine writing
+ * data with deleting data, like overwriting partitions.
+ */
+public interface DeleteSupport extends DataSourceV2 {
+  /**
+   * Delete data from a data source table that matches filter expressions.
+   * 
+   * Rows are deleted from the data source iff all of the filter 
expressions match. That is, the
+   * expressions must be interpreted as a set of filters that are ANDed 
together.
+   * 
+   * Implementations may reject a delete operation if the delete isn't 
possible without significant
+   * effort. For example, partitioned data sources may reject deletes that 
do not filter by
+   * partition columns because the filter may require rewriting files 
without deleted records.
+   * To reject a delete implementations should throw {@link 
IllegalArgumentException} with a clear
+   * error message that identifies which expression was rejected.
+   * 
+   * Implementations may throw {@link UnsupportedOperationException} if 
the delete operation is not
--- End diff --

After updating this to use Filter, the UnsupportedOperationException is no 
longer needed, so I removed it. That should also cut down on the confusion here.


---




[GitHub] spark issue #21308: SPARK-24253: Add DeleteSupport mix-in for DataSourceV2.

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21308
  
**[Test build #94814 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94814/testReport)**
 for PR 21308 at commit 
[`db77b9a`](https://github.com/apache/spark/commit/db77b9a3f6e2364c2107c5dfd60ff57b1d7caa0f).


---




[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21912
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21912
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2218/
Test PASSed.


---




[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21909
  
**[Test build #94813 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94813/testReport)**
 for PR 21909 at commit 
[`6b98f3e`](https://github.com/apache/spark/commit/6b98f3edf19b6ca0887224c598d6f3fa88a762d1).


---




[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21912
  
**[Test build #94812 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94812/testReport)**
 for PR 21912 at commit 
[`240936e`](https://github.com/apache/spark/commit/240936e98c6b32855236c7c635bacc0fb04acb76).


---




[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

2018-08-15 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/21561#discussion_r210373934
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -246,6 +245,16 @@ class BisectingKMeans private (
 new BisectingKMeansModel(root, this.distanceMeasure)
   }
 
+  /**
+   * Runs the bisecting k-means algorithm.
+   * @param input RDD of vectors
+   * @return model for the bisecting kmeans
+   */
+  @Since("1.6.0")
--- End diff --

Nit: this should be since 2.4?


---




[GitHub] spark pull request #22013: [SPARK-23939][SQL] Add transform_keys function

2018-08-15 Thread mn-mikke
Github user mn-mikke commented on a diff in the pull request:

https://github.com/apache/spark/pull/22013#discussion_r210373675
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -2302,6 +2302,97 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
 assert(ex5.getMessage.contains("function map_zip_with does not support ordering on type map"))
   }
 
+  test("transform keys function - test various primitive data types 
combinations") {
+val dfExample1 = Seq(
+  Map[Int, Int](1 -> 1, 9 -> 9, 8 -> 8, 7 -> 7)
+).toDF("i")
+
+val dfExample2 = Seq(
+  Map[Int, Double](1 -> 1.0E0, 2 -> 1.4E0, 3 -> 1.7E0)
+).toDF("j")
+
+val dfExample3 = Seq(
+  Map[Int, Boolean](25 -> true, 26 -> false)
+).toDF("x")
+
+val dfExample4 = Seq(
+  Map[Array[Int], Boolean](Array(1, 2) -> false)
+).toDF("y")
+
+
+def testMapOfPrimitiveTypesCombination(): Unit = {
+  checkAnswer(dfExample1.selectExpr("transform_keys(i, (k, v) -> k + v)"),
+    Seq(Row(Map(2 -> 1, 18 -> 9, 16 -> 8, 14 -> 7))))
+
+  checkAnswer(dfExample2.selectExpr("transform_keys(j, " +
+    "(k, v) -> map_from_arrays(ARRAY(1, 2, 3), ARRAY('one', 'two', 'three'))[k])"),
+    Seq(Row(Map("one" -> 1.0, "two" -> 1.4, "three" -> 1.7))))
+
+  checkAnswer(dfExample2.selectExpr("transform_keys(j, (k, v) -> CAST(v * 2 AS BIGINT) + k)"),
+    Seq(Row(Map(3 -> 1.0, 4 -> 1.4, 6 -> 1.7))))
+
+  checkAnswer(dfExample2.selectExpr("transform_keys(j, (k, v) -> k + v)"),
+    Seq(Row(Map(2.0 -> 1.0, 3.4 -> 1.4, 4.7 -> 1.7))))
+
+  checkAnswer(dfExample3.selectExpr("transform_keys(x, (k, v) -> k % 2 = 0 OR v)"),
+    Seq(Row(Map(true -> true, true -> false))))
+
+  checkAnswer(dfExample3.selectExpr("transform_keys(x, (k, v) -> if(v, 2 * k, 3 * k))"),
+    Seq(Row(Map(50 -> true, 78 -> false))))
+
+  checkAnswer(dfExample3.selectExpr("transform_keys(x, (k, v) -> if(v, 2 * k, 3 * k))"),
+    Seq(Row(Map(50 -> true, 78 -> false))))
+
+  checkAnswer(dfExample4.selectExpr("transform_keys(y, (k, v) -> array_contains(k, 3) AND v)"),
+    Seq(Row(Map(false -> false))))
+}
+// Test with local relation, the Project will be evaluated without codegen
+testMapOfPrimitiveTypesCombination()
+dfExample1.cache()
+dfExample2.cache()
+dfExample3.cache()
+dfExample4.cache()
+// Test with cached relation, the Project will be evaluated with codegen
+testMapOfPrimitiveTypesCombination()
+  }
+
+  test("transform keys function - Invalid lambda functions and 
exceptions") {
+val dfExample1 = Seq(
+  Map[Int, Int](1 -> 1, 9 -> 9, 8 -> 8, 7 -> 7)
+).toDF("i")
+
+val dfExample2 = Seq(
+  Map[String, String]("a" -> "b")
+).toDF("j")
+
+val dfExample3 = Seq(
+  Map[String, String]("a" -> null)
+).toDF("x")
+
+def testInvalidLambdaFunctions(): Unit = {
+  val ex1 = intercept[AnalysisException] {
+    dfExample1.selectExpr("transform_keys(i, k -> k)")
+  }
+  assert(ex1.getMessage.contains("The number of lambda function arguments '1' does not match"))
+
+  val ex2 = intercept[AnalysisException] {
+    dfExample2.selectExpr("transform_keys(j, (k, v, x) -> k + 1)")
+  }
+  assert(ex2.getMessage.contains("The number of lambda function arguments '3' does not match"))
+
+  val ex3 = intercept[RuntimeException] {
+    dfExample3.selectExpr("transform_keys(x, (k, v) -> v)").show()
+  }
+  assert(ex3.getMessage.contains("Cannot use null as map key!"))
+}
+
+testInvalidLambdaFunctions()
+dfExample1.cache()
+dfExample2.cache()
+testInvalidLambdaFunctions()
--- End diff --

@ueshin I would like to ask you a generic question regarding higher-order 
functions. Is it necessary to perform checks with codegen paths if all the 
newly added functions extend ```CodegenFallback```? And is there a plan to add 
codegen for these functions in the future? 


---




[GitHub] spark issue #22084: [SPARK-25026][BUILD] Binary releases should contain some...

2018-08-15 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22084
  
I'm going to say let's leave this alone for now, then, if there isn't 
consensus. We can easily reopen it later if someone wants to just add the JARs.


---




[GitHub] spark pull request #22084: [SPARK-25026][BUILD] Binary releases should conta...

2018-08-15 Thread srowen
Github user srowen closed the pull request at:

https://github.com/apache/spark/pull/22084


---




[GitHub] spark pull request #22013: [SPARK-23939][SQL] Add transform_keys function

2018-08-15 Thread mn-mikke
Github user mn-mikke commented on a diff in the pull request:

https://github.com/apache/spark/pull/22013#discussion_r210366383
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala ---
@@ -497,6 +497,62 @@ case class ArrayAggregate(
   override def prettyName: String = "aggregate"
 }
 
+/**
+ * Transform Keys for every entry of the map by applying the transform_keys function.
+ * Returns map with transformed key entries
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(expr, func) - Transforms elements in a map using the function.",
+  examples = """
+Examples:
+  > SELECT _FUNC_(map(array(1, 2, 3), array(1, 2, 3)), (k, v) -> k + 1);
+   map(array(2, 3, 4), array(1, 2, 3))
+  > SELECT _FUNC_(map(array(1, 2, 3), array(1, 2, 3)), (k, v) -> k + v);
+   map(array(2, 4, 6), array(1, 2, 3))
+  """,
+  since = "2.4.0")
+case class TransformKeys(
+argument: Expression,
+function: Expression)
+  extends MapBasedSimpleHigherOrderFunction with CodegenFallback {
+
+  override def nullable: Boolean = argument.nullable
+
+  @transient lazy val MapType(keyType, valueType, valueContainsNull) = argument.dataType
+
+  override def dataType: DataType = {
+MapType(function.dataType, valueType, valueContainsNull)
--- End diff --

nit: just in one line?


---




[GitHub] spark pull request #22013: [SPARK-23939][SQL] Add transform_keys function

2018-08-15 Thread mn-mikke
Github user mn-mikke commented on a diff in the pull request:

https://github.com/apache/spark/pull/22013#discussion_r210368929
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala ---
@@ -497,6 +497,62 @@ case class ArrayAggregate(
   override def prettyName: String = "aggregate"
 }
 
+/**
+ * Transform Keys for every entry of the map by applying the transform_keys function.
+ * Returns map with transformed key entries
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(expr, func) - Transforms elements in a map using the function.",
+  examples = """
+Examples:
+  > SELECT _FUNC_(map(array(1, 2, 3), array(1, 2, 3)), (k, v) -> k + 1);
+   map(array(2, 3, 4), array(1, 2, 3))
+  > SELECT _FUNC_(map(array(1, 2, 3), array(1, 2, 3)), (k, v) -> k + v);
+   map(array(2, 4, 6), array(1, 2, 3))
+  """,
+  since = "2.4.0")
+case class TransformKeys(
+argument: Expression,
+function: Expression)
+  extends MapBasedSimpleHigherOrderFunction with CodegenFallback {
+
+  override def nullable: Boolean = argument.nullable
+
+  @transient lazy val MapType(keyType, valueType, valueContainsNull) = argument.dataType
+
+  override def dataType: DataType = {
+MapType(function.dataType, valueType, valueContainsNull)
+  }
+
+  override def bind(f: (Expression, Seq[(DataType, Boolean)]) => LambdaFunction): TransformKeys = {
+    copy(function = f(function, (keyType, false) :: (valueType, valueContainsNull) :: Nil))
+  }
+
+  @transient lazy val LambdaFunction(
+    _, (keyVar: NamedLambdaVariable) :: (valueVar: NamedLambdaVariable) :: Nil, _) = function
+
+  override def nullSafeEval(inputRow: InternalRow, argumentValue: Any): Any = {
+    val map = argumentValue.asInstanceOf[MapData]
+    val f = functionForEval
--- End diff --

Can't we use ```functionForEval``` directly?


---




[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...

2018-08-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21537
  
AnalysisBarrier does not introduce a behavior change. However, it requires 
that our analyzer rules be idempotent. The most recent correctness bug also 
shows another big potential hole: 
https://issues.apache.org/jira/browse/SPARK-25051. It could break Analyzer 
rules without notice, since AnalysisBarrier is a leaf node. Basically, this was 
a disaster for the Spark 2.3 release. You can count how many regressions we 
found after the release. This is bad. Unfortunately, I made the merge without 
enough consideration. If possible, I would revert that PR, but it is too late. 

Now, we are facing a similar issue again. We should stop continuing in this 
direction without a proper design and review. Designing/enhancing the codegen 
framework requires more input from the experts in this area. I do not think 
the current design has been discussed well enough, even if I saw some 
discussions in the initial PR. I am not an expert in this area. That is why I 
pinged @kiszk to drive this. 

BTW, the internal APIs we can change easily, but the design needs a review, 
especially for the core components of Spark SQL. 


---




[GitHub] spark issue #22101: [SPARK-25114][Core] Fix RecordBinaryComparator when subt...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22101
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22101: [SPARK-25114][Core] Fix RecordBinaryComparator when subt...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22101
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94803/
Test PASSed.


---




[GitHub] spark issue #22101: [SPARK-25114][Core] Fix RecordBinaryComparator when subt...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22101
  
**[Test build #94803 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94803/testReport)**
 for PR 22101 at commit 
[`9c1f486`](https://github.com/apache/spark/commit/9c1f4860ea794e60a127ae3ec2d1b518a182edf5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22042
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94809/
Test PASSed.


---




[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22042
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22042
  
**[Test build #94809 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94809/testReport)**
 for PR 22042 at commit 
[`a9b00b4`](https://github.com/apache/spark/commit/a9b00b4a22f0b6b364cd1b35e2d99923d8b233dc).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22042
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94808/
Test PASSed.


---




[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22042
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22042
  
**[Test build #94808 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94808/testReport)**
 for PR 22042 at commit 
[`baef29f`](https://github.com/apache/spark/commit/baef29f2983560c8010681c9bb7e74f711c8f2e7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20637: [SPARK-23466][SQL] Remove redundant null checks in gener...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20637
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94806/
Test FAILed.


---




[GitHub] spark issue #20637: [SPARK-23466][SQL] Remove redundant null checks in gener...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20637
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...

2018-08-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21537
  
I am sorry if I misunderstood. This was discussed sufficiently in the 
linked PRs. These are internal APIs and do not introduce behaviour changes. 
AnalysisBarrier was a good enough idea to try because it brought many benefits 
by reducing plan reanalysis - it resolved many JIRAs, on the other hand - and 
I don't think AnalysisBarrier was particularly a disaster. Sometimes we should 
take a step forward and see whether it causes an actual problem or not. That 
was a good enough try.

When something should be reverted, there should usually be specific 
concerns; otherwise, I wouldn't revert this for vague concerns for now.

This is an internal API that we can freely change later as well. Also, 
@viirya made a lot of effort here and he's pretty active. I don't think 
dragging someone into this is a good idea for now.


---




[GitHub] spark issue #20637: [SPARK-23466][SQL] Remove redundant null checks in gener...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20637
  
**[Test build #94806 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94806/testReport)**
 for PR 20637 at commit 
[`99731ca`](https://github.com/apache/spark/commit/99731ca058f0d8946397530aff76d3c55fa93162).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  case class Schema(dataType: DataType, nullable: Boolean)`


---




[GitHub] spark issue #22013: [SPARK-23939][SQL] Add transform_keys function

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22013
  
**[Test build #94811 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94811/testReport)**
 for PR 22013 at commit 
[`fb885f4`](https://github.com/apache/spark/commit/fb885f4797e72d0c2cbfa23980199c71e0c5aaee).


---




[GitHub] spark issue #22001: [SPARK-24819][CORE] Fail fast when no enough slots to la...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22001
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21912
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21912
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2217/
Test PASSed.


---




[GitHub] spark issue #22001: [SPARK-24819][CORE] Fail fast when no enough slots to la...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22001
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94801/
Test PASSed.


---




[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21912
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #22001: [SPARK-24819][CORE] Fail fast when no enough slots to la...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22001
  
**[Test build #94801 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94801/testReport)**
 for PR 22001 at commit 
[`c9036aa`](https://github.com/apache/spark/commit/c9036aab22cfd6b7a4939f4b23741612706ba2a6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21912
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94810/
Test FAILed.


---




[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21912
  
**[Test build #94810 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94810/testReport)**
 for PR 21912 at commit 
[`b899259`](https://github.com/apache/spark/commit/b899259125ed2c7b94c0b121d6cbdeb4ee90285a).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21912
  
**[Test build #94810 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94810/testReport)**
 for PR 21912 at commit 
[`b899259`](https://github.com/apache/spark/commit/b899259125ed2c7b94c0b121d6cbdeb4ee90285a).


---




[GitHub] spark pull request #22110: [SPARK-25122][SQL] Deduplication of supports equa...

2018-08-15 Thread mn-mikke
Github user mn-mikke commented on a diff in the pull request:

https://github.com/apache/spark/pull/22110#discussion_r210357474
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala ---
@@ -73,4 +73,14 @@ object TypeUtils {
 }
 x.length - y.length
   }
+
+  /**
+   * Returns true if elements of the data type could be used as items of a hash set or as keys
--- End diff --

What about changing the name of the method to ```typeCanBeHashed```?


---




[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...

2018-08-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21537
  
This is not about credit. I think we need to introduce an IR, like what a 
compiler does, instead of continuously improving the existing one, which is 
already very hacky and hard to debug and maintain. We are wasting effort and 
resources if we follow the current direction. 


---




[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22042
  
**[Test build #94809 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94809/testReport)**
 for PR 22042 at commit 
[`a9b00b4`](https://github.com/apache/spark/commit/a9b00b4a22f0b6b364cd1b35e2d99923d8b233dc).


---




[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22042
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2216/
Test PASSed.


---




[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22042
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...

2018-08-15 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/21537
  
Sure, I am fine with @kiszk leading the effort, of course; I just wanted to 
give credit to @viirya for the (current) design, as it was mostly done by him.


---




[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22042
  
**[Test build #94808 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94808/testReport)**
 for PR 22042 at commit 
[`baef29f`](https://github.com/apache/spark/commit/baef29f2983560c8010681c9bb7e74f711c8f2e7).


---




[GitHub] spark pull request #22105: [SPARK-25115] [Core] Eliminate extra memory copy ...

2018-08-15 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22105#discussion_r210354443
  
--- Diff: 
common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java
 ---
@@ -140,8 +140,24 @@ private int copyByteBuf(ByteBuf buf, 
WritableByteChannel target) throws IOExcept
 // SPARK-24578: cap the sub-region's size of returned nio buffer to 
improve the performance
 // for the case that the passed-in buffer has too many components.
 int length = Math.min(buf.readableBytes(), NIO_BUFFER_LIMIT);
--- End diff --

Maybe @normanmaurer has some insight?


---




[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22042
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22042
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2215/
Test PASSed.


---




[GitHub] spark pull request #22110: [SPARK-25122][SQL] Deduplication of supports equa...

2018-08-15 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22110#discussion_r210353947
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala 
---
@@ -73,4 +73,14 @@ object TypeUtils {
 }
 x.length - y.length
   }
+
+  /**
+   * Returns true if elements of the data type could be used as items of a 
hash set or as keys
--- End diff --

I think it is a common pattern for every well-formed class to have equals and hashCode defined in a coherent way. What we are doing here is saying either: this class has a meaningful equals method, so we can rely on it; or this class has a meaningless equals method, like the default one comparing pointers, so we cannot rely on it (see the sketch below). I am open to any suggestion too; I'd only like the description to be coherent with the method name, otherwise either the name or the description has to be changed to properly reflect what the method does.
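
A minimal illustration of that distinction, in plain Scala rather than code from this PR: arrays keep the default reference-based equals/hashCode, while case classes get coherent value-based ones, which is exactly what makes a type safe to use in sets and maps.

```scala
// Plain Scala, not code from this PR: why a reliable equals/hashCode matters
// for hash-based collections.
object EqualsExample {
  // Case classes get value-based equals and hashCode generated coherently.
  case class Point(x: Int, y: Int)

  def main(args: Array[String]): Unit = {
    // Arrays keep Java's default reference-based equals/hashCode...
    val arrays = Set(Array(1, 2), Array(1, 2))
    println(arrays.size) // 2: the two arrays look "different" to the set

    // ...while case class instances with equal fields are deduplicated.
    val points = Set(Point(1, 2), Point(1, 2))
    println(points.size) // 1
  }
}
```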


---




[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-15 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/21698
  
So I think the assumption is that task results are idempotent but not ordered. Sorry if that sounds contradictory: the data itself has to be the same on rerun, but the order of things within it doesn't. That was my general assumption. I think zip doesn't follow that, though, when the inputs aren't ordered. Not sure if there are other operations Spark supports with the same problem; we need to go through the list, I guess, unless someone already has?

I think we just need to document these operations and say the results can be inconsistent if the input is not sorted, or perhaps give them an option to also sort. Either that or we have to say we don't support unordered output at all in Spark. Thoughts on just documenting zip and others with unordered input? (A sketch of the problem follows.)
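
As a hedged sketch of the issue (an assumed local job, not code from this PR): index- and position-based operations read the intra-partition order, which a shuffle does not guarantee across task reruns.

```scala
// A minimal sketch (assumed local job, not code from this PR) of how the
// unguaranteed intra-partition order after repartition() leaks into results.
import org.apache.spark.sql.SparkSession

object UnorderedZipExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[4]").appName("zip-example").getOrCreate()
    val sc = spark.sparkContext

    // repartition() shuffles; the order of rows within each output partition
    // is not deterministic across reruns of the same task.
    val shuffled = sc.parallelize(1 to 1000, 8).repartition(8)

    // zipWithIndex assigns indices by partition and position, so if a task is
    // recomputed (e.g. after a fetch failure) and rows arrive in a different
    // order, the same value can end up with a different index.
    val indexed = shuffled.zipWithIndex()
    indexed.take(5).foreach(println)

    spark.stop()
  }
}
```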

I don't think MapReduce and Pig have this issue because they don't internally support an operation like zip; everything there is keyed, with joins and groupBy on the keys. User code there could generate the same problem, but I would claim it's the user's fault in that case.




---




[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...

2018-08-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21537
  
This is bad. The design needs to be carefully reviewed before implementing it. Basically, we break the basic principles of software engineering. It is very strange to write the design doc after introducing such a fundamental API change.

I think we need more input from people with compiler-internals expertise. That is why I proposed to let @kiszk lead this. We are not inventing anything new in our codegen framework.


---




[GitHub] spark pull request #22105: [SPARK-25115] [Core] Eliminate extra memory copy ...

2018-08-15 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22105#discussion_r210353176
  
--- Diff: 
common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java
 ---
@@ -140,8 +140,24 @@ private int copyByteBuf(ByteBuf buf, 
WritableByteChannel target) throws IOExcept
 // SPARK-24578: cap the sub-region's size of returned nio buffer to 
improve the performance
 // for the case that the passed-in buffer has too many components.
 int length = Math.min(buf.readableBytes(), NIO_BUFFER_LIMIT);
--- End diff --

That was only briefly discussed in the PR that Ryan linked above... the 
original code actually used 512k.

I think Hadoop's limit is a little low, but maybe 256k is a bit high. IIRC 
socket buffers are 32k by default on Linux, so it seems unlikely you'd be able 
to write 256k in one call (ignoring what IOUtil does internally). But maybe in 
practice it works ok.

If anyone has the time to test this out that would be great.
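
A rough starting point for such a test, as a hedged sketch: the helper below uses illustrative names (not the actual MessageWithHeader code), so one could vary the cap and measure how many bytes each channel write actually accepts.

```scala
// Illustrative sketch only; NioBufferLimit and writeCapped are made-up names,
// not Spark's actual implementation.
import java.nio.ByteBuffer
import java.nio.channels.WritableByteChannel

object CappedWrite {
  val NioBufferLimit: Int = 256 * 1024 // the 256k cap being debated

  def writeCapped(src: ByteBuffer, target: WritableByteChannel): Int = {
    val chunk = src.duplicate() // independent position/limit, shared contents
    // Hand the channel at most NioBufferLimit bytes so the temporary direct
    // buffer allocated internally stays small for huge composite buffers.
    chunk.limit(math.min(src.limit(), src.position() + NioBufferLimit))
    val written = target.write(chunk) // may be less than the cap, e.g. if the socket buffer is full
    src.position(src.position() + written) // advance only by what was actually written
    written
  }
}
```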


---




[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...

2018-08-15 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/21537
  
@gatorsmile I think this is just a partial adoption of what was proposed [here](https://github.com/apache/spark/pull/19813#issuecomment-354045400) and implemented in SPARK-24121 (https://github.com/apache/spark/pull/21193).

Although I agree with you that we should have created a design doc for this, I think it is now a bit late for it. Indeed, the base API is already there, since the design was mostly done in SPARK-24121 (and not in the scope of this PR). What is missing now, though, is a thorough adoption. There are already many PRs based on that API: SPARK-22856, SPARK-23951. And the current codebase already partially adopts it.

So I am not sure reverting this PR would be a solution: we would have to revert many PRs if we want to change the current design.

I think a design doc can be created anyway, at least to formalize what was discussed and agreed in the various PRs and to have a single source of information. We can also modify the current design according to comments that come out of discussing the design doc, but I see those more as further changes than as a reason to revert this PR, since reverting would mean reverting many things.

The last thing I'd suggest is that @viirya is probably the best person to create the design doc, since the original proposal and implementation were done by him.

Thanks.


---




[GitHub] spark pull request #22110: [SPARK-25122][SQL] Deduplication of supports equa...

2018-08-15 Thread mn-mikke
Github user mn-mikke commented on a diff in the pull request:

https://github.com/apache/spark/pull/22110#discussion_r210350632
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala 
---
@@ -73,4 +73,14 @@ object TypeUtils {
 }
 x.length - y.length
   }
+
+  /**
+   * Returns true if elements of the data type could be used as items of a 
hash set or as keys
--- End diff --

I'm open to any changes :) But if you want to explicitly mention the ```equals``` method, I would also mention ```hashCode```, which is generally needed for usage in "hash" collections. But then this is not 100% true for Spark's specialized ```OpenHashSet``` and ```OpenHashMap``` since they calculate the hash by themselves. WDYT?


---




[GitHub] spark pull request #22105: [SPARK-25115] [Core] Eliminate extra memory copy ...

2018-08-15 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22105#discussion_r210350203
  
--- Diff: 
common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java
 ---
@@ -140,8 +140,24 @@ private int copyByteBuf(ByteBuf buf, 
WritableByteChannel target) throws IOExcept
 // SPARK-24578: cap the sub-region's size of returned nio buffer to 
improve the performance
 // for the case that the passed-in buffer has too many components.
 int length = Math.min(buf.readableBytes(), NIO_BUFFER_LIMIT);
--- End diff --

Out of curiosity, how did we come up with the NIO_BUFFER_LIMIT value of 256KB?

In Hadoop, they are using 
[8KB](https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java#L3234).

For most OSes, the `write(ByteBuffer[])` API in `sun.nio.ch.IOUtil` goes one buffer at a time and gets a temporary direct buffer from the `BufferCache`, up to a limit of `IOUtil#IOV_MAX`, which is 1KB.


---




[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...

2018-08-15 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21909#discussion_r210350101
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1892,6 +1892,7 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - In version 2.3 and earlier, Spark converts Parquet Hive tables by 
default but ignores table properties like `TBLPROPERTIES (parquet.compression 
'NONE')`. This happens for ORC Hive table properties like `TBLPROPERTIES 
(orc.compress 'NONE')` in case of `spark.sql.hive.convertMetastoreOrc=true`, 
too. Since Spark 2.4, Spark respects Parquet/ORC specific table properties 
while converting Parquet/ORC Hive tables. As an example, `CREATE TABLE t(id 
int) STORED AS PARQUET TBLPROPERTIES (parquet.compression 'NONE')` would 
generate Snappy parquet files during insertion in Spark 2.3, and in Spark 2.4, 
the result would be uncompressed parquet files.
   - Since Spark 2.0, Spark converts Parquet Hive tables by default for 
better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, 
too. It means Spark uses its own ORC support by default instead of Hive SerDe. 
As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with 
Hive SerDe in Spark 2.3, and in Spark 2.4, it would be converted into Spark's 
ORC data source table and ORC vectorization would be applied. To set `false` to 
`spark.sql.hive.convertMetastoreOrc` restores the previous behavior.
   - In version 2.3 and earlier, CSV rows are considered as malformed if at 
least one column value in the row is malformed. CSV parser dropped such rows in 
the DROPMALFORMED mode or outputs an error in the FAILFAST mode. Since Spark 
2.4, CSV row is considered as malformed only when it contains malformed column 
values requested from CSV datasource, other values can be ignored. As an 
example, CSV file contains the "id,name" header and one row "1234". In Spark 
2.4, selection of the id column consists of a row with one column value 1234 
but in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore 
the previous behavior, set `spark.sql.csv.parser.columnPruning.enabled` to 
`false`.
+  - Since Spark 2.4, text-based datasources like CSV and JSON don't parse 
input lines if the required schema pushed down to the datasources is empty. The 
schema can be empty in the case of count(), for example. To set `true` to 
`spark.sql.legacy.bypassParserForEmptySchema` restores the previous behavior 
when the underlying parser is always invoked even for the empty schema. This 
option will be removed in Spark 3.0.
--- End diff --

> Does it also mean the result of count() can change if the json/csv files 
contain malformed records?

@cloud-fan The only difference I have found so far is when a whole JSON file is invalid, i.e. it has the wrong encoding, for example. In other cases `count()` returns the same result.
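
For reference, a hedged sketch of the fast path described in the migration note (the file path is hypothetical): with no columns requested, rows are merely counted rather than fully parsed since 2.4, and per the note above, setting `spark.sql.legacy.bypassParserForEmptySchema` to `true` would restore the 2.3 behavior.

```scala
// Sketch only; /tmp/example.csv is a made-up path.
import org.apache.spark.sql.SparkSession

object EmptySchemaCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("count-example").getOrCreate()

    // count() pushes down an empty required schema, so since 2.4 the CSV
    // parser can be skipped and the datasource just counts lines.
    val df = spark.read.option("header", "true").csv("/tmp/example.csv")
    println(df.count())

    spark.stop()
  }
}
```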


---




[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-15 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/21889
  
> Due to the urgency of the upcoming 2.4 code freeze, I'm going to open 
this PR to collect any feedback. This can be closed if you prefer to continue 
to the work in the original PR.

That would be my preference, yes, especially if it means less work for you.


---




[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-15 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/21320
  
> @mallman if you're planning on making more code changes, would you be 
willing to work on a shared branch or something? I've been working to 
incorporate the CR comments.

No. However, if you want to open a PR against the VideoAmp 
spark-4502-parquet_column_pruning-foundation branch, I will review your changes.


---




[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22009
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94802/
Test PASSed.


---




[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22009
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22009
  
**[Test build #94802 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94802/testReport)**
 for PR 22009 at commit 
[`e6e599a`](https://github.com/apache/spark/commit/e6e599a9630801078c046d3aec398cf4f046945c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21123: [SPARK-24045][SQL]Create base class for file data source...

2018-08-15 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21123
  
> We should do things incrementally and always prepare for the worst case.

This is why I'm pushing for Append support and adding interfaces to finish 
the logical plans. Releasing logical plans as we get them tested and working 
*is* the incremental solution. Appending to existing tables and having full 
read support is fine for now.


---




[GitHub] spark issue #21123: [SPARK-24045][SQL]Create base class for file data source...

2018-08-15 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21123
  
> this mapping is not mentioned in the logical plan standardization design 
doc and I doubt if it's doable

I agree! This is why I propose we add an entirely new API for v2 with clear 
behavior. This is also why I'm working on the logical plans and interfaces 
instead of the public API right now.

> I don't agree according to the fact that these new logical plans are not 
ready.

I don't think that wanting to use v2 before it is ready justifies releasing a path with unpredictable behavior that we don't intend to keep supporting. We should let the v1 write API continue to use v1 behavior and fix the behavior for v2.


---




[GitHub] spark pull request #22110: [SPARK-25122][SQL] Deduplication of supports equa...

2018-08-15 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22110#discussion_r210343741
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala 
---
@@ -73,4 +73,14 @@ object TypeUtils {
 }
 x.length - y.length
   }
+
+  /**
+   * Returns true if elements of the data type could be used as items of a 
hash set or as keys
--- End diff --

I think this comment is not very coherent with the method name. I think we can rephrase it to something like:
```
Returns true if the equals method of the elements of the data type is implemented properly.
This also means that they can be safely used in collections relying on the equals method,
such as sets or maps.
```
What do you think?


---



