[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...

2018-10-29 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22778#discussion_r229181821
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -171,10 +171,13 @@ abstract class Optimizer(sessionCatalog: SessionCatalog)
     // "Extract PythonUDF From JoinCondition".
     Batch("Check Cartesian Products", Once,
       CheckCartesianProducts) :+
-    Batch("RewriteSubquery", Once,
+    Batch("Rewrite Subquery", Once,
--- End diff --

@gatorsmile Do you think it's a good time to revisit Natt's PR to convert
subquery expressions to joins early in the optimization process? Perhaps then
we could take advantage of all the subsequent rules firing after the subquery
rewrite?


---




[GitHub] spark issue #22867: [SPARK-25778] WriteAheadLogBackedBlockRDD in YARN Cluste...

2018-10-29 Thread gss2002
Github user gss2002 commented on the issue:

https://github.com/apache/spark/pull/22867
  
@vanzin this seems to work. Not sure what your thoughts are on this:

  private val tmpDir = broadcastedHadoopConf.value.get("hadoop.tmp.dir",
    System.getProperty("java.io.tmpdir"))
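For context, a minimal sketch of how a conf-derived `tmpDir` like this would feed the dummy WAL directory (the `Configuration` stand-in and the wiring are illustrative, not the PR's code):

```
import java.util.UUID

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Stand-in for the broadcast Hadoop conf the RDD already carries.
val hadoopConf = new Configuration()

// Fall back to the JVM scratch dir when hadoop.tmp.dir is unset.
val tmpDir = hadoopConf.get("hadoop.tmp.dir", System.getProperty("java.io.tmpdir"))

// The WAL dummy directory only needs to be a path that does not exist;
// resolving it under hadoop.tmp.dir keeps it on the cluster's filesystem.
val nonExistentDirectory = new Path(tmpDir, UUID.randomUUID().toString).toString
```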


---




[GitHub] spark pull request #22778: [SPARK-25784][SQL] Infer filters from constraints...

2018-10-29 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22778#discussion_r229178084
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -171,10 +171,13 @@ abstract class Optimizer(sessionCatalog: SessionCatalog)
     // "Extract PythonUDF From JoinCondition".
     Batch("Check Cartesian Products", Once,
       CheckCartesianProducts) :+
-    Batch("RewriteSubquery", Once,
+    Batch("Rewrite Subquery", Once,
--- End diff --

I do not have a good answer for this PR. Ideally, we should run the whole
`operatorOptimizationBatch`. However, running those rules could be very
time consuming. I would suggest adding a new parameter that introduces a
time bound for each batch.

cc @maryannxue  WDYT?
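A minimal sketch of what such a time bound could look like, assuming a hypothetical `timeoutMs` parameter on the batch (the real `Batch`/`Strategy` API has no such knob today):

```
// Hypothetical sketch only: run a batch's rules to a fixed point, but stop
// once the batch exceeds its time budget. `timeoutMs` is an assumed
// parameter, not part of the actual Optimizer API.
def executeBatchWithBudget[T](rules: Seq[T => T], maxIterations: Int, timeoutMs: Long)(plan: T): T = {
  val deadlineNanos = System.nanoTime() + timeoutMs * 1000000L
  var current = plan
  var iteration = 0
  var reachedFixedPoint = false
  while (!reachedFixedPoint && iteration < maxIterations && System.nanoTime() < deadlineNanos) {
    val next = rules.foldLeft(current)((p, rule) => rule(p))
    reachedFixedPoint = next == current
    current = next
    iteration += 1
  }
  current
}
```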


---




[GitHub] spark issue #22867: [SPARK-25778] WriteAheadLogBackedBlockRDD in YARN Cluste...

2018-10-29 Thread gss2002
Github user gss2002 commented on the issue:

https://github.com/apache/spark/pull/22867
  
@vanzin I made the following change and it didn't work. How do you want to 
proceed?
--- a/streaming/src/main/scala/org/apache/spark/streaming/rdd/WriteAheadLogBackedBlockRDD.scala
+++ b/streaming/src/main/scala/org/apache/spark/streaming/rdd/WriteAheadLogBackedBlockRDD.scala
@@ -98,6 +98,8 @@ class WriteAheadLogBackedBlockRDD[T: ClassTag](
 
   override def isValid(): Boolean = true
 
+  private val tmpDir = "file:///" + System.getProperty("java.io.tmpdir")
+
   override def getPartitions: Array[Partition] = {
     assertValid()
     Array.tabulate(_blockIds.length) { i =>
@@ -136,7 +138,7 @@ class WriteAheadLogBackedBlockRDD[T: ClassTag](
     // this dummy directory should not already exist otherwise the WAL will try to recover
     // past events from the directory and throw errors.
     val nonExistentDirectory = new File(
-      System.getProperty("java.io.tmpdir"), UUID.randomUUID().toString).getAbsolutePath
+      tmpDir, UUID.randomUUID().toString).getAbsolutePath
     writeAheadLog = WriteAheadLogUtils.createLogForReceiver(
       SparkEnv.get.conf, nonExistentDirectory, hadoopConf)
     dataRead = writeAheadLog.read(partition.walRecordHandle)

Looks like it needs an HDFS path; the `file:` URI gets treated as a relative path on the DFS, as the trace below shows.
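One way around that might be to qualify the dummy directory against the filesystem the WAL actually uses, instead of prefixing `file:///` (an illustrative sketch only, not the patch under discussion; `walDir` is a hypothetical stand-in for the WAL's base directory):

```
import java.util.UUID

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val hadoopConf = new Configuration()
// Hypothetical: the base directory the WAL actually lives in.
val walDir = new Path("/tmp/spark-wal-example")
val fs = walDir.getFileSystem(hadoopConf)

// Qualifying against the WAL's own filesystem keeps the scheme consistent,
// so the dummy path is never re-resolved against the DFS working directory.
val nonExistentDirectory =
  fs.makeQualified(new Path(walDir, UUID.randomUUID().toString)).toString
```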

18/10/30 01:06:50 ERROR Executor: Exception in task 0.2 in stage 3.0 (TID 73)
org.apache.spark.SparkException: Could not read data from write ahead log record FileBasedWriteAheadLogSegment(hdfs://hdpmgmt1.hdp.senia.org:8020/user/sparkstreaming/sparkstreaming/Spark_Streaming_MQ/receivedData/0/log-1540875768007-1540875828007,0,988)
at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:147)
at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:175)
at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:175)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:175)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Pathname /hadoop/yarn/local/usercache/sparkstreaming/appcache/application_1540875581273_0002/container_e09_1540875581273_0002_01_02/file:/hadoop/yarn/local/usercache/sparkstreaming/appcache/application_1540875581273_0002/container_e09_1540875581273_0002_01_01/tmp/57b411fc-d4ed-4ac9-a32c-ecfc901dc29e from /hadoop/yarn/local/usercache/sparkstreaming/appcache/application_1540875581273_0002/container_e09_1540875581273_0002_01_02/file:/hadoop/yarn/local/usercache/sparkstreaming/appcache/application_1540875581273_0002/container_e09_1540875581273_0002_01_01/tmp/57b411fc-d4ed-4ac9-a32c-ecfc901dc29e is not a valid DFS filename.
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:197)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
at org.apache.spark.streaming.util.FileBasedWriteAheadLog.initializeOrRecover(FileBasedWriteAheadLog.scala:245)
at org.apache.spark.streaming.util.FileBasedWriteAheadLog.<init>(FileBasedWriteAheadLog.scala:80)
at org.apache.spark.streaming.util.WriteAheadLogUtils$$anonfun$2.apply(WriteAheadLogUtils.scala:142)
at org.apache.spark.streaming.util.WriteAheadLogUtils$$anonfun$2.apply(WriteAheadLogUtils.scala:142)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.streaming.util.WriteAheadLogUtils$.createLog(WriteAheadLogUtils.scala:141)
at ...

[GitHub] spark issue #22879: [SPARK-25872][SQL][TEST] Add an optimizer tracker for TP...

2018-10-29 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22879
  
For generating the golden files, I would not generate the whole plan.
Instead, we can keep just the node names. That should be good enough.
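Concretely, "keep the node names" might mean recording something like this in the golden files (a sketch, assuming the tracker captures per-query optimized plans):

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("plan-names").getOrCreate()

// Illustrative: record just the operator names of the optimized plan
// (e.g. "Project", "Filter", "Join") instead of the full plan string.
val nodeNames: Seq[String] =
  spark.sql("SELECT 1 AS a").queryExecution.optimizedPlan.collect { case n => n.nodeName }
```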


---




[GitHub] spark issue #22885: [BUILD][MINOR] release script should not interrupt by sv...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22885
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4619/
Test PASSed.


---




[GitHub] spark issue #22885: [BUILD][MINOR] release script should not interrupt by sv...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22885
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22857: [SPARK-25860][SQL] Replace Literal(null, _) with FalseLi...

2018-10-29 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22857
  
Please be really careful with null handling. It could easily introduce
correctness bugs like the ones we recently fixed.
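To illustrate the general hazard (a generic example, not one of the recently fixed bugs): null and false coincide only where an expression is consumed as a filter predicate, not where its value is observed.

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().master("local[1]").appName("null-vs-false").getOrCreate()

// As a filter condition, a NULL boolean behaves like FALSE: rows are dropped.
spark.range(1).where(expr("CAST(NULL AS BOOLEAN)")).count()   // 0

// As a value, NULL is not FALSE: this prints null, not true.
spark.sql("SELECT CAST(NULL AS BOOLEAN) = FALSE").show()
```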


---




[GitHub] spark issue #22885: [BUILD][MINOR] release script should not interrupt by sv...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22885
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98250/
Test FAILed.


---




[GitHub] spark issue #22885: [BUILD][MINOR] release script should not interrupt by sv...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22885
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #22885: [BUILD][MINOR] release script should not interrupt by sv...

2018-10-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22885
  
**[Test build #98250 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98250/testReport)** for PR 22885 at commit [`f7c63d5`](https://github.com/apache/spark/commit/f7c63d52948c3f7c6bc6aa8ebed5ce8684403101).
 * This patch **fails build dependency tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22885: [BUILD][MINOR] release script should not interrupt by sv...

2018-10-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22885
  
**[Test build #98250 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98250/testReport)** for PR 22885 at commit [`f7c63d5`](https://github.com/apache/spark/commit/f7c63d52948c3f7c6bc6aa8ebed5ce8684403101).


---




[GitHub] spark pull request #22885: [BUILD][MINOR] release script should not interrup...

2018-10-29 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/22885

[BUILD][MINOR] release script should not interrupt by svn

## What changes were proposed in this pull request?

When running the release script, you will be interrupted unexpectedly
```
ATTENTION!  Your password for authentication realm:

    ASF Committers

can only be stored to disk unencrypted!  You are advised to configure
your system so that Subversion can store passwords encrypted, if
possible.  See the documentation for details.

You can avoid future appearances of this warning by setting the value
of the 'store-plaintext-passwords' option to either 'yes' or 'no' in
'/home/spark-rm/.subversion/servers'.
---
Store password unencrypted (yes/no)?
```

We can avoid it by adding `--no-auth-cache` when running the svn commands.

## How was this patch tested?

manually verified with 2.4.0 RC5


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark svn

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22885.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22885


commit f7c63d52948c3f7c6bc6aa8ebed5ce8684403101
Author: Wenchen Fan 
Date:   2018-10-19T14:16:42Z

release script should not interrupt




---




[GitHub] spark issue #22885: [BUILD][MINOR] release script should not interrupt by sv...

2018-10-29 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22885
  
cc @vanzin @srowen @gatorsmile 


---




[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-10-29 Thread Gauravshah
Github user Gauravshah commented on the issue:

https://github.com/apache/spark/pull/21320
  
backported to 2.3.2 just in case somebody needs it. 
https://github.com/Gauravshah/spark/tree/branch-2.3_SPARK-4502 Thanks @mallman 


---




[GitHub] spark issue #22884: [SPARK-23429][CORE][FOLLOWUP] MetricGetter should rename...

2018-10-29 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/22884
  
good catch, lgtm


---




[GitHub] spark issue #22844: [SPARK-25847][SQL][TEST] Refactor JSONBenchmarks to use ...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22844
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22844: [SPARK-25847][SQL][TEST] Refactor JSONBenchmarks to use ...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22844
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98242/
Test PASSed.


---




[GitHub] spark issue #22881: [SPARK-25855][CORE] Don't use erasure coding for event l...

2018-10-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22881
  
**[Test build #98249 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98249/testReport)** for PR 22881 at commit [`bb12e39`](https://github.com/apache/spark/commit/bb12e390320b51d64b07159c013e79167d71db08).


---




[GitHub] spark issue #22844: [SPARK-25847][SQL][TEST] Refactor JSONBenchmarks to use ...

2018-10-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22844
  
**[Test build #98242 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98242/testReport)** for PR 22844 at commit [`2c73385`](https://github.com/apache/spark/commit/2c73385055a9a4357117f6ad6de53112e1f5497b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22881: [SPARK-25855][CORE] Don't use erasure coding for event l...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22881
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4618/
Test PASSed.


---




[GitHub] spark issue #22881: [SPARK-25855][CORE] Don't use erasure coding for event l...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22881
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22881: [SPARK-25855][CORE] Don't use erasure coding for event l...

2018-10-29 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/22881
  
thanks for the review @vanzin & @kiszk, updated


---




[GitHub] spark pull request #22881: [SPARK-25855][CORE] Don't use erasure coding for ...

2018-10-29 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/22881#discussion_r229172448
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala ---
@@ -471,4 +473,42 @@ object SparkHadoopUtil {
       hadoopConf.set(key.substring("spark.hadoop.".length), value)
     }
   }
+
+
+  lazy val builderReflection: Option[(Class[_], Method, Method)] = Try {
+    val cls = Utils.classForName(
+      "org.apache.hadoop.hdfs.DistributedFileSystem$HdfsDataOutputStreamBuilder")
+    (cls, cls.getMethod("replicate"), cls.getMethod("build"))
+  }.toOption
+
+  // scalastyle:off line.size.limit
+  /**
+   * Create a path that uses replication instead of erasure coding, regardless of the default
+   * configuration in hdfs for the given path.  This can be helpful as hdfs ec doesn't support
+   * hflush(), hsync(), or append()
+   * https://hadoop.apache.org/docs/r3.0.0/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html#Limitations
+   */
+  // scalastyle:on line.size.limit
+  def createNonECFile(fs: FileSystem, path: Path): FSDataOutputStream = {
+    try {
+      // Use reflection as this uses APIs only available in hadoop 3
+      val builderMethod = fs.getClass().getMethod("createFile", classOf[Path])
+      val builder = builderMethod.invoke(fs, path)
+      builderReflection match {
--- End diff --

Good point on the reflection; I was trying something else in earlier
experiments and didn't clean it up.

On poking into `DistributedFileSystem` -- @xiao-chen had similar concerns,
but also said it seemed there wasn't another option and it looked like an
oversight in the HDFS API. @steveloughran, maybe you have thoughts here as well?
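For readers following the thread, the truncated match above presumably completes along these lines (a sketch inferred from the quoted diff, not necessarily the final patch):

```
import java.lang.reflect.Method

import org.apache.hadoop.fs.{FSDataOutputStream, FileSystem, Path}

// On Hadoop 3, createFile() returns an HdfsDataOutputStreamBuilder whose
// replicate() disables erasure coding; on older Hadoop, fall back to create().
def createNonECFileSketch(
    fs: FileSystem,
    path: Path,
    builderReflection: Option[(Class[_], Method, Method)]): FSDataOutputStream = {
  try {
    val builderMethod = fs.getClass().getMethod("createFile", classOf[Path])
    val builder = builderMethod.invoke(fs, path)
    builderReflection match {
      case Some((cls, replicate, build)) if cls.isAssignableFrom(builder.getClass) =>
        replicate.invoke(builder)
        build.invoke(builder).asInstanceOf[FSDataOutputStream]
      case _ =>
        fs.create(path)
    }
  } catch {
    case _: NoSuchMethodException => fs.create(path)
  }
}
```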


---




[GitHub] spark issue #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks to use m...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22845
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98241/
Test PASSed.


---




[GitHub] spark issue #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks to use m...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22845
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks to use m...

2018-10-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22845
  
**[Test build #98241 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98241/testReport)** for PR 22845 at commit [`3c0eb0a`](https://github.com/apache/spark/commit/3c0eb0a0390c2dd602334b0f0e51d3e9f7c1bbec).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22884: [SPARK-23429][CORE][FOLLOWUP] MetricGetter should rename...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22884
  
Can one of the admins verify this patch?


---






[GitHub] spark pull request #22884: [SPARK-23429][CORE][FOLLOWUP] MetricGetter should...

2018-10-29 Thread LantaoJin
GitHub user LantaoJin opened a pull request:

https://github.com/apache/spark/pull/22884

[SPARK-23429][CORE][FOLLOWUP] MetricGetter should rename to 
ExecutorMetricType in comments

## What changes were proposed in this pull request?

MetricGetter should be renamed to ExecutorMetricType in comments.

## How was this patch tested?

Just comments, no need to test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/LantaoJin/spark SPARK-23429_FOLLOWUP

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22884.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22884


commit c0a224268e1c066c4062e6c8dfef0c5ae82d69eb
Author: LantaoJin 
Date:   2018-10-30T03:50:40Z

[SPARK-23429][CORE][FOLLOWUP] MetricGetter should rename to 
ExecutorMetricType in comments




---




[GitHub] spark issue #22857: [SPARK-25860][SQL] Replace Literal(null, _) with FalseLi...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22857
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98239/
Test PASSed.


---




[GitHub] spark issue #22857: [SPARK-25860][SQL] Replace Literal(null, _) with FalseLi...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22857
  
Build finished. Test PASSed.


---




[GitHub] spark issue #22857: [SPARK-25860][SQL] Replace Literal(null, _) with FalseLi...

2018-10-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22857
  
**[Test build #98239 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98239/testReport)** for PR 22857 at commit [`0eac890`](https://github.com/apache/spark/commit/0eac890be26cb0960e245b88fd771233d21850ee).
 * This patch passes all tests.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.


---




[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...

2018-10-29 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22857#discussion_r229165719
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---
@@ -736,3 +736,60 @@ object CombineConcats extends Rule[LogicalPlan] {
     flattenConcats(concat)
   }
 }
+
+/**
+ * A rule that replaces `Literal(null, _)` with `FalseLiteral` for further optimizations.
+ *
+ * This rule applies to conditions in [[Filter]] and [[Join]]. Moreover, it transforms predicates
+ * in all [[If]] expressions as well as branch conditions in all [[CaseWhen]] expressions.
+ *
+ * For example, `Filter(Literal(null, _))` is equal to `Filter(FalseLiteral)`.
+ *
+ * Another example containing branches is `Filter(If(cond, FalseLiteral, Literal(null, _)))`;
+ * this can be optimized to `Filter(If(cond, FalseLiteral, FalseLiteral))`, and eventually
+ * `Filter(FalseLiteral)`.
+ *
+ * As this rule is not limited to conditions in [[Filter]] and [[Join]], arbitrary plans can
+ * benefit from it. For example, `Project(If(And(cond, Literal(null)), Literal(1), Literal(2)))`
+ * can be simplified into `Project(Literal(2))`.
+ *
+ * As a result, many unnecessary computations can be removed in the query optimization phase.
+ */
+object ReplaceNullWithFalse extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case f @ Filter(cond, _) => f.copy(condition = replaceNullWithFalse(cond))
+    case j @ Join(_, _, _, Some(cond)) => j.copy(condition = Some(replaceNullWithFalse(cond)))
+    case p: LogicalPlan => p transformExpressions {
+      case i @ If(pred, _, _) => i.copy(predicate = replaceNullWithFalse(pred))
+      case cw @ CaseWhen(branches, _) =>
+        val newBranches = branches.map { case (cond, value) =>
+          replaceNullWithFalse(cond) -> value
+        }
+        cw.copy(branches = newBranches)
+    }
+  }
+
+  /**
+   * Recursively replaces `Literal(null, _)` with `FalseLiteral`.
+   *
+   * Note that `transformExpressionsDown` can not be used here as we must stop as soon as we hit
+   * an expression that is not [[CaseWhen]], [[If]], [[And]], [[Or]] or `Literal(null, _)`.
+   */
+  private def replaceNullWithFalse(e: Expression): Expression = e match {
+    case cw: CaseWhen if cw.dataType == BooleanType =>
--- End diff --

When `cw.dataType != BooleanType`, we can still do `replaceNullWithFalse(cond)`, can't we?


---




[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...

2018-10-29 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22857#discussion_r229165497
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---
@@ -736,3 +736,60 @@ object CombineConcats extends Rule[LogicalPlan] {
     flattenConcats(concat)
   }
 }
+
+/**
+ * A rule that replaces `Literal(null, _)` with `FalseLiteral` for further optimizations.
+ *
+ * This rule applies to conditions in [[Filter]] and [[Join]]. Moreover, it transforms predicates
+ * in all [[If]] expressions as well as branch conditions in all [[CaseWhen]] expressions.
+ *
+ * For example, `Filter(Literal(null, _))` is equal to `Filter(FalseLiteral)`.
+ *
+ * Another example containing branches is `Filter(If(cond, FalseLiteral, Literal(null, _)))`;
+ * this can be optimized to `Filter(If(cond, FalseLiteral, FalseLiteral))`, and eventually
+ * `Filter(FalseLiteral)`.
+ *
+ * As this rule is not limited to conditions in [[Filter]] and [[Join]], arbitrary plans can
+ * benefit from it. For example, `Project(If(And(cond, Literal(null)), Literal(1), Literal(2)))`
+ * can be simplified into `Project(Literal(2))`.
+ *
+ * As a result, many unnecessary computations can be removed in the query optimization phase.
+ */
+object ReplaceNullWithFalse extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case f @ Filter(cond, _) => f.copy(condition = replaceNullWithFalse(cond))
+    case j @ Join(_, _, _, Some(cond)) => j.copy(condition = Some(replaceNullWithFalse(cond)))
+    case p: LogicalPlan => p transformExpressions {
+      case i @ If(pred, _, _) => i.copy(predicate = replaceNullWithFalse(pred))
+      case cw @ CaseWhen(branches, _) =>
+        val newBranches = branches.map { case (cond, value) =>
+          replaceNullWithFalse(cond) -> value
+        }
+        cw.copy(branches = newBranches)
+    }
+  }
+
+  /**
+   * Recursively replaces `Literal(null, _)` with `FalseLiteral`.
+   *
+   * Note that `transformExpressionsDown` can not be used here as we must stop as soon as we hit
+   * an expression that is not [[CaseWhen]], [[If]], [[And]], [[Or]] or `Literal(null, _)`.
+   */
+  private def replaceNullWithFalse(e: Expression): Expression = e match {
+    case cw: CaseWhen if cw.dataType == BooleanType =>
+      val newBranches = cw.branches.map { case (cond, value) =>
+        replaceNullWithFalse(cond) -> replaceNullWithFalse(value)
+      }
+      val newElseValue = cw.elseValue.map(replaceNullWithFalse)
+      CaseWhen(newBranches, newElseValue)
+    case i @ If(pred, trueVal, falseVal) if i.dataType == BooleanType =>
+      If(replaceNullWithFalse(pred), replaceNullWithFalse(trueVal), replaceNullWithFalse(falseVal))
--- End diff --

When `i.dataType != BooleanType`, we can still do `replaceNullWithFalse(pred)`, can't we?
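In code, that suggestion would amount to something like this (a sketch covering only the `If` case; the other cases stay as in the PR):

```
import org.apache.spark.sql.catalyst.expressions.{Expression, If}
import org.apache.spark.sql.types.BooleanType

// The predicate of an If is always boolean, so it can be rewritten even
// when the If itself produces a non-boolean value.
def replaceNullWithFalse(e: Expression): Expression = e match {
  case i @ If(pred, trueVal, falseVal) if i.dataType == BooleanType =>
    If(replaceNullWithFalse(pred), replaceNullWithFalse(trueVal), replaceNullWithFalse(falseVal))
  case i @ If(pred, _, _) =>
    // Still safe to rewrite just the condition here.
    i.copy(predicate = replaceNullWithFalse(pred))
  case _ => e  // other cases as in the PR
}
```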


---




[GitHub] spark issue #22309: [SPARK-20384][SQL] Support value class in schema of Data...

2018-10-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22309
  
**[Test build #98248 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98248/testReport)** for PR 22309 at commit [`d70cac5`](https://github.com/apache/spark/commit/d70cac54981f9929c5c36f9e4f5c423eae5fd95f).


---




[GitHub] spark issue #16429: [SPARK-19019][PYTHON] Fix hijacked `collections.namedtup...

2018-10-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16429
  
This is fixed as of Spark 1.6.4, 2.0.3, 2.1.1 and 2.2.0.


---




[GitHub] spark issue #22878: [SPARK-25789][SQL] Support for Dataset of Avro

2018-10-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22878
  
Just quickly and roughly tested. The merge script looks like it only recognises
the main author of each commit in a PR. Let's just push a commit here.


---




[GitHub] spark issue #16429: [SPARK-19019][PYTHON] Fix hijacked `collections.namedtup...

2018-10-29 Thread jamal119
Github user jamal119 commented on the issue:

https://github.com/apache/spark/pull/16429
  
Hi @HyukjinKwon, have you solved your problem? My error is exactly the same
as yours. Looking forward to your reply.


---




[GitHub] spark issue #22878: [SPARK-25789][SQL] Support for Dataset of Avro

2018-10-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22878
  
I wonder if that can be handled by the merge script, though. I think it's okay
just to pick up some commits there and rebase them onto this branch, even if
they are empty commits. That makes it easier for committers to set the primary
author when it's merged.


---




[GitHub] spark issue #22878: [SPARK-25789][SQL] Support for Dataset of Avro

2018-10-29 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/22878
  
@xuanyuanking, thanks for the work!
You can try editing the previous commit message (see
https://help.github.com/articles/creating-a-commit-with-multiple-authors/) and
then `push -f`.
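For reference, that help page boils down to adding a trailer at the end of the commit message, one line per co-author (the name and address here are placeholders):

```
Co-authored-by: NAME <name@example.com>
```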


---




[GitHub] spark issue #22861: [SPARK-25663][SPARK-25661][SQL][TEST] Refactor BuiltInDa...

2018-10-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22861
  
**[Test build #98247 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98247/testReport)** for PR 22861 at commit [`93ca3be`](https://github.com/apache/spark/commit/93ca3be2d860516141ff74188d775c17c3afce78).


---




[GitHub] spark pull request #22861: [SPARK-25663][SPARK-25661][SQL][TEST] Refactor Bu...

2018-10-29 Thread yucai
Github user yucai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22861#discussion_r229162191
  
--- Diff: external/avro/src/test/scala/org/apache/spark/sql/execution/benchmark/AvroWriteBenchmark.scala ---
@@ -19,22 +19,17 @@ package org.apache.spark.sql.execution.benchmark
 
 /**
  * Benchmark to measure Avro data sources write performance.
- * Usage:
- * 1. with spark-submit: bin/spark-submit --class  
- * 2. with sbt: build/sbt "avro/test:runMain "
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class 
--- End diff --

@yucai Good catch! I think it needs ``, added.


---




[GitHub] spark issue #22778: [SPARK-25784][SQL] Infer filters from constraints after ...

2018-10-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22778
  
**[Test build #98246 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98246/testReport)** for PR 22778 at commit [`db519c3`](https://github.com/apache/spark/commit/db519c3f0241cb13880665eca934e6d6e34e6fd7).


---




[GitHub] spark issue #22778: [SPARK-25784][SQL] Infer filters from constraints after ...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22778
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22778: [SPARK-25784][SQL] Infer filters from constraints after ...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22778
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4617/
Test PASSed.


---




[GitHub] spark issue #22857: [SPARK-25860][SQL] Replace Literal(null, _) with FalseLi...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22857
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98240/
Test PASSed.


---




[GitHub] spark issue #22857: [SPARK-25860][SQL] Replace Literal(null, _) with FalseLi...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22857
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22857: [SPARK-25860][SQL] Replace Literal(null, _) with FalseLi...

2018-10-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22857
  
**[Test build #98240 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98240/testReport)** for PR 22857 at commit [`4c35955`](https://github.com/apache/spark/commit/4c35955582d3e38722702049780c3c9946a695da).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22861: [SPARK-25663][SPARK-25661][SQL][TEST] Refactor BuiltInDa...

2018-10-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22861
  
**[Test build #98245 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98245/testReport)** for PR 22861 at commit [`f06b2ac`](https://github.com/apache/spark/commit/f06b2acd93a87e64a955b34cbf4fd2ca7d5faf00).


---




[GitHub] spark issue #22878: [SPARK-25789][SQL] Support for Dataset of Avro

2018-10-29 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/22878
  
also cc @bdrillard, link this to #21348.


---




[GitHub] spark issue #22878: [SPARK-25789][SQL] Support for Dataset of Avro

2018-10-29 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/22878
  
@dongjoon-hyun Thanks for your comment; let me see how to achieve this, since
@bdrillard's commits are based on databricks/spark-avro.


---




[GitHub] spark issue #22624: [SPARK-23781][CORE] Merge token renewer functionality in...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22624
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22624: [SPARK-23781][CORE] Merge token renewer functionality in...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22624
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98238/
Test PASSed.


---




[GitHub] spark issue #22624: [SPARK-23781][CORE] Merge token renewer functionality in...

2018-10-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22624
  
**[Test build #98238 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98238/testReport)** for PR 22624 at commit [`88fe6cb`](https://github.com/apache/spark/commit/88fe6cb9105d140d5ebb250baee2623bb07650ce).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22878: [SPARK-25789][SQL] Support for Dataset of Avro

2018-10-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22878
  
**[Test build #98244 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98244/testReport)** for PR 22878 at commit [`697813a`](https://github.com/apache/spark/commit/697813af447fb7cf6c7a2953a7d7eb5f49f5cdb5).


---




[GitHub] spark issue #22878: [SPARK-25789][SQL] Support for Dataset of Avro

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22878
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22878: [SPARK-25789][SQL] Support for Dataset of Avro

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22878
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4616/
Test PASSed.


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-29 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r229156708
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala ---
@@ -128,6 +128,15 @@ object ScalaReflection extends ScalaReflection {
     case _ => false
   }
 
+  def isValueClass(tpe: `Type`): Boolean = {
+    tpe.typeSymbol.asClass.isDerivedValueClass
+  }
+
+  /** Returns the name and type of the underlying parameter of value class `tpe`. */
+  def getUnderlyingParameterOf(tpe: `Type`): (String, Type) = {
+    getConstructorParameters(tpe).head
--- End diff --

Not sure; I can't find any.
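As a quick illustration of what the helpers above would report for a value class (hypothetical example type, not from the PR):

```
import scala.reflect.runtime.universe._

// A value class: a single val parameter, extending AnyVal.
final case class UserId(value: Long) extends AnyVal

object ValueClassDemo {
  def main(args: Array[String]): Unit = {
    val tpe = typeOf[UserId]
    // isValueClass(tpe) would be true, and getUnderlyingParameterOf(tpe)
    // would return ("value", typeOf[Long]).
    println(tpe.typeSymbol.asClass.isDerivedValueClass)  // true
  }
}
```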


---




[GitHub] spark issue #22666: [SPARK-25672][SQL] schema_of_csv() - schema inference fr...

2018-10-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22666
  
Thanks, @cloud-fan. The change looks good to me from my side. Let me take
another look at this and leave a sign-off (which means a sign-off for
@MaxGekk's code changes).


---




[GitHub] spark issue #21688: [SPARK-21809] : Change Stage Page to use datatables to s...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21688
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98236/
Test PASSed.


---




[GitHub] spark issue #22309: [SPARK-20384][SQL] Support value class in schema of Data...

2018-10-29 Thread mt40
Github user mt40 commented on the issue:

https://github.com/apache/spark/pull/22309
  
@cloud-fan while what you said is true, I think it's not that bad. Overall,
the only major special logic is inside the case class `case`, and this is due to
the special nature of value classes.
On the other hand, with value class support, unit testing Spark code with
[ScalaCheck](https://github.com/rickynils/scalacheck) becomes much easier,
which is also the main reason I'm trying to implement this patch.


---




[GitHub] spark issue #21688: [SPARK-21809] : Change Stage Page to use datatables to s...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21688
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21688: [SPARK-21809] : Change Stage Page to use datatables to s...

2018-10-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21688
  
**[Test build #98236 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98236/testReport)** for PR 21688 at commit [`0ed3309`](https://github.com/apache/spark/commit/0ed3309debeb73575e5fdb1b342ce70a8b05f9db).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-29 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22868#discussion_r229156006
  
--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -51,6 +51,22 @@ Spark SQL supports the vast majority of Hive features, such as:
 * Explain
 * Partitioned tables including dynamic partition insertion
 * View
+  * If column aliases are not specified in view definition queries, both Spark and Hive will
+    generate alias names, but in different ways. In order for Spark to be able to read views created
+    by Hive, users should explicitly specify column aliases in view definition queries. As an
+    example, Spark cannot read `v1` created as below by Hive.
+
+    ```
+    CREATE TABLE t1 (c1 INT, c2 STRING);
+    CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
+    ```
+
+    Instead, you should create `v1` as below with column aliases explicitly specified.
+
+    ```
+    CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1 AS inc_c1, upper(c2) AS upper_c2 FROM t1) t2;
--- End diff --

Sure, updated as well.


---




[GitHub] spark pull request #22755: [SPARK-25755][SQL][Test] Supplementation of non-C...

2018-10-29 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22755#discussion_r229155463
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -669,22 +669,20 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext {
     }
   }
 
-  Seq(true, false).foreach { codegen =>
+  withWholeStageCodegenOnAndOff { codegenEnabled =>
     test("SPARK-22951: dropDuplicates on empty dataFrames should produce correct aggregate " +
-      s"results when codegen is enabled: $codegen") {
-      withSQLConf((SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, codegen.toString)) {
-        // explicit global aggregations
-        val emptyAgg = Map.empty[String, String]
-        checkAnswer(spark.emptyDataFrame.agg(emptyAgg), Seq(Row()))
-        checkAnswer(spark.emptyDataFrame.groupBy().agg(emptyAgg), Seq(Row()))
-        checkAnswer(spark.emptyDataFrame.groupBy().agg(count("*")), Seq(Row(0)))
-        checkAnswer(spark.emptyDataFrame.dropDuplicates().agg(emptyAgg), Seq(Row()))
-        checkAnswer(spark.emptyDataFrame.dropDuplicates().groupBy().agg(emptyAgg), Seq(Row()))
-        checkAnswer(spark.emptyDataFrame.dropDuplicates().groupBy().agg(count("*")), Seq(Row(0)))
-
-        // global aggregation is converted to grouping aggregation:
-        assert(spark.emptyDataFrame.dropDuplicates().count() == 0)
-      }
+      s"results when codegen is enabled: $codegenEnabled") {
--- End diff --

we can add the test name postfix in `withWholeStageCodegenOnAndOff`, so
that the caller side only needs to provide a base name, e.g.
```
def withWholeStageCodegenOnAndOff(testName: String)(f: String => Unit): Unit = {
  Seq("false", "true").foreach { enabled =>
    test(s"$testName (whole-stage-codegen ${if (enabled == "true") "on" else "off"})") {
      f(enabled)
    }
  }
}
```
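A call site would then shrink to something like this (a sketch; whether `withSQLConf` stays in the body or moves into the helper is an open wiring choice):
```
withWholeStageCodegenOnAndOff("SPARK-22951: dropDuplicates on empty dataFrames " +
    "should produce correct aggregate results") { codegenEnabled =>
  withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> codegenEnabled) {
    checkAnswer(spark.emptyDataFrame.dropDuplicates().agg(Map.empty[String, String]), Seq(Row()))
  }
}
```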


---




[GitHub] spark pull request #22881: [SPARK-25855][CORE] Don't use erasure coding for ...

2018-10-29 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22881#discussion_r229155491
  
--- Diff: docs/configuration.md ---
@@ -761,6 +761,17 @@ Apart from these, the following properties are also available, and may be useful
     Compression will use <code>spark.io.compression.codec</code>.
   </td>
 </tr>
+<tr>
+  <td><code>spark.eventLog.allowErasureCoding</code></td>
+  <td>false</td>
+  <td>
+    Whether to allow event logs to use erasure coding, or turn erasure coding off, regardless of
+    filesystem defaults.  On HDFS, erasure coded files will not update as quickly as regular
+    replicated files, so application updates will take longer to appear in the History Server.
+    Note that even if this is true, spark will still not force the file to erasure coding, it will
--- End diff --

nit: `to erasure coding` -> `to use erasure coding`?


---




[GitHub] spark pull request #22755: [SPARK-25755][SQL][Test] Supplementation of non-C...

2018-10-29 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22755#discussion_r229154981
  
--- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
@@ -459,6 +459,7 @@ private[spark] class Executor(
       threadMXBean.getCurrentThreadCpuTime
     } else 0L
 
+
--- End diff --

unnecessary change


---




[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-29 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22868#discussion_r229155030
  
--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -53,7 +53,20 @@ Spark SQL supports the vast majority of Hive features, such as:
 * View
   * If column aliases are not specified in view definition queries, both Spark and Hive will
     generate alias names, but in different ways. In order for Spark to be able to read views created
-    by Hive, users should explicitly specify column aliases in view definition queries.
+    by Hive, users should explicitly specify column aliases in view definition queries. As an
+    example, Spark cannot read `v1` created as below by Hive.
+
+    ```
+    CREATE TABLE t1 (c1 INT, c2 STRING);
+    CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
--- End diff --

It seems Hive 1.x does not allow `(` following `CREATE VIEW ... AS`, while
Hive 2.x works fine. The following works on Hive 1.2.1, 1.2.2 and 2.3.3.

```
CREATE VIEW v1 AS SELECT c1 + 1, upper(c2) FROM t1;
```

Another finding: the above view is readable by Spark, though the view
column names are weird (`_c0`, `_c1`), because Spark will add a `Project`
between `View` and the view definition query if their output attributes do not
match.

```
spark-sql> explain extended v1;
...
== Analyzed Logical Plan ==
_c0: int, _c1: string
Project [_c0#44, _c1#45]
+- SubqueryAlias v1
   +- View (`default`.`v1`, [_c0#44,_c1#45])
      +- Project [cast((c1 + 1)#48 as int) AS _c0#44, cast(upper(c2)#49 as string) AS _c1#45] // this is added by AliasViewChild rule
         +- Project [(c1#46 + 1) AS (c1 + 1)#48, upper(c2#47) AS upper(c2)#49]
            +- SubqueryAlias t1
               +- HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#46, c2#47]
...
```

But, if column aliases in subqueries of the view definition query are 
missing, Spark will not be able to read the view.


---




[GitHub] spark pull request #22881: [SPARK-25855][CORE] Don't use erasure coding for ...

2018-10-29 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22881#discussion_r229154733
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala ---
@@ -471,4 +473,42 @@ object SparkHadoopUtil {
       hadoopConf.set(key.substring("spark.hadoop.".length), value)
     }
   }
+
+
+  lazy val builderReflection: Option[(Class[_], Method, Method)] = Try {
+    val cls = Utils.classForName(
+      "org.apache.hadoop.hdfs.DistributedFileSystem$HdfsDataOutputStreamBuilder")
+    (cls, cls.getMethod("replicate"), cls.getMethod("build"))
+  }.toOption
+
+  // scalastyle:off line.size.limit
+  /**
+   * Create a path that uses replication instead of erasure coding, regardless of the default
+   * configuration in hdfs for the given path.  This can be helpful as hdfs ec doesn't support
--- End diff --

nit: `ec` -> `erasure coding`


---




[GitHub] spark pull request #22666: [SPARK-25672][SQL] schema_of_csv() - schema infer...

2018-10-29 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22666#discussion_r229153592
  
--- Diff: sql/core/src/test/resources/sql-tests/inputs/csv-functions.sql ---
@@ -7,3 +7,11 @@ select from_csv('1', 'a InvalidType');
 select from_csv('1', 'a INT', named_struct('mode', 'PERMISSIVE'));
 select from_csv('1', 'a INT', map('mode', 1));
 select from_csv();
+-- infer schema of csv literal
+select from_csv('1,abc', schema_of_csv('1,abc'));
+select schema_of_csv('1|abc', map('delimiter', '|'));
+select schema_of_csv(null);
+CREATE TEMPORARY VIEW csvTable(csvField, a) AS SELECT * FROM VALUES ('1,abc', 'a');
+SELECT schema_of_csv(csvField) FROM csvTable;
+-- Clean up
+DROP VIEW IF EXISTS csvTable;
--- End diff --

Yea, we need to clean up tables, as they are permanent.

Actually I'm fine with it, as we clean up temp views in a lot of golden
files. We can have another PR to remove these temp view clean-ups.


---




[GitHub] spark issue #22868: [SPARK-25833][SQL][DOCS] Update migration guide for Hive...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22868
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98243/
Test PASSed.


---




[GitHub] spark issue #22868: [SPARK-25833][SQL][DOCS] Update migration guide for Hive...

2018-10-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22868
  
**[Test build #98243 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98243/testReport)**
 for PR 22868 at commit 
[`26b4dee`](https://github.com/apache/spark/commit/26b4dee5cd8e4e5fe7e1a34543bad17e05b9b783).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22868: [SPARK-25833][SQL][DOCS] Update migration guide for Hive...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22868
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...

2018-10-29 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22857#discussion_r229151278
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -736,3 +736,60 @@ object CombineConcats extends Rule[LogicalPlan] {
   flattenConcats(concat)
   }
 }
+
+/**
+ * A rule that replaces `Literal(null, _)` with `FalseLiteral` for further 
optimizations.
+ *
+ * This rule applies to conditions in [[Filter]] and [[Join]]. Moreover, 
it transforms predicates
+ * in all [[If]] expressions as well as branch conditions in all 
[[CaseWhen]] expressions.
+ *
+ * For example, `Filter(Literal(null, _))` is equal to 
`Filter(FalseLiteral)`.
+ *
+ * Another example containing branches is `Filter(If(cond, FalseLiteral, 
Literal(null, _)))`;
+ * this can be optimized to `Filter(If(cond, FalseLiteral, 
FalseLiteral))`, and eventually
+ * `Filter(FalseLiteral)`.
+ *
+ * As this rule is not limited to conditions in [[Filter]] and [[Join]], 
arbitrary plans can
+ * benefit from it. For example, `Project(If(And(cond, Literal(null)), 
Literal(1), Literal(2)))`
+ * can be simplified into `Project(Literal(2))`.
+ *
+ * As a result, many unnecessary computations can be removed in the query 
optimization phase.
+ */
+object ReplaceNullWithFalse extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+case f @ Filter(cond, _) => f.copy(condition = 
replaceNullWithFalse(cond))
+case j @ Join(_, _, _, Some(cond)) => j.copy(condition = 
Some(replaceNullWithFalse(cond)))
+case p: LogicalPlan => p transformExpressions {
+  case i @ If(pred, _, _) => i.copy(predicate = 
replaceNullWithFalse(pred))
+  case cw @ CaseWhen(branches, _) =>
+val newBranches = branches.map { case (cond, value) =>
+  replaceNullWithFalse(cond) -> value
+}
+cw.copy(branches = newBranches)
+}
+  }
+
+  /**
+   * Recursively replaces `Literal(null, _)` with `FalseLiteral`.
+   *
+   * Note that `transformExpressionsDown` can not be used here as we must 
stop as soon as we hit
+   * an expression that is not [[CaseWhen]], [[If]], [[And]], [[Or]] or 
`Literal(null, _)`.
--- End diff --

Can we make it more general? I think the expected expression is:
1. It's `NullIntolerant`. If any child is null, it will be null.
2. It has a null child.

so I would write something like
```
case f @ Filter(cond, _) if alwaysNull(cond) => f.copy(condition = FalseLiteral)
...

def alwaysNull(e: Expression): Boolean = e match {
  case Literal(null, _) => true
  case n: NullIntolerant => n.children.exists(alwaysNull)
  case _ => false
}

```
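
A minimal, self-contained sketch (plain Scala, not Spark's real Catalyst classes) of the null-propagation check suggested above: a null-intolerant expression with an always-null child is itself always null, so a `Filter` over it can never keep a row. All names here are illustrative.

```
sealed trait Expr { def children: Seq[Expr] }
case class Lit(value: Any) extends Expr { val children: Seq[Expr] = Nil }
// Comparisons are null-intolerant: a null input yields a null output.
case class GreaterThan(left: Expr, right: Expr) extends Expr {
  val children: Seq[Expr] = Seq(left, right)
}
// IsNull is NOT null-intolerant: a null child evaluates to true, not null.
case class IsNull(child: Expr) extends Expr { val children: Seq[Expr] = Seq(child) }

def alwaysNull(e: Expr): Boolean = e match {
  case Lit(null)      => true
  case _: GreaterThan => e.children.exists(alwaysNull)
  case _              => false
}

// GreaterThan(1, null) is guaranteed null; IsNull(null) is not.
assert(alwaysNull(GreaterThan(Lit(1), Lit(null))))
assert(!alwaysNull(IsNull(Lit(null))))
```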


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22504: [SPARK-25118][Submit] Persist Driver Logs in Client mode...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22504
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22504: [SPARK-25118][Submit] Persist Driver Logs in Client mode...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22504
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98234/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...

2018-10-29 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22857#discussion_r229150341
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -736,3 +736,65 @@ object CombineConcats extends Rule[LogicalPlan] {
   flattenConcats(concat)
   }
 }
+
+/**
+ * A rule that replaces `Literal(null, _)` with `FalseLiteral` for further 
optimizations.
+ *
+ * For example, `Filter(Literal(null, _))` is equal to 
`Filter(FalseLiteral)`.
+ *
+ * Another example containing branches is `Filter(If(cond, FalseLiteral, 
Literal(null, _)))`;
+ * this can be optimized to `Filter(If(cond, FalseLiteral, 
FalseLiteral))`, and eventually
+ * `Filter(FalseLiteral)`.
+ *
+ * As a result, many unnecessary computations can be removed in the query 
optimization phase.
+ *
+ * Similarly, the same logic can be applied to conditions in [[Join]], 
predicates in [[If]],
+ * conditions in [[CaseWhen]].
+ */
+object ReplaceNullWithFalse extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+case f @ Filter(cond, _) => f.copy(condition = 
replaceNullWithFalse(cond))
+case j @ Join(_, _, _, Some(cond)) => j.copy(condition = 
Some(replaceNullWithFalse(cond)))
+case p: LogicalPlan => p transformExpressions {
+  case i @ If(pred, _, _) => i.copy(predicate = 
replaceNullWithFalse(pred))
+  case CaseWhen(branches, elseValue) =>
+val newBranches = branches.map { case (cond, value) =>
+  replaceNullWithFalse(cond) -> value
+}
+CaseWhen(newBranches, elseValue)
+}
+  }
+
+  /**
+   * Recursively replaces `Literal(null, _)` with `FalseLiteral`.
+   *
+   * Note that `transformExpressionsDown` can not be used here as we must 
stop as soon as we hit
+   * an expression that is not [[CaseWhen]], [[If]], [[And]], [[Or]] or 
`Literal(null, _)`.
+   */
+  private def replaceNullWithFalse(e: Expression): Expression = e match {
+case cw: CaseWhen if getValues(cw).forall(isNullOrBoolean) =>
+  val newBranches = cw.branches.map { case (cond, value) =>
+replaceNullWithFalse(cond) -> replaceNullWithFalse(value)
+  }
+  val newElseValue = cw.elseValue.map(replaceNullWithFalse)
+  CaseWhen(newBranches, newElseValue)
+case If(pred, trueVal, falseVal) if Seq(trueVal, 
falseVal).forall(isNullOrBoolean) =>
+  If(replaceNullWithFalse(pred), replaceNullWithFalse(trueVal), 
replaceNullWithFalse(falseVal))
+case And(left, right) =>
--- End diff --

I don't have a particular case, this is just to double check that these 
corner cases are considered. I think we are fine now :)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22504: [SPARK-25118][Submit] Persist Driver Logs in Client mode...

2018-10-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22504
  
**[Test build #98234 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98234/testReport)**
 for PR 22504 at commit 
[`426beb0`](https://github.com/apache/spark/commit/426beb0f40f69ac1d61915cb2f479300f37c8278).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-29 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22868#discussion_r229150147
  
--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -51,6 +51,22 @@ Spark SQL supports the vast majority of Hive features, 
such as:
 * Explain
 * Partitioned tables including dynamic partition insertion
 * View
+  * If column aliases are not specified in view definition queries, both 
Spark and Hive will
+generate alias names, but in different ways. In order for Spark to be 
able to read views created
+by Hive, users should explicitly specify column aliases in view 
definition queries. As an
+example, Spark cannot read `v1` created as below by Hive.
+
+```
+CREATE TABLE t1 (c1 INT, c2 STRING);
+CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
--- End diff --

Good ideas. I have simplified the example and tested it using Hive 2.3.3 and Spark 2.3.1.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22868: [SPARK-25833][SQL][DOCS] Update migration guide for Hive...

2018-10-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22868
  
**[Test build #98243 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98243/testReport)**
 for PR 22868 at commit 
[`26b4dee`](https://github.com/apache/spark/commit/26b4dee5cd8e4e5fe7e1a34543bad17e05b9b783).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...

2018-10-29 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22857#discussion_r229150101
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala 
---
@@ -2578,4 +2578,45 @@ class DataFrameSuite extends QueryTest with 
SharedSQLContext {
 Row ("abc", 1))
 }
   }
+
+  test("SPARK-25860: Replace Literal(null, _) with FalseLiteral whenever 
possible") {
+
+def checkPlanIsEmptyLocalScan(df: DataFrame): Unit = 
df.queryExecution.executedPlan match {
--- End diff --

yea we have. Take a look at `TestHive`, and we did something similar before
```
// Disable ConvertToLocalRelation for better test coverage. Test cases 
built on
// LocalRelation will exercise the optimization rules better by disabling 
it as
// this rule may potentially block testing of other optimization rules such 
as
// ConstantPropagation etc.
.set(SQLConf.OPTIMIZER_EXCLUDED_RULES.key, 
ConvertToLocalRelation.ruleName)))
```
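
A minimal sketch of using the same knob from a user session instead of `TestHive`; this assumes a Spark version where `spark.sql.optimizer.excludedRules` (the key behind `SQLConf.OPTIMIZER_EXCLUDED_RULES`) exists and a SparkSession named `spark`:

```
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation")
```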


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22877: [MINOR][SQL] Avoid hardcoded configuration keys i...

2018-10-29 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22877#discussion_r229148363
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -408,15 +408,16 @@ object SQLConf {
 
   val PARQUET_FILTER_PUSHDOWN_DATE_ENABLED = 
buildConf("spark.sql.parquet.filterPushdown.date")
 .doc("If true, enables Parquet filter push-down optimization for Date. 
" +
-  "This configuration only has an effect when 
'spark.sql.parquet.filterPushdown' is enabled.")
+  s"This configuration only has an effect when 
'${PARQUET_FILTER_PUSHDOWN_ENABLED.key}' is " +
--- End diff --

Got it, thanks


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22844: [SPARK-25847][SQL][TEST] Refactor JSONBenchmarks to use ...

2018-10-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22844
  
**[Test build #98242 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98242/testReport)**
 for PR 22844 at commit 
[`2c73385`](https://github.com/apache/spark/commit/2c73385055a9a4357117f6ad6de53112e1f5497b).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks to use m...

2018-10-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22845
  
**[Test build #98241 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98241/testReport)**
 for PR 22845 at commit 
[`3c0eb0a`](https://github.com/apache/spark/commit/3c0eb0a0390c2dd602334b0f0e51d3e9f7c1bbec).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22882: [SPARK-25871][STREAMING][WIP] Don't use EC for streaming...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22882
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22882: [SPARK-25871][STREAMING][WIP] Don't use EC for streaming...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22882
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98229/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22882: [SPARK-25871][STREAMING][WIP] Don't use EC for streaming...

2018-10-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22882
  
**[Test build #98229 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98229/testReport)**
 for PR 22882 at commit 
[`98204e6`](https://github.com/apache/spark/commit/98204e6bcb840f1a47e1a3bd73da5fd7c9b22bcd).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22867: [SPARK-25778] WriteAheadLogBackedBlockRDD in YARN Cluste...

2018-10-29 Thread gss2002
Github user gss2002 commented on the issue:

https://github.com/apache/spark/pull/22867
  
@vanzin trying this.. I'll advise shortly

  private val tmpDir = "file:///" + System.getProperty("java.io.tmpdir")
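
A minimal sketch of the idea under test: give the dummy WAL directory an explicit `file:` URI so it is resolved against the local filesystem rather than the default (HDFS) one. Variable names are illustrative.

```
import java.io.File
import java.util.UUID

val nonExistentDirectory = new File(
  System.getProperty("java.io.tmpdir"), UUID.randomUUID().toString)
  .toURI.toString  // e.g. file:/tmp/<uuid>, never resolved against HDFS
```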



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22877: [MINOR][SQL] Avoid hardcoded configuration keys i...

2018-10-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22877


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22877: [MINOR][SQL] Avoid hardcoded configuration keys in SQLCo...

2018-10-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22877
  
Thanks, @kiszk and @dongjoon-hyun 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22146: [SPARK-24434][K8S] pod template files

2018-10-29 Thread erikerlandson
Github user erikerlandson commented on the issue:

https://github.com/apache/spark/pull/22146
  
@mccheah integration testing is passing with the latest container selection 
policy, good to merge?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22867: [SPARK-25778] WriteAheadLogBackedBlockRDD in YARN Cluste...

2018-10-29 Thread gss2002
Github user gss2002 commented on the issue:

https://github.com/apache/spark/pull/22867
  
 History from JIRA and error

WriteAheadLogBackedBlockRDD in YARN cluster mode fails due to lack of access 
to an HDFS path, because it uses a name similar to the $PWD folder of the 
YARN AM in cluster mode.
> While attempting to use Spark Streaming and WriteAheadLogs, I noticed the 
following errors
after the driver attempted to recover the already-read data that was being 
written to HDFS
in the checkpoint folder. After spending many hours looking at the cause of 
the error below,
due to the fact that the parent folder /hadoop exists in our HDFS FS, I wonder
if it's possible to make an option configurable to choose an alternate bogus 
directory that
will never be used.
> hadoop fs -ls /
> drwx--   - dsadm dsadm   0 2017-06-20 13:20 /hadoop
> hadoop fs -ls /hadoop/apps
> drwx--   - dsadm dsadm  0 2017-06-20 13:20 /hadoop/apps
> 
streaming/src/main/scala/org/apache/spark/streaming/rdd/WriteAheadLogBackedBlockRDD.scala
>   val nonExistentDirectory = new File(
>   System.getProperty("java.io.tmpdir"), 
UUID.randomUUID().toString).getAbsolutePath
> writeAheadLog = WriteAheadLogUtils.createLogForReceiver(
>   SparkEnv.get.conf, nonExistentDirectory, hadoopConf)
> dataRead = writeAheadLog.read(partition.walRecordHandle)
> 18/10/19 00:03:03 DEBUG YarnSchedulerBackend$YarnDriverEndpoint: 
Launching task 72 on
executor id: 1 hostname: ha20t5002dn.tech.hdp.example.com.
> 18/10/19 00:03:03 DEBUG BlockManager: Getting local block 
broadcast_4_piece0 as bytes
> 18/10/19 00:03:03 DEBUG BlockManager: Level for block broadcast_4_piece0 
is StorageLevel(disk,
memory, 1 replicas)
> 18/10/19 00:03:03 INFO BlockManagerInfo: Added broadcast_4_piece0 in 
memory on ha20t5002dn.tech.hdp.example.com:32768
(size: 33.7 KB, free: 912.2 MB)
> 18/10/19 00:03:03 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 
71, ha20t5002dn.tech.hdp.example.com,
executor 1): org.apache.spark.SparkException: Could not read data from 
write ahead log record

FileBasedWriteAheadLogSegment(hdfs://tech/user/hdpdevspark/sparkstreaming/Spark_Streaming_MQ_IDMS/receivedData/0/log-1539921695606-1539921755606,0,1017)
>   at 
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:145)
>   at 
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
>   at 
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:173)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.security.AccessControlException: Permission 
denied: user=hdpdevspark,
access=EXECUTE, 
inode="/hadoop/diskc/hadoop/yarn/local/usercache/hdpdevspark/appcache/application_1539554105597_0338/container_e322_1539554105597_0338_01_02/tmp/170f36b8-9202-4556-89a4-64587c7136b6":dsadm:dsadm:drwx--
>   at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
>   at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
>   at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
>   at 
org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer$RangerAccessControlEnforcer.checkPermission(RangerHdfsAuthorizer.java:307)
>   at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
>   at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1827)
>   at 
org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getFileInfo(FSDirStatAndListingOp.java:108)
>   at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3972)
>   at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1130)
>   at 

[GitHub] spark issue #22877: [MINOR][SQL] Avoid hardcoded configuration keys in SQLCo...

2018-10-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22877
  
Merged to master


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...

2018-10-29 Thread aokolnychyi
Github user aokolnychyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/22857#discussion_r229133793
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala 
---
@@ -2578,4 +2578,45 @@ class DataFrameSuite extends QueryTest with 
SharedSQLContext {
 Row ("abc", 1))
 }
   }
+
+  test("SPARK-25860: Replace Literal(null, _) with FalseLiteral whenever 
possible") {
+
+def checkPlanIsEmptyLocalScan(df: DataFrame): Unit = 
df.queryExecution.executedPlan match {
--- End diff --

Do we actually have a way to enable/disable `ConvertToLocalRelation`? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...

2018-10-29 Thread aokolnychyi
Github user aokolnychyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/22857#discussion_r229133550
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -736,3 +736,65 @@ object CombineConcats extends Rule[LogicalPlan] {
   flattenConcats(concat)
   }
 }
+
+/**
+ * A rule that replaces `Literal(null, _)` with `FalseLiteral` for further 
optimizations.
+ *
+ * For example, `Filter(Literal(null, _))` is equal to 
`Filter(FalseLiteral)`.
+ *
+ * Another example containing branches is `Filter(If(cond, FalseLiteral, 
Literal(null, _)))`;
+ * this can be optimized to `Filter(If(cond, FalseLiteral, 
FalseLiteral))`, and eventually
+ * `Filter(FalseLiteral)`.
+ *
+ * As a result, many unnecessary computations can be removed in the query 
optimization phase.
+ *
+ * Similarly, the same logic can be applied to conditions in [[Join]], 
predicates in [[If]],
+ * conditions in [[CaseWhen]].
+ */
+object ReplaceNullWithFalse extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+case f @ Filter(cond, _) => f.copy(condition = 
replaceNullWithFalse(cond))
+case j @ Join(_, _, _, Some(cond)) => j.copy(condition = 
Some(replaceNullWithFalse(cond)))
+case p: LogicalPlan => p transformExpressions {
+  case i @ If(pred, _, _) => i.copy(predicate = 
replaceNullWithFalse(pred))
+  case CaseWhen(branches, elseValue) =>
+val newBranches = branches.map { case (cond, value) =>
+  replaceNullWithFalse(cond) -> value
+}
+CaseWhen(newBranches, elseValue)
+}
+  }
+
+  /**
+   * Recursively replaces `Literal(null, _)` with `FalseLiteral`.
+   *
+   * Note that `transformExpressionsDown` can not be used here as we must 
stop as soon as we hit
+   * an expression that is not [[CaseWhen]], [[If]], [[And]], [[Or]] or 
`Literal(null, _)`.
+   */
+  private def replaceNullWithFalse(e: Expression): Expression = e match {
+case cw: CaseWhen if getValues(cw).forall(isNullOrBoolean) =>
+  val newBranches = cw.branches.map { case (cond, value) =>
+replaceNullWithFalse(cond) -> replaceNullWithFalse(value)
+  }
+  val newElseValue = cw.elseValue.map(replaceNullWithFalse)
+  CaseWhen(newBranches, newElseValue)
+case If(pred, trueVal, falseVal) if Seq(trueVal, 
falseVal).forall(isNullOrBoolean) =>
+  If(replaceNullWithFalse(pred), replaceNullWithFalse(trueVal), 
replaceNullWithFalse(falseVal))
+case And(left, right) =>
--- End diff --

Could you elaborate a bit more on `null && false`?

I had in mind `AND(true, null)` and `OR(false, null)`, which are tricky. 
For example, if we use `AND(true, null)` in SELECT, we will get `null`. 
However, if we use it inside `Filter` or predicate of `If`, it will be 
semantically equivalent to `false` (e.g., `If$eval`). Therefore, the proposed 
rule has a limited scope. I explored the source code & comments in `And/Or` to 
come up with an edge case that wouldn’t work. I could not find such a case. 
To me, it seems safe because the rule is applied only to expressions that 
evaluate to `false` if the underlying expression is `null` (i.e., conditions in 
`Filter`/`Join`, predicates in `If`, conditions in `CaseWhen`). 

Please let me know if you have a particular case to test.
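
A minimal spark-shell sketch of the three-valued logic described above, assuming a SparkSession named `spark`: `true AND null` is null when projected, but acts like false when used as a filter condition.

```
spark.sql("SELECT true AND CAST(NULL AS BOOLEAN) AS v").show()
// +----+
// |   v|
// +----+
// |null|
// +----+
spark.sql("SELECT * FROM VALUES (1) AS t(a) WHERE true AND CAST(NULL AS BOOLEAN)").show()
// returns no rows: a null filter condition drops the row, same as false
```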


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22857: [SPARK-25860][SQL] Replace Literal(null, _) with FalseLi...

2018-10-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22857
  
**[Test build #98240 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98240/testReport)**
 for PR 22857 at commit 
[`4c35955`](https://github.com/apache/spark/commit/4c35955582d3e38722702049780c3c9946a695da).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22771: [SPARK-25773][Core]Cancel zombie tasks in a result stage...

2018-10-29 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/22771
  
@markhamstra any further comments?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22857: [SPARK-25860][SQL] Replace Literal(null, _) with FalseLi...

2018-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22857
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4615/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


