[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...

2018-07-19 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21774#discussion_r203943047
  
--- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala ---
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.avro.generic.GenericDatumReader
+import org.apache.avro.io.{BinaryDecoder, DecoderFactory}
+
+import org.apache.spark.sql.avro.{AvroDeserializer, SchemaConverters, SerializableSchema}
+import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, UnaryExpression}
+import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, CodeGenerator, ExprCode}
+import org.apache.spark.sql.types.{AbstractDataType, BinaryType, DataType}
+
+case class AvroDataToCatalyst(child: Expression, avroType: SerializableSchema)
+  extends UnaryExpression with ExpectsInputTypes {
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(BinaryType)
+
+  override lazy val dataType: DataType =
+    SchemaConverters.toSqlType(avroType.value).dataType
+
+  override def nullable: Boolean = true
+
+  @transient private lazy val reader = new GenericDatumReader[Any](avroType.value)
+
+  @transient private lazy val deserializer = new AvroDeserializer(avroType.value, dataType)
+
+  @transient private var decoder: BinaryDecoder = _
+
+  @transient private var result: Any = _
+
+  override def nullSafeEval(input: Any): Any = {
+    val binary = input.asInstanceOf[Array[Byte]]
+    decoder = DecoderFactory.get().binaryDecoder(binary, 0, binary.length, decoder)
+    result = reader.read(result, decoder)
+    deserializer.deserialize(result)
+  }
+
+  override def simpleString: String = {
+    s"from_avro(${child.sql}, ${dataType.simpleString})"
--- End diff --

It's not used in `sql`; we can override `sql` here and use the untruncated version.
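
A minimal sketch of what that could look like inside `AvroDataToCatalyst` (an illustration, not the actual patch):

```
// Sketch only: keep the short form for plan strings via simpleString,
// and override sql to emit the full, untruncated expression.
override def simpleString: String =
  s"from_avro(${child.sql}, ${dataType.simpleString})"

override def sql: String =
  s"from_avro(${child.sql}, ${dataType.sql})"
```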





[GitHub] spark pull request #21817: [SPARK-24861][SS][test] create corrected temp dir...

2018-07-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21817





[GitHub] spark issue #21817: [SPARK-24861][SS][test] create corrected temp directorie...

2018-07-19 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21817
  
thanks, merging to master!





[GitHub] spark issue #21822: [SPARK-24865] Remove AnalysisBarrier - WIP

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21822
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1156/
Test PASSed.





[GitHub] spark issue #21822: [SPARK-24865] Remove AnalysisBarrier - WIP

2018-07-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21822
  
**[Test build #93320 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93320/testReport)**
 for PR 21822 at commit 
[`83ffa51`](https://github.com/apache/spark/commit/83ffa51f4b165152dea214be4d73dd518d742a56).





[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...

2018-07-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20146
  
@HyukjinKwon Thanks. Closed and re-opened now.





[GitHub] spark pull request #20146: [SPARK-11215][ML] Add multiple columns support to...

2018-07-19 Thread viirya
GitHub user viirya reopened a pull request:

https://github.com/apache/spark/pull/20146

[SPARK-11215][ML] Add multiple columns support to StringIndexer

## What changes were proposed in this pull request?

This takes over #19621 to add multi-column support to StringIndexer.

## How was this patch tested?

Added tests.
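
A hypothetical usage sketch of the multi-column API (the setter names follow the single-column `StringIndexer` conventions and may differ from the final patch):

```
import org.apache.spark.ml.feature.StringIndexer
import spark.implicits._

// Hypothetical example: index two string columns in one pass.
val df = Seq(("a", "x"), ("b", "y"), ("a", "x")).toDF("in1", "in2")

val indexer = new StringIndexer()
  .setInputCols(Array("in1", "in2"))      // assumed multi-column setters
  .setOutputCols(Array("in1Idx", "in2Idx"))

indexer.fit(df).transform(df).show()
```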


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 SPARK-11215

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20146.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20146


commit bb990f1a6511d8ce20f4fff254dfe0ff43262a10
Author: Liang-Chi Hsieh 
Date:   2018-01-03T03:51:59Z

Add multi-column support to StringIndexer.

commit 26cc94bb335cf0ba3bcdbc2b78effd447026792c
Author: Liang-Chi Hsieh 
Date:   2018-01-07T01:42:28Z

Fix glm test.

commit 540c364d2a70ecd6ee5b92fadedc5e9b85026d2c
Author: Liang-Chi Hsieh 
Date:   2018-01-16T08:20:19Z

Merge remote-tracking branch 'upstream/master' into SPARK-11215

commit 18acbbf7b70b87c75ba62be863580fe9accc23b4
Author: Liang-Chi Hsieh 
Date:   2018-01-24T12:03:26Z

Improve test cases.

commit b884fb5c0ce1e627390d08d8425721ea8e4d
Author: Liang-Chi Hsieh 
Date:   2018-01-27T00:58:06Z

Merge remote-tracking branch 'upstream/master' into SPARK-11215

commit 76ff7bf6054a687abd9fc16c8044020a5454d95f
Author: Liang-Chi Hsieh 
Date:   2018-04-19T11:27:21Z

Merge remote-tracking branch 'upstream/master' into SPARK-11215

commit 50af02eaccce7cecb7c3093d5bc14675ca860c22
Author: Liang-Chi Hsieh 
Date:   2018-04-19T11:30:46Z

Change from 2.3 to 2.4.

commit c1be2c7e28ebdfed580577a108d2f254834caed7
Author: Liang-Chi Hsieh 
Date:   2018-04-23T10:15:49Z

Address comments.

commit ed35d875414ba3cf8751a77463f61665e9c373b0
Author: Liang-Chi Hsieh 
Date:   2018-04-23T14:00:16Z

Address comment.

commit a1dcfda85243a1e2210177f2acfb78821c539b17
Author: Liang-Chi Hsieh 
Date:   2018-04-24T06:41:07Z

Use SQL Aggregator for counting string labels.

commit c1685228f7ec7f4904bca67efaee70498b9894c8
Author: Liang-Chi Hsieh 
Date:   2018-04-25T13:28:24Z

Merge remote-tracking branch 'upstream/master' into SPARK-11215

commit a6551b02a10428d66e0dadcfcb5a8da3798ec814
Author: Liang-Chi Hsieh 
Date:   2018-04-26T04:13:09Z

Drop NA values for both frequency and alphabet order types.

commit c003bd3d6c58cf19249ff0ba9dd10140971d655c
Author: Liang-Chi Hsieh 
Date:   2018-07-18T18:58:21Z

Merge remote-tracking branch 'upstream/master' into SPARK-11215







[GitHub] spark pull request #20146: [SPARK-11215][ML] Add multiple columns support to...

2018-07-19 Thread viirya
Github user viirya closed the pull request at:

https://github.com/apache/spark/pull/20146





[GitHub] spark issue #21821: [SPARK-24867] [SQL] Add AnalysisBarrier to DataFrameWrit...

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21821
  
Merged build finished. Test PASSed.





[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...

2018-07-19 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21774#discussion_r203938957
  
--- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/package.scala ---
@@ -36,4 +40,27 @@ package object avro {
     @scala.annotation.varargs
     def avro(sources: String*): DataFrame = reader.format("avro").load(sources: _*)
   }
+
+  /**
+   * Converts a binary column of avro format into its corresponding catalyst value. The specified
+   * schema must match the read data, otherwise the behavior is undefined: it may fail or return
+   * arbitrary result.
+   *
--- End diff --

+1





[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...

2018-07-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21774#discussion_r203938964
  
--- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala ---
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.avro.generic.GenericDatumReader
+import org.apache.avro.io.{BinaryDecoder, DecoderFactory}
+
+import org.apache.spark.sql.avro.{AvroDeserializer, SchemaConverters, SerializableSchema}
+import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, UnaryExpression}
+import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, CodeGenerator, ExprCode}
+import org.apache.spark.sql.types.{AbstractDataType, BinaryType, DataType}
+
+case class AvroDataToCatalyst(child: Expression, avroType: SerializableSchema)
+  extends UnaryExpression with ExpectsInputTypes {
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(BinaryType)
+
+  override lazy val dataType: DataType =
+    SchemaConverters.toSqlType(avroType.value).dataType
+
+  override def nullable: Boolean = true
+
+  @transient private lazy val reader = new GenericDatumReader[Any](avroType.value)
+
+  @transient private lazy val deserializer = new AvroDeserializer(avroType.value, dataType)
+
+  @transient private var decoder: BinaryDecoder = _
+
+  @transient private var result: Any = _
+
+  override def nullSafeEval(input: Any): Any = {
+    val binary = input.asInstanceOf[Array[Byte]]
+    decoder = DecoderFactory.get().binaryDecoder(binary, 0, binary.length, decoder)
+    result = reader.read(result, decoder)
+    deserializer.deserialize(result)
+  }
+
+  override def simpleString: String = {
+    s"from_avro(${child.sql}, ${dataType.simpleString})"
--- End diff --

But this is being used in `sql`, though. Do we prefer the truncated string form in `sql` too?





[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...

2018-07-19 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21774#discussion_r203938936
  
--- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/package.scala ---
@@ -36,4 +40,27 @@ package object avro {
     @scala.annotation.varargs
     def avro(sources: String*): DataFrame = reader.format("avro").load(sources: _*)
   }
+
+  /**
+   * Converts a binary column of avro format into its corresponding catalyst value. The specified
+   * schema must match the read data, otherwise the behavior is undefined: it may fail or return
+   * arbitrary result.
+   *
+   * @param data the binary column.
+   * @param avroType the avro type.
+   */
+  @Experimental
+  def from_avro(data: Column, avroType: Schema): Column = {
--- End diff --

`Column` can support arbitrary expressions. Personally, I don't like these string-based overloads...

Since it's in an external module, I think we can only add the Python/R/SQL versions once we have a persistent UDF API.
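
A hypothetical usage sketch of the `Column`-based API quoted above (the schema JSON, DataFrame, and column name are assumptions):

```
import org.apache.avro.Schema
import org.apache.spark.sql.avro._
import spark.implicits._  // for the $"..." column syntax

// Assumed Avro schema, for illustration only.
val avroSchema = new Schema.Parser().parse(
  """{"type": "record", "name": "event", "fields": [{"name": "id", "type": "long"}]}""")

// df is assumed to have a binary column "value" holding Avro-encoded bytes.
val decoded = df.select(from_avro($"value", avroSchema).as("event"))
```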





[GitHub] spark issue #21821: [SPARK-24867] [SQL] Add AnalysisBarrier to DataFrameWrit...

2018-07-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21821
  
**[Test build #93319 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93319/testReport)**
 for PR 21821 at commit 
[`9edc28f`](https://github.com/apache/spark/commit/9edc28fdcb7261f01db716f65e723668a493327e).





[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...

2018-07-19 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21774#discussion_r203938730
  
--- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala ---
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.avro.generic.GenericDatumReader
+import org.apache.avro.io.{BinaryDecoder, DecoderFactory}
+
+import org.apache.spark.sql.avro.{AvroDeserializer, SchemaConverters, SerializableSchema}
+import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, UnaryExpression}
+import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, CodeGenerator, ExprCode}
+import org.apache.spark.sql.types.{AbstractDataType, BinaryType, DataType}
+
+case class AvroDataToCatalyst(child: Expression, avroType: SerializableSchema)
+  extends UnaryExpression with ExpectsInputTypes {
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(BinaryType)
+
+  override lazy val dataType: DataType =
+    SchemaConverters.toSqlType(avroType.value).dataType
+
+  override def nullable: Boolean = true
+
+  @transient private lazy val reader = new GenericDatumReader[Any](avroType.value)
+
+  @transient private lazy val deserializer = new AvroDeserializer(avroType.value, dataType)
+
+  @transient private var decoder: BinaryDecoder = _
+
+  @transient private var result: Any = _
+
+  override def nullSafeEval(input: Any): Any = {
+    val binary = input.asInstanceOf[Array[Byte]]
+    decoder = DecoderFactory.get().binaryDecoder(binary, 0, binary.length, decoder)
+    result = reader.read(result, decoder)
+    deserializer.deserialize(result)
+  }
+
+  override def simpleString: String = {
+    s"from_avro(${child.sql}, ${dataType.simpleString})"
--- End diff --

IIRC `simpleString` will be used in the plan string and should not be too 
long.





[GitHub] spark issue #21823: [SPARK-24870][SQL]Cache can't work normally if there are...

2018-07-19 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21823
  
Do you know why it's happening? It's super weird that `select key from src_cache where positiveNum = 1` can hit the cache but `select key from src_cache` cannot.





[GitHub] spark issue #21533: [SPARK-24195][Core] Ignore the files with "local" scheme...

2018-07-19 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/21533
  
Thanks everyone for your help!





[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-07-19 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21320
  
@HyukjinKwon We plan to merge this highly desirable feature into the Spark 2.4 release.





[GitHub] spark issue #21733: [SPARK-24763][SS] Remove redundant key data from value i...

2018-07-19 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/spark/pull/21733
  
Now I'd like to propose changing the default behavior to apply the new path while keeping backward compatibility, so I've applied that to the patch. I'm still open to shipping it as an advanced option instead, as in the first approach, and happy to roll back if we decide to go that way.





[GitHub] spark issue #21820: [SPARK-24868][PYTHON]add sequence function in Python

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21820
  
Merged build finished. Test PASSed.





[GitHub] spark issue #21820: [SPARK-24868][PYTHON]add sequence function in Python

2018-07-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21820
  
**[Test build #93317 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93317/testReport)**
 for PR 21820 at commit 
[`725c1b7`](https://github.com/apache/spark/commit/725c1b74d13309a567698b24150d1e7e06662769).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #21824: [SPARK-24871][SQL] Refactor Concat and MapConcat to avoi...

2018-07-19 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/21824
  
also cc @cloud-fan @gatorsmile 





[GitHub] spark issue #21802: [SPARK-23928][SQL] Add shuffle collection function.

2018-07-19 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/21802
  
cc @cloud-fan @gatorsmile 





[GitHub] spark pull request #21809: [SPARK-24851] : Map a Stage ID to it's Associated...

2018-07-19 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21809#discussion_r203936074
  
--- Diff: core/src/main/scala/org/apache/spark/status/AppStatusStore.scala ---
@@ -94,6 +94,13 @@ private[spark] class AppStatusStore(
     }.toSeq
   }
 
+  def getJobIdsAssociatedWithStage(stageId: Int): Seq[Set[Int]] = {
--- End diff --

We don't need to fetch all the stage attempts; the first one is enough to get all the jobIds.





[GitHub] spark pull request #21818: [SPARK-24860][SQL] Support setting of partitionOv...

2018-07-19 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21818#discussion_r203934987
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala ---
@@ -486,6 +486,25 @@ class InsertSuite extends DataSourceTest with SharedSQLContext {
       }
     }
   }
+  test("SPARK-24860: dynamic partition overwrite specified per source without catalog table") {
+    withTempPath { path =>
+      Seq((1, 1, 1)).toDF("i", "part1", "part2")
--- End diff --

Can we simplify the test? Ideally we only need one partition column: write some initial data to the table, then do an overwrite with `partitionOverwriteMode=dynamic`, and another overwrite with `partitionOverwriteMode=static`.
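
A minimal sketch of such a test (illustrative only; the column names, values, and assertions are assumptions):

```
withTempPath { path =>
  // initial data: two partitions
  Seq((1, 1), (2, 2)).toDF("i", "part")
    .write.partitionBy("part").parquet(path.getAbsolutePath)

  // dynamic mode: only the partition being written (part=1) is replaced
  Seq((3, 1)).toDF("i", "part")
    .write.mode("overwrite")
    .option("partitionOverwriteMode", "dynamic")
    .partitionBy("part").parquet(path.getAbsolutePath)
  checkAnswer(spark.read.parquet(path.getAbsolutePath), Row(3, 1) :: Row(2, 2) :: Nil)

  // static mode: all existing partitions are wiped before writing
  Seq((4, 2)).toDF("i", "part")
    .write.mode("overwrite")
    .option("partitionOverwriteMode", "static")
    .partitionBy("part").parquet(path.getAbsolutePath)
  checkAnswer(spark.read.parquet(path.getAbsolutePath), Row(4, 2) :: Nil)
}
```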





[GitHub] spark issue #21824: [SPARK-24871][SQL] Refactor Concat and MapConcat to avoi...

2018-07-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21824
  
**[Test build #93316 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93316/testReport)**
 for PR 21824 at commit 
[`644a30d`](https://github.com/apache/spark/commit/644a30d32e390f39f25515bb251e120e69502f27).





[GitHub] spark issue #21820: [SPARK-24868][PYTHON]add sequence function in Python

2018-07-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21820
  
**[Test build #93317 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93317/testReport)**
 for PR 21820 at commit 
[`725c1b7`](https://github.com/apache/spark/commit/725c1b74d13309a567698b24150d1e7e06662769).





[GitHub] spark pull request #21818: [SPARK-24860][SQL] Support setting of partitionOv...

2018-07-19 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21818#discussion_r203934743
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
@@ -91,8 +92,12 @@ case class InsertIntoHadoopFsRelationCommand(
 
     val pathExists = fs.exists(qualifiedOutputPath)
 
-    val enableDynamicOverwrite =
-      sparkSession.sessionState.conf.partitionOverwriteMode == PartitionOverwriteMode.DYNAMIC
+    val parameters = CaseInsensitiveMap(options)
+
+    val enableDynamicOverwrite = parameters.get("partitionOverwriteMode")
+      .map(mode => PartitionOverwriteMode.withName(mode.toUpperCase))
+      .getOrElse(sparkSession.sessionState.conf.partitionOverwriteMode) ==
+      PartitionOverwriteMode.DYNAMIC
--- End diff --

nit: this is too long
```
val partitionOverwriteMode = parameters.get("partitionOverwriteMode")
val enableDynamicOverwrite = partitionOverwriteMode == PartitionOverwriteMode.DYNAMIC
```
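
For completeness, a fuller version of that split, reusing the mapping from the original diff (a sketch, not necessarily cloud-fan's exact intent):

```
val partitionOverwriteMode = parameters.get("partitionOverwriteMode")
  .map(mode => PartitionOverwriteMode.withName(mode.toUpperCase))
  .getOrElse(sparkSession.sessionState.conf.partitionOverwriteMode)
val enableDynamicOverwrite = partitionOverwriteMode == PartitionOverwriteMode.DYNAMIC
```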





[GitHub] spark issue #21824: [SPARK-24871][SQL] Refactor Concat and MapConcat to avoi...

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21824
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1153/
Test PASSed.





[GitHub] spark pull request #21818: [SPARK-24860][SQL] Support setting of partitionOv...

2018-07-19 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21818#discussion_r203934766
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala ---
@@ -486,6 +486,25 @@ class InsertSuite extends DataSourceTest with SharedSQLContext {
       }
     }
   }
+  test("SPARK-24860: dynamic partition overwrite specified per source without catalog table") {
--- End diff --

nit: add a blank line before this new test case





[GitHub] spark issue #21824: [SPARK-24871][SQL] Refactor Concat and MapConcat to avoi...

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21824
  
Merged build finished. Test PASSed.





[GitHub] spark issue #21824: [SPARK-24871][SQL] Refactor Concat and MapConcat to avoi...

2018-07-19 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/21824
  
cc @mn-mikke @bersprockets





[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-07-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r203934633
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala ---
@@ -288,6 +310,27 @@ private[parquet] object ParquetReadSupport {
     }
   }
 
+  /**
+   * Computes the structural intersection between two Parquet group types.
+   */
+  private def intersectParquetGroups(
+      groupType1: GroupType, groupType2: GroupType): Option[GroupType] = {
+    val fields =
+      groupType1.getFields.asScala
+        .filter(field => groupType2.containsField(field.getName))
+        .flatMap {
+          case field1: GroupType =>
--- End diff --

`field1` -> `field`.





[GitHub] spark pull request #21824: [SPARK-24871][SQL] Refactor Concat and MapConcat ...

2018-07-19 Thread ueshin
GitHub user ueshin opened a pull request:

https://github.com/apache/spark/pull/21824

[SPARK-24871][SQL] Refactor Concat and MapConcat to avoid creating 
concatenator object for each row.

## What changes were proposed in this pull request?

Refactor `Concat` and `MapConcat` to:

- avoid creating concatenator object for each row.
- make `Concat` handle `containsNull` properly.
- make `Concat` shortcut if `null` child is found.

## How was this patch tested?

Added some tests and existing tests.
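
A minimal, self-contained sketch of the "create once, reuse per row" pattern named in the title (an illustration under assumed names, not the actual patch):

```
// Before: a new builder was allocated inside eval() for every input row.
// After: one transient, lazily created instance is reused across rows.
class ConcatEvaluator {
  @transient private lazy val builder = new java.lang.StringBuilder

  def eval(parts: Seq[String]): String = {
    builder.setLength(0)             // reset state instead of reallocating
    parts.foreach(p => builder.append(p))
    builder.toString
  }
}
```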


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ueshin/apache-spark issues/SPARK-24871/refactor_concat_mapconcat

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21824.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21824


commit c4d4b28756d93ecc5c1e9a50e705e2a9cb4b8460
Author: Takuya UESHIN 
Date:   2018-07-19T08:32:15Z

Refactor `Concat` and `MapConcat` to avoid creating concatenator for each 
row.

commit 644a30d32e390f39f25515bb251e120e69502f27
Author: Takuya UESHIN 
Date:   2018-07-19T10:38:20Z

Make `Concat` shortcut if null child is found.







[GitHub] spark pull request #21820: [SPARK-24868][PYTHON]add sequence function in Pyt...

2018-07-19 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/21820#discussion_r203934505
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2551,6 +2551,27 @@ def map_concat(*cols):
     return Column(jc)
 
 
+@since(2.4)
+def sequence(start, stop, step=None):
+    """
+    Generate a sequence of integers from start to stop, incrementing by step.
+    If step is not set, incrementing by 1 if start is less than or equal to stop, otherwise -1.
+
+    >>> df1 = spark.createDataFrame([(-2, 2)], ('C1', 'C2'))
+    >>> df1.select(sequence('C1', 'C2').alias('r')).collect()
+    [Row(r=[-2, -1, 0, 1, 2])]
+    >>> df2 = spark.createDataFrame([(4, -4, -2)], ('C1', 'C2', 'C3'))
--- End diff --

It will throw `java.lang.IllegalArgumentException`. There is a check in `collectionOperations.scala`:
```
require(
  (step > num.zero && start <= stop)
    || (step < num.zero && start >= stop)
    || (step == num.zero && start == stop),
  s"Illegal sequence boundaries: $start to $stop by $step")
```
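
For illustration, a boundary combination that fails that check (a sketch; the message format comes from the `require` above):

```
// A positive step with start > stop matches none of the three clauses,
// so the require(...) throws at evaluation time.
spark.sql("SELECT sequence(4, -4, 2)").collect()
// java.lang.IllegalArgumentException: Illegal sequence boundaries: 4 to -4 by 2
```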





[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-07-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r203934281
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala ---
@@ -47,16 +47,25 @@ import org.apache.spark.sql.types._
  *
  * Due to this reason, we no longer rely on [[ReadContext]] to pass requested schema from [[init()]]
  * to [[prepareForRead()]], but use a private `var` for simplicity.
+ *
+ * @param parquetMrCompatibility support reading with parquet-mr or Spark's built-in Parquet reader
  */
-private[parquet] class ParquetReadSupport(val convertTz: Option[TimeZone])
+private[parquet] class ParquetReadSupport(val convertTz: Option[TimeZone],
+    parquetMrCompatibility: Boolean)
   extends ReadSupport[UnsafeRow] with Logging {
   private var catalystRequestedSchema: StructType = _
 
+  /**
+   * Construct a [[ParquetReadSupport]] with [[convertTz]] set to [[None]] and
+   * [[parquetMrCompatibility]] set to [[false]].
+   *
+   * We need a zero-arg constructor for SpecificParquetRecordReaderBase.  But that is only
+   * used in the vectorized reader, where we get the convertTz value directly, and the value here
+   * is ignored. Further, we set [[parquetMrCompatibility]] to [[false]] as this constructor is only
+   * called by the Spark reader.
--- End diff --

Re: https://github.com/apache/spark/pull/21320/files/cb858f202e49d69f2044681e37f982dc10676296#r199631341

Actually, it doesn't look clear to me either. What does the flag indicate? Do you mean the normal Parquet reader vs. the vectorized Parquet reader?





[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-07-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r203933423
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/planning/SelectedFieldSuite.scala ---
@@ -0,0 +1,387 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.planning
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.exceptions.TestFailedException
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.expressions.NamedExpression
+import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
+import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
+import org.apache.spark.sql.types._
+
+// scalastyle:off line.size.limit
+class SelectedFieldSuite extends SparkFunSuite with BeforeAndAfterAll {
+  // The test schema as a tree string, i.e. `schema.treeString`
+  // root
+  //  |-- col1: string (nullable = false)
+  //  |-- col2: struct (nullable = true)
+  //  |    |-- field1: integer (nullable = true)
+  //  |    |-- field2: array (nullable = true)
+  //  |    |    |-- element: integer (containsNull = false)
+  //  |    |-- field3: array (nullable = false)
+  //  |    |    |-- element: struct (containsNull = true)
+  //  |    |    |    |-- subfield1: integer (nullable = true)
+  //  |    |    |    |-- subfield2: integer (nullable = true)
+  //  |    |    |    |-- subfield3: array (nullable = true)
+  //  |    |    |    |    |-- element: integer (containsNull = true)
+  //  |    |-- field4: map (nullable = true)
+  //  |    |    |-- key: string
+  //  |    |    |-- value: struct (valueContainsNull = false)
+  //  |    |    |    |-- subfield1: integer (nullable = true)
+  //  |    |    |    |-- subfield2: array (nullable = true)
+  //  |    |    |    |    |-- element: integer (containsNull = false)
+  //  |    |-- field5: array (nullable = false)
+  //  |    |    |-- element: struct (containsNull = true)
+  //  |    |    |    |-- subfield1: struct (nullable = false)
+  //  |    |    |    |    |-- subsubfield1: integer (nullable = true)
+  //  |    |    |    |    |-- subsubfield2: integer (nullable = true)
+  //  |    |    |    |-- subfield2: struct (nullable = true)
+  //  |    |    |    |    |-- subsubfield1: struct (nullable = true)
+  //  |    |    |    |    |    |-- subsubsubfield1: string (nullable = true)
+  //  |    |    |    |    |-- subsubfield2: integer (nullable = true)
+  //  |    |-- field6: struct (nullable = true)
+  //  |    |    |-- subfield1: string (nullable = false)
+  //  |    |    |-- subfield2: string (nullable = true)
+  //  |    |-- field7: struct (nullable = true)
+  //  |    |    |-- subfield1: struct (nullable = true)
+  //  |    |    |    |-- subsubfield1: integer (nullable = true)
+  //  |    |    |    |-- subsubfield2: integer (nullable = true)
+  //  |    |-- field8: map (nullable = true)
+  //  |    |    |-- key: string
+  //  |    |    |-- value: array (valueContainsNull = false)
+  //  |    |    |    |-- element: struct (containsNull = true)
+  //  |    |    |    |    |-- subfield1: integer (nullable = true)
+  //  |    |    |    |    |-- subfield2: array (nullable = true)
+  //  |    |    |    |    |    |-- element: integer (containsNull = false)
+  //  |    |-- field9: map (nullable = true)
+  //  |    |    |-- key: string
+  //  |    |    |-- value: integer (valueContainsNull = false)
+  //  |-- col3: array (nullable = false)
+  //  |    |-- element: struct (containsNull = false)
+  //  |    |    |-- field1: struct (nullable = true)
+  //  |    |    |    |-- subfield1: integer (nullable = false)
+  //  |    |    |    |-- subfield2: integer (nullable = true)
+  //  |    |    |-- field2: map (nullable = true)
+  //  |    |    |    |-- key: string
+  //  |    |    |    |-- value: integer

[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-07-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r203933307
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala ---
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import java.io.File
+
+import org.apache.spark.sql.{QueryTest, Row}
+import org.apache.spark.sql.execution.FileSchemaPruningTest
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+
+class ParquetSchemaPruningSuite
+    extends QueryTest
+    with ParquetTest
+    with FileSchemaPruningTest
+    with SharedSQLContext {
--- End diff --

Can we just simply:

```scala
  override def beforeEach(): Unit = {
    super.beforeEach()
    spark.conf.set(SQLConf.NESTED_SCHEMA_PRUNING_ENABLED.key, "true")
  }

  override def afterEach(): Unit = {
    try {
      spark.conf.unset(SQLConf.NESTED_SCHEMA_PRUNING_ENABLED.key)
    } finally {
      super.afterEach()
    }
  }
```

without the complicated hierarchy?





[GitHub] spark issue #21653: [SPARK-13343] speculative tasks that didn't commit shoul...

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21653
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93302/
Test PASSed.





[GitHub] spark issue #21733: [SPARK-24763][SS] Remove redundant key data from value i...

2018-07-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21733
  
**[Test build #93315 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93315/testReport)**
 for PR 21733 at commit 
[`977428c`](https://github.com/apache/spark/commit/977428cb35a6fc0a9fa7a0ca1a51e39a94447a01).





[GitHub] spark issue #21653: [SPARK-13343] speculative tasks that didn't commit shoul...

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21653
  
Merged build finished. Test PASSed.





[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-07-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r203931755
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/ProjectionOverSchema.scala ---
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.planning
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.types._
+
+/**
+ * A Scala extractor that projects an expression over a given schema. Data types,
+ * field indexes and field counts of complex type extractors and attributes
+ * are adjusted to fit the schema. All other expressions are left as-is. This
+ * class is motivated by columnar nested schema pruning.
+ */
+case class ProjectionOverSchema(schema: StructType) {
--- End diff --

It still looks weird that we place this under `catalyst`, since we currently only use it under `execution`.





[GitHub] spark issue #21813: [SPARK-24424][SQL] Support ANSI-SQL compliant syntax for...

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21813
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93304/
Test PASSed.





[GitHub] spark issue #21813: [SPARK-24424][SQL] Support ANSI-SQL compliant syntax for...

2018-07-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21813
  
**[Test build #93304 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93304/testReport)**
 for PR 21813 at commit 
[`e0c57f7`](https://github.com/apache/spark/commit/e0c57f73ec4c3e24e4af107cb457c9b1c13f4174).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-07-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r203931060
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/ProjectionOverSchema.scala ---
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.planning
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.types._
+
+/**
+ * A Scala extractor that projects an expression over a given schema. Data types,
+ * field indexes and field counts of complex type extractors and attributes
+ * are adjusted to fit the schema. All other expressions are left as-is. This
+ * class is motivated by columnar nested schema pruning.
+ */
+case class ProjectionOverSchema(schema: StructType) {
+  private val fieldNames = schema.fieldNames.toSet
+
+  def unapply(expr: Expression): Option[Expression] = getProjection(expr)
+
+  private def getProjection(expr: Expression): Option[Expression] =
+    expr match {
+      case a @ AttributeReference(name, _, _, _) if (fieldNames.contains(name)) =>
+        Some(a.copy(dataType = schema(name).dataType)(a.exprId, a.qualifier))
+      case GetArrayItem(child, arrayItemOrdinal) =>
+        getProjection(child).map {
+          case projection =>
+            GetArrayItem(projection, arrayItemOrdinal)
--- End diff --

nit:

```
getProjection(child).map { projection => GetArrayItem(projection, arrayItemOrdinal) }
```





[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-07-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r203931030
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/ProjectionOverSchema.scala ---
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.planning
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.types._
+
+/**
+ * A Scala extractor that projects an expression over a given schema. Data types,
+ * field indexes and field counts of complex type extractors and attributes
+ * are adjusted to fit the schema. All other expressions are left as-is. This
+ * class is motivated by columnar nested schema pruning.
+ */
+case class ProjectionOverSchema(schema: StructType) {
+  private val fieldNames = schema.fieldNames.toSet
+
+  def unapply(expr: Expression): Option[Expression] = getProjection(expr)
+
+  private def getProjection(expr: Expression): Option[Expression] =
+    expr match {
+      case a @ AttributeReference(name, _, _, _) if (fieldNames.contains(name)) =>
+        Some(a.copy(dataType = schema(name).dataType)(a.exprId, a.qualifier))
+      case GetArrayItem(child, arrayItemOrdinal) =>
+        getProjection(child).map {
+          case projection =>
+            GetArrayItem(projection, arrayItemOrdinal)
+        }
+      case GetArrayStructFields(child, StructField(name, _, _, _), _, numFields, containsNull) =>
+        getProjection(child).map(p => (p, p.dataType)).map {
+          case (projection, ArrayType(projSchema @ StructType(_), _)) =>
+            GetArrayStructFields(projection,
+              projSchema(name), projSchema.fieldIndex(name), projSchema.size, containsNull)
+        }
+      case GetMapValue(child, key) =>
+        getProjection(child).map {
+          case projection =>
--- End diff --

nit:

```scala
getProjection(child).map { projection => GetMapValue(projection, key) }
```





[GitHub] spark issue #21822: [SPARK-24865] Remove AnalysisBarrier - WIP

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21822
  
Merged build finished. Test FAILed.





[GitHub] spark issue #21822: [SPARK-24865] Remove AnalysisBarrier - WIP

2018-07-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21822
  
**[Test build #93310 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93310/testReport)**
 for PR 21822 at commit 
[`738e99c`](https://github.com/apache/spark/commit/738e99c615f45cfb6e5882a822dafcb8472b78ea).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #21822: [SPARK-24865] Remove AnalysisBarrier - WIP

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21822
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93310/
Test FAILed.





[GitHub] spark issue #21474: [SPARK-24297][CORE] Fetch-to-disk by default for > 2gb

2018-07-19 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/21474
  
I will take a look at this sometime today, but don't block on me if it is urgent.





[GitHub] spark pull request #21533: [SPARK-24195][Core] Ignore the files with "local"...

2018-07-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21533





[GitHub] spark issue #21533: [SPARK-24195][Core] Ignore the files with "local" scheme...

2018-07-19 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/21533
  
Merging to master branch. Thanks all!





[GitHub] spark issue #21533: [SPARK-24195][Core] Ignore the files with "local" scheme...

2018-07-19 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21533
  
SGTM too





[GitHub] spark issue #21821: [SPARK-24867] [SQL] Add AnalysisBarrier to DataFrameWrit...

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21821
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93309/
Test FAILed.





[GitHub] spark issue #21821: [SPARK-24867] [SQL] Add AnalysisBarrier to DataFrameWrit...

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21821
  
Merged build finished. Test FAILed.





[GitHub] spark issue #21772: [SPARK-24809] [SQL] Serializing LongHashedRelation in ex...

2018-07-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21772
  
**[Test build #93314 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93314/testReport)**
 for PR 21772 at commit 
[`f67ff4d`](https://github.com/apache/spark/commit/f67ff4d91764e67f811ebfb7c02ab466153a5b02).





[GitHub] spark issue #21772: [SPARK-24809] [SQL] Serializing LongHashedRelation in ex...

2018-07-19 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/21772
  
Jenkins, test this please





[GitHub] spark pull request #21820: [SPARK-24868][PYTHON]add sequence function in Pyt...

2018-07-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21820#discussion_r203924999
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2551,6 +2551,27 @@ def map_concat(*cols):
     return Column(jc)
 
 
+@since(2.4)
+def sequence(start, stop, step=None):
+    """
+    Generate a sequence of integers from start to stop, incrementing by step.
+    If step is not set, incrementing by 1 if start is less than or equal to stop, otherwise -1.
+
+    >>> df1 = spark.createDataFrame([(-2, 2)], ('C1', 'C2'))
+    >>> df1.select(sequence('C1', 'C2').alias('r')).collect()
+    [Row(r=[-2, -1, 0, 1, 2])]
+    >>> df2 = spark.createDataFrame([(4, -4, -2)], ('C1', 'C2', 'C3'))
--- End diff --

What happens if the `step` is positive in this case?





[GitHub] spark issue #21813: [SPARK-24424][SQL] Support ANSI-SQL compliant syntax for...

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21813
  
Merged build finished. Test PASSed.





[GitHub] spark issue #21813: [SPARK-24424][SQL] Support ANSI-SQL compliant syntax for...

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21813
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1152/
Test PASSed.





[GitHub] spark pull request #21822: [SPARK-24865] Remove AnalysisBarrier - WIP

2018-07-19 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/21822#discussion_r203924609
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
@@ -533,7 +537,8 @@ trait CheckAnalysis extends PredicateHelper {
 
         // Simplify the predicates before validating any unsupported correlation patterns
         // in the plan.
-        BooleanSimplification(sub).foreachUp {
+        // TODO(rxin): Why did this need to call BooleanSimplification???
--- End diff --

@rxin From what I remember, Reynold, most of this logic was housed in the Analyzer before and we moved it to the optimizer. In the old code we used to walk the plan after simplifying the predicates. The comment used to read "Simplify the predicates before pulling them out." I just retained those semantics.


---




[GitHub] spark issue #21767: SPARK-24804 There are duplicate words in the test title ...

2018-07-19 Thread httfighter
Github user httfighter commented on the issue:

https://github.com/apache/spark/pull/21767
  
Thank you for your comments, @srowen @HyukjinKwon @wangyum. I will try to make more valuable contributions!


---




[GitHub] spark issue #21746: [SPARK-24699] [SS]Make watermarks work with Trigger.Once...

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21746
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93300/
Test PASSed.


---




[GitHub] spark issue #21746: [SPARK-24699] [SS]Make watermarks work with Trigger.Once...

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21746
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21746: [SPARK-24699] [SS]Make watermarks work with Trigger.Once...

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21746
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21746: [SPARK-24699] [SS]Make watermarks work with Trigger.Once...

2018-07-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21746
  
**[Test build #93300 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93300/testReport)** for PR 21746 at commit [`584c96e`](https://github.com/apache/spark/commit/584c96e5e8ecbeeb4ae4eefe1184606173b83ade).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21823: [SPARK-24870][SQL]Cache can't work normally if there are...

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21823
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1151/
Test PASSed.


---




[GitHub] spark issue #21823: [SPARK-24870][SQL]Cache can't work normally if there are...

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21823
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21823: [SPARK-24870][SQL]Cache can't work normally if there are...

2018-07-19 Thread eatoncys
Github user eatoncys commented on the issue:

https://github.com/apache/spark/pull/21823
  
cc @cloud-fan @gatorsmile 


---




[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...

2018-07-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21774#discussion_r203923037
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/package.scala ---
@@ -36,4 +40,27 @@ package object avro {
 @scala.annotation.varargs
 def avro(sources: String*): DataFrame = 
reader.format("avro").load(sources: _*)
   }
+
+  /**
+   * Converts a binary column of avro format into its corresponding catalyst value. The specified
+   * schema must match the read data, otherwise the behavior is undefined: it may fail or return
+   * an arbitrary result.
+   *
--- End diff --

Shall we add `@since`?


---




[GitHub] spark pull request #21823: [SPARK-24870][SQL]Cache can't work normally if th...

2018-07-19 Thread eatoncys
GitHub user eatoncys opened a pull request:

https://github.com/apache/spark/pull/21823

[SPARK-24870][SQL]Cache can't work normally if there are case letters in SQL

## What changes were proposed in this pull request?
Fixed plan canonicalization so that identifier case no longer prevents two equivalent plans from matching.
Before the PR, the cache could not be reused if the SQL contained mixed-case identifiers, for example:

sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")

sql("select key, sum(case when Key > 0 then 1 else 0 end) as positiveNum " +
  "from src group by key").cache().createOrReplaceTempView("src_cache")
sql(
  s"""select a.key
     from
     (select key from src_cache where positiveNum = 1)a
     left join
     (select key from src_cache)b
     on a.key = b.key
   """).explain

The physical plan of the SQL is:

![image](https://user-images.githubusercontent.com/26834091/42979518-3decf0fa-8c05-11e8-9837-d5e4c334cb1f.png)

The subquery "select key from src_cache where positiveNum = 1" on the left of the join can use the cached data, but the subquery "select key from src_cache" on the right of the join cannot.
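For context, cache reuse hinges on canonicalized-plan equality. A hedged way to observe it (a sketch only, not the PR's test; it assumes the `src_cache` view above and an active `SparkSession` named `spark`) is shown below:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.columnar.InMemoryRelation

// An InMemoryRelation in the optimized plan means the cached data was picked up.
def usesCache(spark: SparkSession, sqlText: String): Boolean =
  spark.sql(sqlText).queryExecution.optimizedPlan.collectFirst {
    case r: InMemoryRelation => r
  }.isDefined

// usesCache(spark, "select key from src_cache where positiveNum = 1") // true
// usesCache(spark, "select key from src_cache") // should also be true after the fix
```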

## How was this patch tested?

A new test was added.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/eatoncys/spark canonicalized

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21823.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21823


commit 2b2a5a33ed58ce07fd2515eb01e80acbedeb8b2a
Author: 10129659 
Date:   2018-07-20T01:43:53Z

Cache can't work normally if there are case letters in SQL




---




[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...

2018-07-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21774#discussion_r203922532
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/package.scala ---
@@ -36,4 +40,27 @@ package object avro {
 @scala.annotation.varargs
 def avro(sources: String*): DataFrame = 
reader.format("avro").load(sources: _*)
   }
+
+  /**
+   * Converts a binary column of avro format into its corresponding catalyst value. The specified
+   * schema must match the read data, otherwise the behavior is undefined: it may fail or return
+   * an arbitrary result.
+   *
+   * @param data the binary column.
+   * @param avroType the avro type.
+   */
+  @Experimental
+  def from_avro(data: Column, avroType: Schema): Column = {
--- End diff --

Why don't we just expose the String version (json) in case we expose it in SQL syntax? I am sure we won't want to have many variants in the future either.
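A hedged sketch of what that String-based variant could look like (hypothetical signature; it assumes `SerializableSchema` simply wraps an Avro `Schema`, which the PR may structure differently):

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.Column

// Hypothetical overload: accept the Avro schema as a JSON string, the form
// that a future SQL-syntax registration could also use.
def from_avro(data: Column, jsonFormatSchema: String): Column = {
  val avroSchema = new Schema.Parser().parse(jsonFormatSchema)
  new Column(AvroDataToCatalyst(data.expr, new SerializableSchema(avroSchema)))
}
```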


---




[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...

2018-07-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21774#discussion_r203922243
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala 
---
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.avro.generic.GenericDatumReader
+import org.apache.avro.io.{BinaryDecoder, DecoderFactory}
+
+import org.apache.spark.sql.avro.{AvroDeserializer, SchemaConverters, SerializableSchema}
+import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, UnaryExpression}
+import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, CodeGenerator, ExprCode}
+import org.apache.spark.sql.types.{AbstractDataType, BinaryType, DataType}
+
+case class AvroDataToCatalyst(child: Expression, avroType: SerializableSchema)
+  extends UnaryExpression with ExpectsInputTypes {
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(BinaryType)
+
+  override lazy val dataType: DataType =
+    SchemaConverters.toSqlType(avroType.value).dataType
+
+  override def nullable: Boolean = true
+
+  @transient private lazy val reader = new GenericDatumReader[Any](avroType.value)
+
+  @transient private lazy val deserializer = new AvroDeserializer(avroType.value, dataType)
+
+  @transient private var decoder: BinaryDecoder = _
+
+  @transient private var result: Any = _
+
+  override def nullSafeEval(input: Any): Any = {
+    val binary = input.asInstanceOf[Array[Byte]]
+    decoder = DecoderFactory.get().binaryDecoder(binary, 0, binary.length, decoder)
+    result = reader.read(result, decoder)
+    deserializer.deserialize(result)
+  }
+
+  override def simpleString: String = {
+    s"from_avro(${child.sql}, ${dataType.simpleString})"
--- End diff --

Shall we use `catalogString` for the datatype?
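A sketch of the suggested tweak on the class above; the appeal of `catalogString` is that, unlike `simpleString`, it does not truncate the field list of wide struct types, which matters for nested Avro records:

```scala
// Hedged sketch of the review suggestion, not the PR's final code:
override def simpleString: String =
  s"from_avro(${child.sql}, ${dataType.catalogString})"
```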


---




[GitHub] spark pull request #21775: [SPARK-24812][SQL] Last Access Time in the table ...

2018-07-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21775#discussion_r203921304
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala 
---
@@ -2250,6 +2251,22 @@ class HiveDDLSuite
 }
   }
 
+  test("desc formatted table for last access verification") {
--- End diff --

let's name it `SPARK-24812: desc formatted table for last access verification`


---




[GitHub] spark issue #21775: [SPARK-24812][SQL] Last Access Time in the table descrip...

2018-07-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21775
  
**[Test build #93311 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93311/testReport)** for PR 21775 at commit [`b527fdc`](https://github.com/apache/spark/commit/b527fdc5919296ffa12e1be54367b9132ecee61e).


---




[GitHub] spark pull request #21775: [SPARK-24812][SQL] Last Access Time in the table ...

2018-07-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21775#discussion_r203920736
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala 
---
@@ -2250,6 +2251,22 @@ class HiveDDLSuite
 }
   }
 
+  test("desc formatted table for last access verification") {
+withTable("t1") {
+  sql(s"create table" +
+s" if not exists t1 (c1_int int, c2_string string, c3_float 
float)")
--- End diff --

nit:

```scala
sql(
"CREATE TABLE IF NOT EXISTS t1 (c1_int INT, c2_string STRING, 
c3_float FLOAT)")
```


---




[GitHub] spark issue #21775: [SPARK-24812][SQL] Last Access Time in the table descrip...

2018-07-19 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21775
  
Seems to make sense.

> seems to be a limitation as of now even from hive, better we can follow 
the hive behavior unless the limitation has been resolved from hive.

Do you maybe know the related Hive-side ticket, or can you point me to the related code?



---




[GitHub] spark issue #21775: [SPARK-24812][SQL] Last Access Time in the table descrip...

2018-07-19 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21775
  
ok to test


---




[GitHub] spark pull request #21822: [SPARK-24865] Remove AnalysisBarrier - WIP

2018-07-19 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21822#discussion_r203918981
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -533,7 +537,8 @@ trait CheckAnalysis extends PredicateHelper {
 
 // Simplify the predicates before validating any unsupported 
correlation patterns
 // in the plan.
-BooleanSimplification(sub).foreachUp {
+// TODO(rxin): Why did this need to call BooleanSimplification???
--- End diff --

cc @dilipbiswal @cloud-fan @gatorsmile 

Why did we need BooleanSimplification here?


---




[GitHub] spark pull request #21813: [SPARK-24424][SQL] Support ANSI-SQL compliant syn...

2018-07-19 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21813#discussion_r203918259
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -442,17 +442,35 @@ class Analyzer(
 child: LogicalPlan): LogicalPlan = {
   val gid = AttributeReference(VirtualColumn.groupingIdName, 
IntegerType, false)()
 
+  // In case of ANSI-SQL compliant syntax for GROUPING SETS, groupByExprs is optional and
+  // can be null. In such a case, we derive the groupByExprs from the user-supplied values
+  // for the grouping sets.
+  val finalGroupByExpressions = if (groupByExprs == Nil) {
+    selectedGroupByExprs.flatten.foldLeft(Seq.empty[Expression]) { (result, currentExpr) =>
--- End diff --

Can we have a test case for it too?
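For readers of the truncated diff above, a hedged completion of the fold (illustrative only; the PR's actual body may differ): it derives a deduplicated GROUP BY list from the grouping-set expressions, comparing by semantic equality:

```scala
// A sketch, not the PR's code: keep each grouping-set expression only once.
val finalGroupByExpressions = if (groupByExprs == Nil) {
  selectedGroupByExprs.flatten.foldLeft(Seq.empty[Expression]) { (result, currentExpr) =>
    if (result.exists(_.semanticEquals(currentExpr))) result
    else result :+ currentExpr
  }
} else {
  groupByExprs
}
```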


---




[GitHub] spark pull request #21813: [SPARK-24424][SQL] Support ANSI-SQL compliant syn...

2018-07-19 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21813#discussion_r203918204
  
--- Diff: sql/core/src/test/resources/sql-tests/inputs/grouping_set.sql ---
@@ -13,5 +13,39 @@ SELECT a, b, c, count(d) FROM grouping GROUP BY a, b, c 
GROUPING SETS ((a));
 -- SPARK-17849: grouping set throws NPE #3
 SELECT a, b, c, count(d) FROM grouping GROUP BY a, b, c GROUPING SETS 
((c));
 
+-- Group sets without explicit group by
+SELECT c1, sum(c2) FROM (VALUES ('x', 10, 0), ('y', 20, 0)) AS t (c1, c2, 
c3) GROUP BY GROUPING SETS (c1);
 
+-- Group sets without group by and with grouping
+SELECT c1, sum(c2), grouping(c1) FROM (VALUES ('x', 10, 0), ('y', 20, 0)) 
AS t (c1, c2, c3) GROUP BY GROUPING SETS (c1);
+
+-- Multiple grouping within a grouping set
+SELECT c1, c2, Sum(c3), grouping__id
+FROM   (VALUES ('x', 'a', 10), ('y', 'b', 20) ) AS t (c1, c2, c3)
+GROUP  BY GROUPING SETS ( ( c1 ), ( c2 ) )
+HAVING GROUPING__ID > 1;
+
+-- Group sets without explicit group by
+SELECT grouping(c1) FROM (VALUES ('x', 'a', 10), ('y', 'b', 20)) AS t (c1, 
c2, c3) GROUP BY c1,c2 GROUPING SETS (c1,c2);
+
+-- Multiple grouping within a grouping set
+SELECT -c1 AS c1 FROM (values (1,2), (3,2)) t(c1, c2) GROUP BY GROUPING 
SETS ((c1), (c1, c2));
+
+-- complex expression in grouping sets
+SELECT a + b, b, sum(c) FROM (VALUES (1,1,1),(2,2,2)) AS t(a,b,c) GROUP BY 
GROUPING SETS ( (a + b), (b));
+
+-- complex expression in grouping sets
+SELECT a + b, b, sum(c) FROM (VALUES (1,1,1),(2,2,2)) AS t(a,b,c) GROUP BY 
GROUPING SETS ( (a + b), (b + a), (b));
+
+-- more query constructs with grouping sets
+SELECT c1 AS col1, c2 AS col2
+FROM   (VALUES (1, 2), (3, 2)) t(c1, c2)
+GROUP  BY GROUPING SETS ( ( c1 ), ( c1, c2 ) )
+HAVING col2 IS NOT NULL
+ORDER  BY -col1;
--- End diff --

Not at all. @dilipbiswal :-)


---




[GitHub] spark pull request #21813: [SPARK-24424][SQL] Support ANSI-SQL compliant syn...

2018-07-19 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/21813#discussion_r203917186
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -442,17 +442,35 @@ class Analyzer(
 child: LogicalPlan): LogicalPlan = {
   val gid = AttributeReference(VirtualColumn.groupingIdName, 
IntegerType, false)()
 
+  // In case of ANSI-SQL compliant syntax for GROUPING SETS, groupByExprs is optional and
+  // can be null. In such a case, we derive the groupByExprs from the user-supplied values
+  // for the grouping sets.
+  val finalGroupByExpressions = if (groupByExprs == Nil) {
+    selectedGroupByExprs.flatten.foldLeft(Seq.empty[Expression]) { (result, currentExpr) =>
--- End diff --

@viirya No. We should be getting an error as we don't have a group by 
specification.


---




[GitHub] spark pull request #21813: [SPARK-24424][SQL] Support ANSI-SQL compliant syn...

2018-07-19 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/21813#discussion_r203917039
  
--- Diff: sql/core/src/test/resources/sql-tests/inputs/grouping_set.sql ---
@@ -13,5 +13,39 @@ SELECT a, b, c, count(d) FROM grouping GROUP BY a, b, c 
GROUPING SETS ((a));
 -- SPARK-17849: grouping set throws NPE #3
 SELECT a, b, c, count(d) FROM grouping GROUP BY a, b, c GROUPING SETS 
((c));
 
+-- Group sets without explicit group by
+SELECT c1, sum(c2) FROM (VALUES ('x', 10, 0), ('y', 20, 0)) AS t (c1, c2, 
c3) GROUP BY GROUPING SETS (c1);
 
+-- Group sets without group by and with grouping
+SELECT c1, sum(c2), grouping(c1) FROM (VALUES ('x', 10, 0), ('y', 20, 0)) 
AS t (c1, c2, c3) GROUP BY GROUPING SETS (c1);
+
+-- Multiple grouping within a grouping set
+SELECT c1, c2, Sum(c3), grouping__id
+FROM   (VALUES ('x', 'a', 10), ('y', 'b', 20) ) AS t (c1, c2, c3)
+GROUP  BY GROUPING SETS ( ( c1 ), ( c2 ) )
+HAVING GROUPING__ID > 1;
+
+-- Group sets without explicit group by
+SELECT grouping(c1) FROM (VALUES ('x', 'a', 10), ('y', 'b', 20)) AS t (c1, 
c2, c3) GROUP BY c1,c2 GROUPING SETS (c1,c2);
+
+-- Multiple grouping within a grouping set
+SELECT -c1 AS c1 FROM (values (1,2), (3,2)) t(c1, c2) GROUP BY GROUPING 
SETS ((c1), (c1, c2));
+
+-- complex expression in grouping sets
+SELECT a + b, b, sum(c) FROM (VALUES (1,1,1),(2,2,2)) AS t(a,b,c) GROUP BY 
GROUPING SETS ( (a + b), (b));
+
+-- complex expression in grouping sets
+SELECT a + b, b, sum(c) FROM (VALUES (1,1,1),(2,2,2)) AS t(a,b,c) GROUP BY 
GROUPING SETS ( (a + b), (b + a), (b));
+
+-- more query constructs with grouping sets
+SELECT c1 AS col1, c2 AS col2
+FROM   (VALUES (1, 2), (3, 2)) t(c1, c2)
+GROUP  BY GROUPING SETS ( ( c1 ), ( c1, c2 ) )
+HAVING col2 IS NOT NULL
+ORDER  BY -col1;
--- End diff --

@viirya Sorry Simon.. do I have to do something for this comment?


---




[GitHub] spark issue #21822: [SPARK-24865] Remove AnalysisBarrier - WIP

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21822
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21821: [SPARK-24867] [SQL] Add AnalysisBarrier to DataFrameWrit...

2018-07-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21821
  
**[Test build #93309 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93309/testReport)** for PR 21821 at commit [`4030e17`](https://github.com/apache/spark/commit/4030e174e6efeeae6158f13675234ed994276371).


---




[GitHub] spark issue #21821: [SPARK-24867] [SQL] Add AnalysisBarrier to DataFrameWrit...

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21821
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21821: [SPARK-24867] [SQL] Add AnalysisBarrier to DataFrameWrit...

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21821
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1149/
Test PASSed.


---




[GitHub] spark issue #21813: [SPARK-24424][SQL] Support ANSI-SQL compliant syntax for...

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21813
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93299/
Test PASSed.


---




[GitHub] spark issue #21813: [SPARK-24424][SQL] Support ANSI-SQL compliant syntax for...

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21813
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #21813: [SPARK-24424][SQL] Support ANSI-SQL compliant syn...

2018-07-19 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21813#discussion_r203914517
  
--- Diff: sql/core/src/test/resources/sql-tests/inputs/grouping_set.sql ---
@@ -13,5 +13,39 @@ SELECT a, b, c, count(d) FROM grouping GROUP BY a, b, c 
GROUPING SETS ((a));
 -- SPARK-17849: grouping set throws NPE #3
 SELECT a, b, c, count(d) FROM grouping GROUP BY a, b, c GROUPING SETS 
((c));
 
+-- Group sets without explicit group by
+SELECT c1, sum(c2) FROM (VALUES ('x', 10, 0), ('y', 20, 0)) AS t (c1, c2, 
c3) GROUP BY GROUPING SETS (c1);
 
+-- Group sets without group by and with grouping
+SELECT c1, sum(c2), grouping(c1) FROM (VALUES ('x', 10, 0), ('y', 20, 0)) 
AS t (c1, c2, c3) GROUP BY GROUPING SETS (c1);
+
+-- Multiple grouping within a grouping set
+SELECT c1, c2, Sum(c3), grouping__id
+FROM   (VALUES ('x', 'a', 10), ('y', 'b', 20) ) AS t (c1, c2, c3)
+GROUP  BY GROUPING SETS ( ( c1 ), ( c2 ) )
+HAVING GROUPING__ID > 1;
+
+-- Group sets without explicit group by
+SELECT grouping(c1) FROM (VALUES ('x', 'a', 10), ('y', 'b', 20)) AS t (c1, 
c2, c3) GROUP BY c1,c2 GROUPING SETS (c1,c2);
--- End diff --

Doesn't this have an explicit group by?


---




[GitHub] spark issue #21652: [SPARK-24551][K8S] Add integration tests for secrets

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21652
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21652: [SPARK-24551][K8S] Add integration tests for secrets

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21652
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93298/
Test PASSed.


---




[GitHub] spark issue #21813: [SPARK-24424][SQL] Support ANSI-SQL compliant syntax for...

2018-07-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21813
  
**[Test build #93299 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93299/testReport)** for PR 21813 at commit [`7cf187d`](https://github.com/apache/spark/commit/7cf187db02a54bcfd3b44e0710d95462b273ea97).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21652: [SPARK-24551][K8S] Add integration tests for secrets

2018-07-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21652
  
**[Test build #93298 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93298/testReport)** for PR 21652 at commit [`67df340`](https://github.com/apache/spark/commit/67df340d943d38afd1ea4c12c02b417b5434970f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #21440: [SPARK-24307][CORE] Support reading remote cached...

2018-07-19 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/21440#discussion_r203914040
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -659,6 +659,11 @@ private[spark] class BlockManager(
* Get block from remote block managers as serialized bytes.
*/
   def getRemoteBytes(blockId: BlockId): Option[ChunkedByteBuffer] = {
+// TODO if we change this method to return the ManagedBuffer, then 
getRemoteValues
+// could just use the inputStream on the temp file, rather than 
memory-mapping the file.
+// Until then, replication can cause the process to use too much 
memory and get killed
+// by the OS / cluster manager (not a java OOM, since it's a memory-mapped file) even though
+// we've read the data to disk.
--- End diff --

I see. I agree with you that YARN could have some issues in calculating the 
exact memory usage.
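A hedged sketch of the direction the TODO describes (`readRemoteValues` and its deserializer argument are hypothetical; the real change would live inside `BlockManager`): stream the fetched block from its temp file instead of memory-mapping it, so the pages are never charged against the container's memory limit:

```scala
import java.io.InputStream
import org.apache.spark.network.buffer.ManagedBuffer

// Sketch only: if getRemoteBytes returned the ManagedBuffer, a caller could
// deserialize from a plain InputStream over the temp file instead of a
// memory-mapped ChunkedByteBuffer.
def readRemoteValues[T](buf: ManagedBuffer)(deserialize: InputStream => Iterator[T]): Iterator[T] = {
  val in = buf.createInputStream() // reads the file; no mmap
  deserialize(in)
}
```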


---




[GitHub] spark issue #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream format for c...

2018-07-19 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/21546
  
>@BryanCutler, this takes longer than I thought. Will complete my review by this weekend. For clarification, still no objection about merging it in orthogonally with my review.

No problem, thanks @HyukjinKwon !


---




[GitHub] spark issue #21822: [SPARK-24865] Remove AnalysisBarrier - WIP

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21822
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93308/
Test FAILed.


---




[GitHub] spark pull request #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream forma...

2018-07-19 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/21546#discussion_r203913178
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -3349,20 +3385,20 @@ class Dataset[T] private[sql](
 }
   }
 
-  /** Convert to an RDD of ArrowPayload byte arrays */
-  private[sql] def toArrowPayload(plan: SparkPlan): RDD[ArrowPayload] = {
+  /** Convert to an RDD of serialized ArrowRecordBatches. */
+  private[sql] def getArrowBatchRdd(plan: SparkPlan): RDD[Array[Byte]] = {
--- End diff --

Yeah, I can't remember why I changed it.. but I think you're right, so I'll change it back.


---




[GitHub] spark issue #21822: [SPARK-24865] Remove AnalysisBarrier - WIP

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21822
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93307/
Test FAILed.


---




[GitHub] spark issue #21653: [SPARK-13343] speculative tasks that didn't commit shoul...

2018-07-19 Thread hthuynh2
Github user hthuynh2 commented on the issue:

https://github.com/apache/spark/pull/21653
  
@tgravescs I updated it. Can you please have a look at it when you have time? Thank you.


---




[GitHub] spark issue #21822: [SPARK-24865] Remove AnalysisBarrier - WIP

2018-07-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21822
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93306/
Test FAILed.


---



