from:"yhuai"

[GitHub] spark issue #21939: [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10....

2018-08-17 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/21939
  
got it. Thank you!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21939: [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10....

2018-08-17 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/21939
  
@shaneknapp what was the version of pyarrow in that build? 0.8 or 0.10?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21939: [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10....

2018-08-17 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/21939
  
@BryanCutler  So, for this upgrade, even the JVM side dependency is 0.10, 
pyspark can work with any version between pyarrow 0.8 to 0.10 without problem?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22003: [SPARK-25019][BUILD] Fix orc dependency to use the same ...

2018-08-06 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/22003
  
@dongjoon-hyun  no problem. Thank you!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22003: [SPARK-25019][BUILD] Fix orc dependency to use the same ...

2018-08-06 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/22003
  
lgtm. Merging to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22003: [SPARK-25019][BUILD] Fix orc dependency to use th...

2018-08-06 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22003#discussion_r207986831
  
--- Diff: sql/core/pom.xml ---
@@ -90,39 +90,11 @@
   org.apache.orc
   orc-core
   ${orc.classifier}
-  
-
-  org.apache.hadoop
-  hadoop-hdfs
-
-
-
-  org.apache.hive
-  hive-storage-api
-
-  
 
 
   org.apache.orc
   orc-mapreduce
   ${orc.classifier}
-  
-
-  org.apache.hadoop
-  hadoop-hdfs
-
-
-
-  org.apache.hive
-  hive-storage-api
-
-  
--- End diff --

got it. Thank you.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22003: [SPARK-25019][BUILD] Fix orc dependency to use th...

2018-08-06 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22003#discussion_r207962501
  
--- Diff: sql/core/pom.xml ---
@@ -90,39 +90,11 @@
   org.apache.orc
   orc-core
   ${orc.classifier}
-  
-
-  org.apache.hadoop
-  hadoop-hdfs
-
-
-
-  org.apache.hive
-  hive-storage-api
-
-  
 
 
   org.apache.orc
   orc-mapreduce
   ${orc.classifier}
-  
-
-  org.apache.hadoop
-  hadoop-hdfs
-
-
-
-  org.apache.hive
-  hive-storage-api
-
-  
--- End diff --

Thank you. Just for me to understand it better. Do you know why defining 
exclusions in this pom file messed up the pom?

Also, how should I try it out myself? What is the right command to publish 
locally?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22003: [SPARK-25019][BUILD] Fix orc dependency to use th...

2018-08-06 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22003#discussion_r207888608
  
--- Diff: sql/core/pom.xml ---
@@ -90,39 +90,11 @@
   org.apache.orc
   orc-core
   ${orc.classifier}
-  
-
-  org.apache.hadoop
-  hadoop-hdfs
-
-
-
-  org.apache.hive
-  hive-storage-api
-
-  
 
 
   org.apache.orc
   orc-mapreduce
   ${orc.classifier}
-  
-
-  org.apache.hadoop
-  hadoop-hdfs
-
-
-
-  org.apache.hive
-  hive-storage-api
-
-  
--- End diff --

@dongjoon-hyun when we publish snapshot artifacts or releases, will the pom 
for spark sql get all of exclusions defined in the parent pom?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21865: [SPARK-24895] Remove spotbugs plugin

2018-07-24 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/21865
  
lgtm. I am merging this PR to master branch. Then, I will kick off 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21865: [SPARK-24895] Remove spotbugs plugin

2018-07-24 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/21865
  
cc @HyukjinKwon @kiszk 

I will merge this PR once it passes the test.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20473: [SPARK-23300][TESTS] Prints out if Pandas and PyA...

2018-02-01 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/20473#discussion_r16362
  
--- Diff: python/run-tests.py ---
@@ -151,6 +151,38 @@ def parse_opts():
 return opts
 
 
+def _check_dependencies(python_exec, modules_to_test):
+if "COVERAGE_PROCESS_START" in os.environ:
+# Make sure if coverage is installed.
+try:
+subprocess_check_output(
+[python_exec, "-c", "import coverage"],
+stderr=open(os.devnull, 'w'))
+except:
+print_red("Coverage is not installed in Python executable '%s' 
"
+  "but 'COVERAGE_PROCESS_START' environment variable 
is set, "
+  "exiting." % python_exec)
+sys.exit(-1)
+
+if pyspark_sql in modules_to_test:
+# If we should test 'pyspark-sql', it checks if PyArrow and Pandas 
are installed and
+# explicitly prints out. See SPARK-23300.
+try:
+subprocess_check_output(
+[python_exec, "-c", "import pyarrow"],
+stderr=open(os.devnull, 'w'))
+except:
--- End diff --

Thank you. Appreciate it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

2018-02-01 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/19872#discussion_r165449847
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
 ---
@@ -199,7 +200,7 @@ object ExtractFiltersAndInnerJoins extends 
PredicateHelper {
 object PhysicalAggregation {
   // groupingExpressions, aggregateExpressions, resultExpressions, child
   type ReturnType =
-(Seq[NamedExpression], Seq[AggregateExpression], Seq[NamedExpression], 
LogicalPlan)
+(Seq[NamedExpression], Seq[Expression], Seq[NamedExpression], 
LogicalPlan)
--- End diff --

It will be good to try it out soon. But it is not urgent.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20473: [SPARK-23300][TESTS] Prints out if Pandas and PyA...

2018-02-01 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/20473#discussion_r165445947
  
--- Diff: python/run-tests.py ---
@@ -151,6 +151,38 @@ def parse_opts():
 return opts
 
 
+def _check_dependencies(python_exec, modules_to_test):
+if "COVERAGE_PROCESS_START" in os.environ:
+# Make sure if coverage is installed.
+try:
+subprocess_check_output(
+[python_exec, "-c", "import coverage"],
+stderr=open(os.devnull, 'w'))
+except:
+print_red("Coverage is not installed in Python executable '%s' 
"
+  "but 'COVERAGE_PROCESS_START' environment variable 
is set, "
+  "exiting." % python_exec)
+sys.exit(-1)
+
+if pyspark_sql in modules_to_test:
+# If we should test 'pyspark-sql', it checks if PyArrow and Pandas 
are installed and
+# explicitly prints out. See SPARK-23300.
+try:
+subprocess_check_output(
+[python_exec, "-c", "import pyarrow"],
+stderr=open(os.devnull, 'w'))
+except:
--- End diff --

Actually, since we are here, is it possible to do the same thing as 
https://github.com/apache/spark/blob/ec63e2d0743a4f75e1cce21d0fe2b54407a86a4a/python/pyspark/sql/tests.py#L51-L63
 and 
https://github.com/apache/spark/blob/ec63e2d0743a4f75e1cce21d0fe2b54407a86a4a/python/pyspark/sql/tests.py#L78-L84?

It will be nice to use the same logic. Otherwise, even we do not print the 
warning at here, tests may still get skipped because of the version issue.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20473: [SPARK-23300][TESTS] Prints out if Pandas and PyA...

2018-02-01 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/20473#discussion_r165445232
  
--- Diff: python/run-tests.py ---
@@ -151,6 +151,38 @@ def parse_opts():
 return opts
 
 
+def _check_dependencies(python_exec, modules_to_test):
+if "COVERAGE_PROCESS_START" in os.environ:
+# Make sure if coverage is installed.
+try:
+subprocess_check_output(
+[python_exec, "-c", "import coverage"],
+stderr=open(os.devnull, 'w'))
+except:
+print_red("Coverage is not installed in Python executable '%s' 
"
+  "but 'COVERAGE_PROCESS_START' environment variable 
is set, "
+  "exiting." % python_exec)
+sys.exit(-1)
+
+if pyspark_sql in modules_to_test:
+# If we should test 'pyspark-sql', it checks if PyArrow and Pandas 
are installed and
+# explicitly prints out. See SPARK-23300.
+try:
+subprocess_check_output(
+[python_exec, "-c", "import pyarrow"],
+stderr=open(os.devnull, 'w'))
+except:
--- End diff --

How about we also explicitly mention that pyarrow/pandas related tests will 
run if they are installed?




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

2018-01-31 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/20465
  
So, jenkins jobs run those tests with python3? If so, I feel better because 
those tests are not completely skipped in Jenkins. If it is hard to make them 
run with python 2. Letâs have a log to explicitly show if we are going to run 
tests using pandas/pyarrow, which will help us confirm if they get exercised 
with python 3 in Jenkins or not.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

2018-01-31 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/20465
  
@felixcheung jenkins is actually skipping those tests (see the failure of 
this pr). It makes sense to provide a way to allow developers to not run those 
tests. But, I'd prefer that we run those tests by default. So, we can make sure 
that jenkins is doing the right thing.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

2018-01-31 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/19872#discussion_r165253818
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -4353,6 +4347,446 @@ def test_unsupported_types(self):
 df.groupby('id').apply(f).collect()
 
 
+@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not 
installed")
+class GroupbyAggPandasUDFTests(ReusedSQLTestCase):
+
+@property
+def data(self):
+from pyspark.sql.functions import array, explode, col, lit
+return self.spark.range(10).toDF('id') \
+.withColumn("vs", array([lit(i * 1.0) + col('id') for i in 
range(20, 30)])) \
+.withColumn("v", explode(col('vs'))) \
+.drop('vs') \
+.withColumn('w', lit(1.0))
+
+@property
+def python_plus_one(self):
+from pyspark.sql.functions import udf
+
+@udf('double')
+def plus_one(v):
+assert isinstance(v, (int, float))
+return v + 1
+return plus_one
+
+@property
+def pandas_scalar_plus_two(self):
+import pandas as pd
+from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+@pandas_udf('double', PandasUDFType.SCALAR)
+def plus_two(v):
+assert isinstance(v, pd.Series)
+return v + 2
+return plus_two
+
+@property
+def pandas_agg_mean_udf(self):
+from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+@pandas_udf('double', PandasUDFType.GROUP_AGG)
+def avg(v):
+return v.mean()
+return avg
+
+@property
+def pandas_agg_sum_udf(self):
+from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+@pandas_udf('double', PandasUDFType.GROUP_AGG)
+def sum(v):
+return v.sum()
+return sum
+
+@property
+def pandas_agg_weighted_mean_udf(self):
+import numpy as np
+from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+@pandas_udf('double', PandasUDFType.GROUP_AGG)
+def weighted_mean(v, w):
+return np.average(v, weights=w)
+return weighted_mean
+
+def test_manual(self):
+df = self.data
+sum_udf = self.pandas_agg_sum_udf
+mean_udf = self.pandas_agg_mean_udf
+
+result1 = df.groupby('id').agg(sum_udf(df.v), 
mean_udf(df.v)).sort('id')
+expected1 = self.spark.createDataFrame(
+[[0, 245.0, 24.5],
+ [1, 255.0, 25.5],
+ [2, 265.0, 26.5],
+ [3, 275.0, 27.5],
+ [4, 285.0, 28.5],
+ [5, 295.0, 29.5],
+ [6, 305.0, 30.5],
+ [7, 315.0, 31.5],
+ [8, 325.0, 32.5],
+ [9, 335.0, 33.5]],
+['id', 'sum(v)', 'avg(v)'])
+
+self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
+
+def test_basic(self):
+from pyspark.sql.functions import col, lit, sum, mean
+
+df = self.data
+weighted_mean_udf = self.pandas_agg_weighted_mean_udf
+
+# Groupby one column and aggregate one UDF with literal
+result1 = df.groupby('id').agg(weighted_mean_udf(df.v, 
lit(1.0))).sort('id')
+expected1 = 
df.groupby('id').agg(mean(df.v).alias('weighted_mean(v, 1.0)')).sort('id')
+self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
+
+# Groupby one expression and aggregate one UDF with literal
+result2 = df.groupby((col('id') + 1)).agg(weighted_mean_udf(df.v, 
lit(1.0)))\
+.sort(df.id + 1)
+expected2 = df.groupby((col('id') + 1))\
+.agg(mean(df.v).alias('weighted_mean(v, 1.0)')).sort(df.id + 1)
+self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
+
+# Groupby one column and aggregate one UDF without literal
+result3 = df.groupby('id').agg(weighted_mean_udf(df.v, 
df.w)).sort('id')
+expected3 = 
df.groupby('id').agg(mean(df.v).alias('weighted_mean(v, w)')).sort('id')
+self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
+
+# Groupby one expression and aggregate one UDF without literal
+result4 = df.groupby((col('id') + 1).alias('id'))\
+.agg(weighted_mean_udf(df.v, df.w))\
+.sort('id')
+expected4 = df.groupby((col('id') + 1).alias('id'))\
+.agg(mean(df.v).alias('weighted_mean(v, w)'))\
+.sort('id')
+

[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

2018-01-31 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/19872#discussion_r165253514
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
 ---
@@ -199,7 +200,7 @@ object ExtractFiltersAndInnerJoins extends 
PredicateHelper {
 object PhysicalAggregation {
   // groupingExpressions, aggregateExpressions, resultExpressions, child
   type ReturnType =
-(Seq[NamedExpression], Seq[AggregateExpression], Seq[NamedExpression], 
LogicalPlan)
+(Seq[NamedExpression], Seq[Expression], Seq[NamedExpression], 
LogicalPlan)
--- End diff --

I prefer that we try out using a new rule. We can create utility function 
to reuse code. Will you have a chance to try it out?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...

2018-01-31 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/19872#discussion_r165220142
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
 ---
@@ -199,7 +200,7 @@ object ExtractFiltersAndInnerJoins extends 
PredicateHelper {
 object PhysicalAggregation {
   // groupingExpressions, aggregateExpressions, resultExpressions, child
   type ReturnType =
-(Seq[NamedExpression], Seq[AggregateExpression], Seq[NamedExpression], 
LogicalPlan)
+(Seq[NamedExpression], Seq[Expression], Seq[NamedExpression], 
LogicalPlan)
--- End diff --

@icexelloss Thank you for this contribution! I just came across the change 
in this file. I am not sure if changing the type at here is the best option. 
The reason is that whenever we use this PhysicalAggregation rule, we have to 
check the instance type of those aggregate expressions and do casting. To me, 
it seems better to leave this rule untouched and create a new rule just for 
Python UDAF. What do you think?

(maybe you and reviewers already discussed it. If so, can you point me to 
the discussion?)

Thank you!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20037: [SPARK-22849] ivy.retrieve pattern should also co...

2018-01-23 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/20037#discussion_r163463718
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -1271,7 +1271,7 @@ private[spark] object SparkSubmitUtils {
 // retrieve all resolved dependencies
 ivy.retrieve(rr.getModuleDescriptor.getModuleRevisionId,
   packagesDirectory.getAbsolutePath + File.separator +
-"[organization]_[artifact]-[revision].[ext]",
+"[organization]_[artifact]-[revision](-[classifier]).[ext]",
--- End diff --

I tried it today. Somehow, I only got the test jar downloaded. Have you 
guys seen this issue?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnin...

2017-12-28 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/20110
  
Thank you! Let's also check the build result to make sure 
`pyspark.streaming.tests.FlumePollingStreamTests` is indeed triggered (I hit 
this issue while running this test). 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19535: [SPARK-22313][PYTHON] Mark/print deprecation warn...

2017-12-28 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/19535#discussion_r159019845
  
--- Diff: python/pyspark/streaming/flume.py ---
@@ -54,8 +54,13 @@ def createStream(ssc, hostname, port,
 :param bodyDecoder:  A function used to decode body (default is 
utf8_decoder)
 :return: A DStream object
 
-.. note:: Deprecated in 2.3.0
+.. note:: Deprecated in 2.3.0. Flume support is deprecated as of 
Spark 2.3.0.
+See SPARK-22142.
 """
+warnings.warn(
--- End diff --

thank you :) It will be good to also check why master build does not fail 
since python should complain about it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19535: [SPARK-22313][PYTHON] Mark/print deprecation warn...

2017-12-28 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/19535#discussion_r159013024
  
--- Diff: python/pyspark/streaming/flume.py ---
@@ -54,8 +54,13 @@ def createStream(ssc, hostname, port,
 :param bodyDecoder:  A function used to decode body (default is 
utf8_decoder)
 :return: A DStream object
 
-.. note:: Deprecated in 2.3.0
+.. note:: Deprecated in 2.3.0. Flume support is deprecated as of 
Spark 2.3.0.
+See SPARK-22142.
 """
+warnings.warn(
--- End diff --

Seems `warnings` is not imported in this file?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #5604: [SPARK-1442][SQL] Window Function Support for Spar...

2017-12-19 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/5604#discussion_r157933488
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
 ---
@@ -0,0 +1,340 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.analysis.UnresolvedException
+import org.apache.spark.sql.catalyst.errors.TreeNodeException
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.types.{NumericType, DataType}
+
+/**
+ * The trait of the Window Specification (specified in the OVER clause or 
WINDOW clause) for
+ * Window Functions.
+ */
+sealed trait WindowSpec
+
+/**
+ * The specification for a window function.
+ * @param partitionSpec It defines the way that input rows are partitioned.
+ * @param orderSpec It defines the ordering of rows in a partition.
+ * @param frameSpecification It defines the window frame in a partition.
+ */
+case class WindowSpecDefinition(
+partitionSpec: Seq[Expression],
+orderSpec: Seq[SortOrder],
+frameSpecification: WindowFrame) extends Expression with WindowSpec {
+
+  def validate: Option[String] = frameSpecification match {
+case UnspecifiedFrame =>
+  Some("Found a UnspecifiedFrame. It should be converted to a 
SpecifiedWindowFrame " +
+"during analysis. Please file a bug report.")
+case frame: SpecifiedWindowFrame => frame.validate.orElse {
+  def checkValueBasedBoundaryForRangeFrame(): Option[String] = {
+if (orderSpec.length > 1)  {
+  // It is not allowed to have a value-based PRECEDING and 
FOLLOWING
+  // as the boundary of a Range Window Frame.
+  Some("This Range Window Frame only accepts at most one ORDER BY 
expression.")
+} else if (orderSpec.nonEmpty && 
!orderSpec.head.dataType.isInstanceOf[NumericType]) {
+  Some("The data type of the expression in the ORDER BY clause 
should be a numeric type.")
+} else {
+  None
+}
+  }
+
+  (frame.frameType, frame.frameStart, frame.frameEnd) match {
+case (RangeFrame, vp: ValuePreceding, _) => 
checkValueBasedBoundaryForRangeFrame()
+case (RangeFrame, vf: ValueFollowing, _) => 
checkValueBasedBoundaryForRangeFrame()
+case (RangeFrame, _, vp: ValuePreceding) => 
checkValueBasedBoundaryForRangeFrame()
+case (RangeFrame, _, vf: ValueFollowing) => 
checkValueBasedBoundaryForRangeFrame()
+case (_, _, _) => None
+  }
+}
+  }
+
+  type EvaluatedType = Any
+
+  override def children: Seq[Expression]  = partitionSpec ++ orderSpec
+
+  override lazy val resolved: Boolean =
+childrenResolved && 
frameSpecification.isInstanceOf[SpecifiedWindowFrame]
+
+
+  override def toString: String = simpleString
+
+  override def eval(input: Row): EvaluatedType = throw new 
UnsupportedOperationException
+  override def nullable: Boolean = true
+  override def foldable: Boolean = false
+  override def dataType: DataType = throw new UnsupportedOperationException
+}
+
+/**
+ * A Window specification reference that refers to the 
[[WindowSpecDefinition]] defined
+ * under the name `name`.
+ */
+case class WindowSpecReference(name: String) extends WindowSpec
+
+/**
+ * The trait used to represent the type of a Window Frame.
+ */
+sealed trait FrameType
+
+/**
+ * RowFrame treats rows in a partition individually. When a 
[[ValuePreceding]]
+ * or a [[ValueFollowing]] is used as its [[FrameBoundary]], the value is 
considered
+ * as a physical offset.
+ * For example, `ROW BETWEEN 1 PRECEDING AND 1 FOLLOWING` represents a 
3-row frame,
+ *

[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...

2017-10-13 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/19448
  
Thank you :)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...

2017-10-13 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/19448
  
I am not really worried about this particular change. It's already merged 
and it seems a small and safe change. I am not planning to revert it.

But, in general, let's avoid of merging changes that are not bug fixes to a 
maintenance branch. If there is an exception, it will be better to make it 
clear earlier. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...

2017-10-13 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/19448
  
@HyukjinKwon branch-2.2 is in a maintenance branch, I am not sure it is 
appropriate to merge this change to branch-2.2 since it is not really a bug 
fix. If the doc is not accurate, we should fix the doc. For a maintenance 
branch, we need to be very careful on what we merge and we should always avoid 
of unnecessary changes. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19149: [SPARK-21652][SQL][FOLLOW-UP] Fix rule conflict between ...

2017-09-29 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/19149
  
Can we add a test?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19080: [SPARK-21865][SQL] simplify the distribution sema...

2017-08-30 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/19080#discussion_r136214689
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
 ---
@@ -30,18 +30,43 @@ import org.apache.spark.sql.types.{DataType, 
IntegerType}
  *  - Intra-partition ordering of data: In this case the distribution 
describes guarantees made
  *about how tuples are distributed within a single partition.
  */
-sealed trait Distribution
+sealed trait Distribution {
+  /**
+   * The required number of partitions for this distribution. If it's 
None, then any number of
+   * partitions is allowed for this distribution.
+   */
+  def requiredNumPartitions: Option[Int]
+
+  /**
+   * Creates a default partitioning for this distribution, which can 
satisfy this distribution while
+   * matching the given number of partitions.
+   */
+  def createPartitioning(numPartitions: Int): Partitioning
+}
 
 /**
  * Represents a distribution where no promises are made about co-location 
of data.
  */
-case object UnspecifiedDistribution extends Distribution
+case object UnspecifiedDistribution extends Distribution {
+  override def requiredNumPartitions: Option[Int] = None
+
+  override def createPartitioning(numPartitions: Int): Partitioning = {
+throw new IllegalStateException("UnspecifiedDistribution does not have 
default partitioning.")
+  }
+}
 
 /**
  * Represents a distribution that only has a single partition and all 
tuples of the dataset
  * are co-located.
  */
-case object AllTuples extends Distribution
+case object AllTuples extends Distribution {
--- End diff --

I'd like to keep `AllTuples`. `SingleNodeDistribution` is a special case of 
`AllTuples` and seems we do not really need the extra information introduced by 
`SingleNode`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...

2017-08-30 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/19080
  
Have a question after reading the new approach. Let's say that we have a 
join like `T1 JOIN T2 on T1.a = T2.a`. Also `T1` is hash partitioned by the 
value of `T1.a` and it has 10 partitions, and `T2` is range partitioned by the 
value of `T2.a` and it has 10 partitions. Both sides will satisfy the required 
distribution of the join. However, we need to add an exchange at either side in 
order to produce the correct result. How will we handle this case with this 
change?

Also, regarding
> For multiple children, Spark only guarantees they have the same number of 
partitions, and it's the operator's responsibility to leverage this guarantee 
to achieve more complicated requirements. 

Can you give a concrete example?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18944: [SPARK-21732][SQL]Lazily init hive metastore client

2017-08-14 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/18944
  
lgtm


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18316: [SPARK-21111] [TEST] [2.2] Fix the test failure of descr...

2017-06-15 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/18316
  
Thanks! I have merged this pr to branch-2.2.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18316: [SPARK-21111] [TEST] [2.2] Fix the test failure of descr...

2017-06-15 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/18316
  
thanks! merging to branch-2.2


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18316: [SPARK-21111] [TEST] [2.2] Fix the test failure of descr...

2017-06-15 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/18316
  
lgtm


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18064: [SPARK-20213][SQL] Fix DataFrameWriter operations in SQL...

2017-06-08 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/18064
  
My suggestion was about getting changes on the interfaces of 
ExecutedCommandExec and SaveIntoDataSourceCommand to separate prs. It will help 
code review (both speed and quality).



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18148: [SPARK-20926][SQL] Removing exposures to guava library c...

2017-06-07 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/18148
  
@vanzin Seems merging to branch-2.2 was an accident? Since it is not really 
a bug fix, should we revert it from branch-2.2 and just keep it in the master?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18064: [SPARK-20213][SQL] Fix DataFrameWriter operations in SQL...

2017-06-07 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/18064
  
I just case across this pr. I have one general feedback. It will be great 
if we can make a pr have a single purpose. This pr contains different kinds of 
changes in order to fix the UI. If refactoring is needed, I'd recommend to have 
separate PR for refactoring purposes. Then, use a different PR to do the actual 
fix.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18172: [SPARK-20946][SQL] simplify the config setting logic in ...

2017-06-02 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/18172
  
Reverting this because it breaks repl tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17617: [SPARK-20244][Core] Handle incorrect bytesRead me...

2017-06-02 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/17617#discussion_r119938185
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala 
---
@@ -143,14 +144,29 @@ class SparkHadoopUtil extends Logging {
* Returns a function that can be called to find Hadoop FileSystem bytes 
read. If
* getFSBytesReadOnThreadCallback is called from thread r at time t, the 
returned callback will
* return the bytes read on r since t.
-   *
-   * @return None if the required method can't be found.
--- End diff --

Why removing this line instead of the doc?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17763: [SPARK-13747][Core]Add ThreadUtils.awaitReady and disall...

2017-05-17 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/17763
  
lgtm


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17666: [SPARK-20311][SQL] Support aliases for table value funct...

2017-05-09 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/17666
  
I have reverted this change from both master and branch-2.2. I have 
reopened the jira.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17666: [SPARK-20311][SQL] Support aliases for table value funct...

2017-05-09 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/17666
  
I am going to revert this PR from master and branch-2.2. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17666: [SPARK-20311][SQL] Support aliases for table value funct...

2017-05-09 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/17666
  
@maropu Sorry. I think this PR introduces a regression.

```
scala> spark.sql("select * from range(1, 10) cross join range(1, 
10)").explain
== Physical Plan ==
org.apache.spark.sql.AnalysisException: Detected cartesian product for 
INNER join between logical plans
Range (1, 10, step=1, splits=None)
and
Range (1, 10, step=1, splits=None)
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these 
relations.;
```

I think we are taking the cross as the alias. 

I reverted your change locally and the query worked. I am attaching the 
expected analyzed plan below.
```
scala> spark.sql("select * from range(1, 10) cross join range(1, 
10)").queryExecution.analyzed
res1: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [id#8L, id#9L]
+- Join Cross
   :- Range (1, 10, step=1, splits=None)
   +- Range (1, 10, step=1, splits=None)

```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17905: [SPARK-20661][SPARKR][TEST][FOLLOWUP] SparkR tableNames(...

2017-05-08 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/17905
  
i see. I think 
https://github.com/apache/spark/pull/17905/commits/d4c1a9db25ee7386f7b12e4dabb54210a9892510
 is good. How about we get it checked in first (after jenkins passes)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17905: [SPARK-20661][SPARKR][TEST][FOLLOWUP] SparkR tableNames(...

2017-05-08 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/17905
  
lgtm


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17905: [SPARK-20661][SPARKR][TEST][FOLLOWUP] SparkR tableNames(...

2017-05-08 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/17905
  
@falaki's PR did not actually trigger that test. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17905: [SPARK-20661][SPARKR][TEST][FOLLOWUP] SparkR tableNames(...

2017-05-08 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/17905
  
@felixcheung  you are right. That is the problem.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17903: [SPARK-20661][SparkR][Test] SparkR tableNames() test fai...

2017-05-08 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/17903
  
I do not think https://github.com/apache/spark/pull/17649 caused the 
problem. I saw failures without that internally. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17903: [SPARK-20661][SparkR][Test] SparkR tableNames() test fai...

2017-05-08 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/17903
  
Thanks @falaki. Merging to master and branch-2.2.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17903: [SPARK-20661][SparkR][Test] SparkR tableNames() test fai...

2017-05-08 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/17903
  
Seems 2.2 build is fine. But, I'd like to get this merged in branch-2.2 
since this test will fail if any previous tests leak tables.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17903: [SPARK-20661][SparkR][Test] SparkR tableNames() test fai...

2017-05-08 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/17903
  
@felixcheung fyi. I think the main problem of this test is that it will be 
broken if tests executed before this one leak any table. I think this change 
makes sense. I will merge it once it passes jenkins.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17892: [SPARK-20626][SPARKR] address date test warning with tim...

2017-05-08 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/17892
  
@felixcheung Seems master build is broken because R tests are broken 
(https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/2844/console).
 I am not sure if this PR caused that. Can you help to take a look?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17746: [SPARK-20449][ML] Upgrade breeze version to 0.13.1

2017-05-01 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/17746
  
@dbtsai Thanks for the explanation and the context :) 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17746: [SPARK-20449][ML] Upgrade breeze version to 0.13.1

2017-05-01 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/17746
  
Can I ask how we decided merging this dependency change after the cut of 
the release branch (especially this change affects user code)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17659: [SPARK-20358] [core] Executors failing stage on interrup...

2017-04-20 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/17659
  
lgtm. Merging to master and branch-2.2.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17531: [SPARK-20217][core] Executor should not fail stage if ki...

2017-04-05 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/17531
  
Thanks. Merging to master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17531: [SPARK-20217][core] Executor should not fail stage if ki...

2017-04-05 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/17531
  
test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17423: [SPARK-20088] Do not create new SparkContext in SparkR c...

2017-03-26 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/17423
  
got it. Thanks :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17423: [SPARK-20088] Do not create new SparkContext in SparkR c...

2017-03-25 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/17423
  
@felixcheung `SparkContext.getOrCreate` is the preferred way to create a 
SparkContext. So, even we have check, it is still better to use `getOrCreate`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16952: [SPARK-19620][SQL]Fix incorrect exchange coordinator id ...

2017-03-10 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16952
  
LGTM. Merging to master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17156: [SPARK-19816][SQL][Tests] Fix an issue that DataFrameCal...

2017-03-03 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/17156
  
merged to branch-2.1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17156: [SPARK-19816][SQL][Tests] Fix an issue that DataFrameCal...

2017-03-03 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/17156
  
Let's also merge this to branch-2.1.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16917: [SPARK-19529][BRANCH-1.6] Backport PR #16866 to branch-1...

2017-02-27 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16917
  
Let's use a meaningful title in future :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16935: [SPARK-19604] [TESTS] Log the start of every Python test

2017-02-15 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16935
  
cool. It has been merged.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16935: [SPARK-19604] [TESTS] Log the start of every Python test

2017-02-15 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16935
  
Seems I cannot merge now... Will try again later.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16935: [SPARK-19604] [TESTS] Log the start of every Python test

2017-02-15 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16935
  
ok. Nothing new to add. I will merge this to master and branch-2.1 (in case 
we want to debug any python test hanging issue in branch-2.1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16935: [SPARK-19604] [TESTS] Log the start of every Python test

2017-02-14 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16935
  
Let's not merge it right now. I may need to log more.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16935: [SPARK-19604] [TESTS] Log the start of every Pyth...

2017-02-14 Thread yhuai

GitHub user yhuai opened a pull request:

https://github.com/apache/spark/pull/16935

[SPARK-19604] [TESTS] Log the start of every Python test


## What changes were proposed in this pull request?
Right now, we only have info level log after we finish the tests of a 
Python test file. We should also log the start of a test. So, if a test is 
hanging, we can tell which test file is running.

## How was this patch tested?
This is a change for python tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yhuai/spark SPARK-19604

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16935.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16935


commit 1181cc3be7bcf21fbe7e88b35ac662353fb2f366
Author: Yin Huai <yh...@databricks.com>
Date:   2017-02-15T04:19:28Z

Right now, we only have info level log after we finish the tests of a 
Python test file. We should also log the start of a test. So, if a test is 
hanging, we can tell which test file is running.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16894: [SPARK-17897] [SQL] [BACKPORT-2.0] Fixed IsNotNull Const...

2017-02-12 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16894
  
thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16067: [SPARK-17897] [SQL] Fixed IsNotNull Constraint Inference...

2017-02-10 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16067
  
@gatorsmile can we also add it in branch-2.0? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16649: [SPARK-19295] [SQL] IsolatedClientLoader's downloadVersi...

2017-01-19 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16649
  
Cool I am merging this to master. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16645: [SPARK-19290][SQL] add a new extending interface in Anal...

2017-01-19 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16645
  
My main concern of this pr is that if people will think it is recommended 
to add new batches to force those rules running in a certain ordering. For 
these resolution rules, we can also use conditions to control when they will 
fire, right? If we will always replace a logical plan to another one in the 
analysis phase, seems we should use `resolved` to control if a rule will fired. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16649: [SPARK-19295] [SQL] IsolatedClientLoader's downlo...

2017-01-19 Thread yhuai

GitHub user yhuai opened a pull request:

https://github.com/apache/spark/pull/16649

[SPARK-19295] [SQL] IsolatedClientLoader's downloadVersion should log the 
location of downloaded metastore client jars

## What changes were proposed in this pull request?
This will help the users to know the location of those downloaded jars.

## How was this patch tested?
jenkins


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yhuai/spark SPARK-19295

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16649.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16649


commit 6c67582d85473d123053a45aa051578232c32dad
Author: Yin Huai <yh...@databricks.com>
Date:   2017-01-19T20:30:08Z

[SPARK-19295] IsolatedClientLoader's downloadVersion should log the 
location of downloaded metastore client jars




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16613: [SPARK-19024][SQL] Implement new approach to write a per...

2017-01-18 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16613
  
nvm. After second thought, the feature flag does not really buy us 
anything. We just store the original view definition and the column mapping in 
the metastore. So, I think it is fine to just do the switch. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16628: Update known_translations for contributor names

2017-01-18 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16628
  
I am merging this to master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14204: [SPARK-16520] [WEBUI] Link executors to corresponding wo...

2017-01-18 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/14204
  
ok I agree. Originally, I thought it will be helpful to figure out the 
worker that an executor belongs to.

But, if it does not provide very useful information. I am fine to drop it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16628: Update known_translations for contributor names

2017-01-18 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16628
  
done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16613: [SPARK-19024][SQL] Implement new approach to write a per...

2017-01-18 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16613
  
is there a feature flag that is used to determine if we use this new 
approach? I feel it will be good to have an internal feature flag to determine 
the code path. So, if there is something wrong that is hard to fix quickly 
before the release, we can still switch back to the old code path. Then, in the 
next release, we can remove the feature flag. What do you think?

Also, @jiangxb1987 can you take a look at the SQLViewSuite and see if we 
have enough test coverage?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16517: [SPARK-18243][SQL] Port Hive writing to use FileFormat i...

2017-01-17 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16517
  
Looks good to me. @gatorsmile can you explain your concerns? I am wondering 
what kind of cases that you think HiveFileFormat may not be able to handle.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...

2017-01-17 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/16517#discussion_r96566857
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -276,40 +276,31 @@ case class InsertIntoHiveTable(
   }
 }
 
-val jobConf = new JobConf(hadoopConf)
-val jobConfSer = new SerializableJobConf(jobConf)
-
-// When speculation is on and output committer class name contains 
"Direct", we should warn
-// users that they may loss data if they are using a direct output 
committer.
-val speculationEnabled = 
sqlContext.sparkContext.conf.getBoolean("spark.speculation", false)
-val outputCommitterClass = 
jobConf.get("mapred.output.committer.class", "")
-if (speculationEnabled && outputCommitterClass.contains("Direct")) {
--- End diff --

seems this change is unnecessary and users may still use direct output 
committer (they can still find the code on Internet). Let's keep the warning.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...

2017-01-17 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/16517#discussion_r96566523
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -276,40 +276,31 @@ case class InsertIntoHiveTable(
   }
 }
 
-val jobConf = new JobConf(hadoopConf)
-val jobConfSer = new SerializableJobConf(jobConf)
-
-// When speculation is on and output committer class name contains 
"Direct", we should warn
-// users that they may loss data if they are using a direct output 
committer.
-val speculationEnabled = 
sqlContext.sparkContext.conf.getBoolean("spark.speculation", false)
-val outputCommitterClass = 
jobConf.get("mapred.output.committer.class", "")
-if (speculationEnabled && outputCommitterClass.contains("Direct")) {
--- End diff --

Do we still need this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...

2017-01-17 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/16517#discussion_r96566290
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import scala.collection.JavaConverters._
+
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.hive.ql.exec.Utilities
+import org.apache.hadoop.hive.ql.io.{HiveFileFormatUtils, HiveOutputFormat}
+import org.apache.hadoop.hive.serde2.Serializer
+import 
org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspectorUtils, 
StructObjectInspector}
+import 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.ObjectInspectorCopyOption
+import org.apache.hadoop.io.Writable
+import org.apache.hadoop.mapred.{JobConf, Reporter}
+import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext}
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.datasources.{FileFormat, 
OutputWriter, OutputWriterFactory}
+import org.apache.spark.sql.hive.{HiveInspectors, HiveTableUtil}
+import org.apache.spark.sql.hive.HiveShim.{ShimFileSinkDesc => 
FileSinkDesc}
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.SerializableJobConf
+
+/**
+ * `FileFormat` for writing Hive tables.
+ *
+ * TODO: implement the read logic.
+ */
+class HiveFileFormat(fileSinkConf: FileSinkDesc) extends FileFormat {
+  override def inferSchema(
+  sparkSession: SparkSession,
+  options: Map[String, String],
+  files: Seq[FileStatus]): Option[StructType] = {
+throw new UnsupportedOperationException(s"inferSchema is not supported 
for hive data source.")
+  }
+
+  override def prepareWrite(
+  sparkSession: SparkSession,
+  job: Job,
+  options: Map[String, String],
+  dataSchema: StructType): OutputWriterFactory = {
+val conf = job.getConfiguration
+val tableDesc = fileSinkConf.getTableInfo
+conf.set("mapred.output.format.class", 
tableDesc.getOutputFileFormatClassName)
+
+// Add table properties from storage handler to hadoopConf, so any 
custom storage
+// handler settings can be set to hadoopConf
+HiveTableUtil.configureJobPropertiesForStorageHandler(tableDesc, conf, 
false)
+Utilities.copyTableJobPropertiesToConf(tableDesc, conf)
--- End diff --

Will tableDesc be null?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...

2017-01-17 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/16517#discussion_r96566171
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
@@ -86,6 +86,47 @@ class DetermineHiveSerde(conf: SQLConf) extends 
Rule[LogicalPlan] {
   }
 }
 
+class HiveAnalysis(session: SparkSession) extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan 
resolveOperators {
+case InsertIntoTable(table: MetastoreRelation, partSpec, query, 
overwrite, ifNotExists)
+if hasBeenPreprocessed(table.output, 
table.partitionKeys.toStructType, partSpec, query) =>
+  InsertIntoHiveTable(table, partSpec, query, overwrite, ifNotExists)
+
+case CreateTable(tableDesc, mode, Some(query)) if 
DDLUtils.isHiveTable(tableDesc) =>
+  // Currently `DataFrameWriter.saveAsTable` doesn't support the 
Append mode of hive serde
+  // tables yet.
+  if (mode == SaveMode.Append) {
+throw new AnalysisException(
+  "CTAS for hive serde tables does not support append semantics.")
+  }
+
+  val dbName = 
tableDesc.identifier.database.getOrElse(session.catalog.currentDatabase)
+  CreateHiveTableAsSelectCommand(
+tableDesc.copy(identifier = tableDesc.identifier.copy(database = 
Some(dbName))),
+query,
+mode == SaveMode.Ignore)
+  }
+
+  /**
+   * Returns true if the [[InsertIntoTable]] plan has already been 
preprocessed by analyzer rule
+   * [[PreprocessTableInsertion]]. It is important that this 
rule([[HiveAnalysis]]) has to
+   * be run after [[PreprocessTableInsertion]], to normalize the column 
names in partition spec and
--- End diff --

I think this version is good for now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...

2017-01-17 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/16517#discussion_r96549456
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala
 ---
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import scala.collection.JavaConverters._
+
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.hive.ql.exec.Utilities
+import org.apache.hadoop.hive.ql.io.{HiveFileFormatUtils, HiveOutputFormat}
+import org.apache.hadoop.hive.serde2.Serializer
+import 
org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspectorUtils, 
StructObjectInspector}
+import 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.ObjectInspectorCopyOption
+import org.apache.hadoop.io.Writable
+import org.apache.hadoop.mapred.{JobConf, Reporter}
+import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext}
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.datasources.{FileFormat, 
OutputWriter, OutputWriterFactory}
+import org.apache.spark.sql.hive.{HiveInspectors, HiveTableUtil}
+import org.apache.spark.sql.hive.HiveShim.{ShimFileSinkDesc => 
FileSinkDesc}
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.SerializableJobConf
+
+/**
+ * `FileFormat` for writing Hive tables.
+ *
+ * TODO: implement the read logic.
+ */
+class HiveFileFormat(fileSinkConf: FileSinkDesc) extends FileFormat {
+  override def inferSchema(
+  sparkSession: SparkSession,
+  options: Map[String, String],
+  files: Seq[FileStatus]): Option[StructType] = None
--- End diff --

ok. Let's throw an exception at here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16628: Update known_translations for contributor names

2017-01-17 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16628
  
cc @lw-lin 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16628: Update known_translations for contributor names

2017-01-17 Thread yhuai

GitHub user yhuai opened a pull request:

https://github.com/apache/spark/pull/16628

Update known_translations for contributor names

## What changes were proposed in this pull request?
Update known_translations per 
https://github.com/apache/spark/pull/16423#issuecomment-269739634


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yhuai/spark known_translations

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16628.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16628


commit bf2f532a8b7ef60d1542415c0d6dacd8571e7bef
Author: Yin Huai <yh...@databricks.com>
Date:   2017-01-18T00:50:08Z

Update known_translations for contributor names




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16423: Update known_translations for contributor names and also...

2017-01-17 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16423
  
Sure.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...

2017-01-16 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/16517#discussion_r96316695
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
@@ -86,6 +86,47 @@ class DetermineHiveSerde(conf: SQLConf) extends 
Rule[LogicalPlan] {
   }
 }
 
+class HiveAnalysis(session: SparkSession) extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan 
resolveOperators {
+case InsertIntoTable(table: MetastoreRelation, partSpec, query, 
overwrite, ifNotExists)
+if hasBeenPreprocessed(table.output, 
table.partitionKeys.toStructType, partSpec, query) =>
+  InsertIntoHiveTable(table, partSpec, query, overwrite, ifNotExists)
+
+case CreateTable(tableDesc, mode, Some(query)) if 
DDLUtils.isHiveTable(tableDesc) =>
+  // Currently `DataFrameWriter.saveAsTable` doesn't support the 
Append mode of hive serde
+  // tables yet.
+  if (mode == SaveMode.Append) {
+throw new AnalysisException(
+  "CTAS for hive serde tables does not support append semantics.")
+  }
+
+  val dbName = 
tableDesc.identifier.database.getOrElse(session.catalog.currentDatabase)
+  CreateHiveTableAsSelectCommand(
+tableDesc.copy(identifier = tableDesc.identifier.copy(database = 
Some(dbName))),
+query,
+mode == SaveMode.Ignore)
+  }
+
+  /**
+   * Returns true if the [[InsertIntoTable]] plan has already been 
preprocessed by analyzer rule
+   * [[PreprocessTableInsertion]]. It is important that this 
rule([[HiveAnalysis]]) has to
+   * be run after [[PreprocessTableInsertion]], to normalize the column 
names in partition spec and
--- End diff --

This rule is in the same batch with PreprocessTableInsertion, right? If so, 
we cannot guarantee that PreprocessTableInsertion will always fire first for a 
command before InsertIntoTable.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...

2017-01-16 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/16517#discussion_r96317272
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala
 ---
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import scala.collection.JavaConverters._
+
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.hive.ql.exec.Utilities
+import org.apache.hadoop.hive.ql.io.{HiveFileFormatUtils, HiveOutputFormat}
+import org.apache.hadoop.hive.serde2.Serializer
+import 
org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspectorUtils, 
StructObjectInspector}
+import 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.ObjectInspectorCopyOption
+import org.apache.hadoop.io.Writable
+import org.apache.hadoop.mapred.{JobConf, Reporter}
+import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext}
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.datasources.{FileFormat, 
OutputWriter, OutputWriterFactory}
+import org.apache.spark.sql.hive.{HiveInspectors, HiveTableUtil}
+import org.apache.spark.sql.hive.HiveShim.{ShimFileSinkDesc => 
FileSinkDesc}
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.SerializableJobConf
+
+/**
+ * `FileFormat` for writing Hive tables.
+ *
+ * TODO: implement the read logic.
+ */
+class HiveFileFormat(fileSinkConf: FileSinkDesc) extends FileFormat {
+  override def inferSchema(
+  sparkSession: SparkSession,
+  options: Map[String, String],
+  files: Seq[FileStatus]): Option[StructType] = None
+
+  override def prepareWrite(
+  sparkSession: SparkSession,
+  job: Job,
+  options: Map[String, String],
+  dataSchema: StructType): OutputWriterFactory = {
+val conf = job.getConfiguration
+val tableDesc = fileSinkConf.getTableInfo
+conf.set("mapred.output.format.class", 
tableDesc.getOutputFileFormatClassName)
+
+// Add table properties from storage handler to hadoopConf, so any 
custom storage
+// handler settings can be set to hadoopConf
+HiveTableUtil.configureJobPropertiesForStorageHandler(tableDesc, conf, 
false)
+Utilities.copyTableJobPropertiesToConf(tableDesc, conf)
+
+// Avoid referencing the outer object.
+val fileSinkConfSer = fileSinkConf
+new OutputWriterFactory {
+  private val jobConf = new SerializableJobConf(new JobConf(conf))
+  @transient private lazy val outputFormat =
+
jobConf.value.getOutputFormat.asInstanceOf[HiveOutputFormat[AnyRef, Writable]]
+
+  override def getFileExtension(context: TaskAttemptContext): String = 
{
+Utilities.getFileExtension(jobConf.value, 
fileSinkConfSer.getCompressed, outputFormat)
+  }
+
+  override def newInstance(
+  path: String,
+  dataSchema: StructType,
+  context: TaskAttemptContext): OutputWriter = {
+new HiveOutputWriter(path, fileSinkConfSer, jobConf.value, 
dataSchema)
+  }
+}
--- End diff --

Should we just create a class instead of using an anonymous class?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...

2017-01-16 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/16517#discussion_r96317211
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala
 ---
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import scala.collection.JavaConverters._
+
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.hive.ql.exec.Utilities
+import org.apache.hadoop.hive.ql.io.{HiveFileFormatUtils, HiveOutputFormat}
+import org.apache.hadoop.hive.serde2.Serializer
+import 
org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspectorUtils, 
StructObjectInspector}
+import 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.ObjectInspectorCopyOption
+import org.apache.hadoop.io.Writable
+import org.apache.hadoop.mapred.{JobConf, Reporter}
+import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext}
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.datasources.{FileFormat, 
OutputWriter, OutputWriterFactory}
+import org.apache.spark.sql.hive.{HiveInspectors, HiveTableUtil}
+import org.apache.spark.sql.hive.HiveShim.{ShimFileSinkDesc => 
FileSinkDesc}
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.SerializableJobConf
+
+/**
+ * `FileFormat` for writing Hive tables.
+ *
+ * TODO: implement the read logic.
+ */
+class HiveFileFormat(fileSinkConf: FileSinkDesc) extends FileFormat {
+  override def inferSchema(
+  sparkSession: SparkSession,
+  options: Map[String, String],
+  files: Seq[FileStatus]): Option[StructType] = None
+
+  override def prepareWrite(
+  sparkSession: SparkSession,
+  job: Job,
+  options: Map[String, String],
+  dataSchema: StructType): OutputWriterFactory = {
--- End diff --

Want to comment the original source of code in this function?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...

2017-01-16 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/16517#discussion_r96316863
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
@@ -86,6 +86,47 @@ class DetermineHiveSerde(conf: SQLConf) extends 
Rule[LogicalPlan] {
   }
 }
 
+class HiveAnalysis(session: SparkSession) extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan 
resolveOperators {
+case InsertIntoTable(table: MetastoreRelation, partSpec, query, 
overwrite, ifNotExists)
+if hasBeenPreprocessed(table.output, 
table.partitionKeys.toStructType, partSpec, query) =>
+  InsertIntoHiveTable(table, partSpec, query, overwrite, ifNotExists)
+
+case CreateTable(tableDesc, mode, Some(query)) if 
DDLUtils.isHiveTable(tableDesc) =>
+  // Currently `DataFrameWriter.saveAsTable` doesn't support the 
Append mode of hive serde
+  // tables yet.
+  if (mode == SaveMode.Append) {
+throw new AnalysisException(
+  "CTAS for hive serde tables does not support append semantics.")
+  }
+
+  val dbName = 
tableDesc.identifier.database.getOrElse(session.catalog.currentDatabase)
+  CreateHiveTableAsSelectCommand(
+tableDesc.copy(identifier = tableDesc.identifier.copy(database = 
Some(dbName))),
+query,
+mode == SaveMode.Ignore)
+  }
+
+  /**
+   * Returns true if the [[InsertIntoTable]] plan has already been 
preprocessed by analyzer rule
+   * [[PreprocessTableInsertion]]. It is important that this 
rule([[HiveAnalysis]]) has to
+   * be run after [[PreprocessTableInsertion]], to normalize the column 
names in partition spec and
--- End diff --

Or, you mean that we use this function to determine if 
PreprocessTableInsertion has fired?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...

2017-01-16 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/16517#discussion_r96316302
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala ---
@@ -108,35 +108,30 @@ class QueryExecution(val sparkSession: SparkSession, 
val logical: LogicalPlan) {
 
 
   /**
-   * Returns the result as a hive compatible sequence of strings.  For 
native commands, the
-   * execution is simply passed back to Hive.
+   * Returns the result as a hive compatible sequence of strings. This is 
for testing only.
*/
   def hiveResultString(): Seq[String] = executedPlan match {
 case ExecutedCommandExec(desc: DescribeTableCommand) =>
-  SQLExecution.withNewExecutionId(sparkSession, this) {
--- End diff --

Explain the reason that `SQLExecution.withNewExecutionId(sparkSession, 
this)` is not needed?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...

2017-01-16 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/16517#discussion_r96316982
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
@@ -86,6 +86,47 @@ class DetermineHiveSerde(conf: SQLConf) extends 
Rule[LogicalPlan] {
   }
 }
 
+class HiveAnalysis(session: SparkSession) extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan 
resolveOperators {
+case InsertIntoTable(table: MetastoreRelation, partSpec, query, 
overwrite, ifNotExists)
+if hasBeenPreprocessed(table.output, 
table.partitionKeys.toStructType, partSpec, query) =>
+  InsertIntoHiveTable(table, partSpec, query, overwrite, ifNotExists)
+
+case CreateTable(tableDesc, mode, Some(query)) if 
DDLUtils.isHiveTable(tableDesc) =>
+  // Currently `DataFrameWriter.saveAsTable` doesn't support the 
Append mode of hive serde
+  // tables yet.
+  if (mode == SaveMode.Append) {
+throw new AnalysisException(
+  "CTAS for hive serde tables does not support append semantics.")
+  }
+
+  val dbName = 
tableDesc.identifier.database.getOrElse(session.catalog.currentDatabase)
+  CreateHiveTableAsSelectCommand(
+tableDesc.copy(identifier = tableDesc.identifier.copy(database = 
Some(dbName))),
+query,
+mode == SaveMode.Ignore)
+  }
+
+  /**
+   * Returns true if the [[InsertIntoTable]] plan has already been 
preprocessed by analyzer rule
+   * [[PreprocessTableInsertion]]. It is important that this 
rule([[HiveAnalysis]]) has to
+   * be run after [[PreprocessTableInsertion]], to normalize the column 
names in partition spec and
--- End diff --

Should this function actually be part of the resolved method of 
InsertIntoTable?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...

2017-01-16 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/16517#discussion_r96317150
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala
 ---
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import scala.collection.JavaConverters._
+
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.hive.ql.exec.Utilities
+import org.apache.hadoop.hive.ql.io.{HiveFileFormatUtils, HiveOutputFormat}
+import org.apache.hadoop.hive.serde2.Serializer
+import 
org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspectorUtils, 
StructObjectInspector}
+import 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.ObjectInspectorCopyOption
+import org.apache.hadoop.io.Writable
+import org.apache.hadoop.mapred.{JobConf, Reporter}
+import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext}
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.datasources.{FileFormat, 
OutputWriter, OutputWriterFactory}
+import org.apache.spark.sql.hive.{HiveInspectors, HiveTableUtil}
+import org.apache.spark.sql.hive.HiveShim.{ShimFileSinkDesc => 
FileSinkDesc}
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.SerializableJobConf
+
+/**
+ * `FileFormat` for writing Hive tables.
+ *
+ * TODO: implement the read logic.
+ */
+class HiveFileFormat(fileSinkConf: FileSinkDesc) extends FileFormat {
+  override def inferSchema(
+  sparkSession: SparkSession,
+  options: Map[String, String],
+  files: Seq[FileStatus]): Option[StructType] = None
--- End diff --

Is it safe to return None?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16528: [SPARK-19148][SQL] do not expose the external table conc...

2017-01-16 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16528
  
looks good to me. If possible, I'd like to get 
https://github.com/apache/spark/pull/16528/files#r96314156 reverted. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16528: [SPARK-19148][SQL] do not expose the external tab...

2017-01-16 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/16528#discussion_r96314156
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala
 ---
@@ -131,17 +131,15 @@ object GenerateOrdering extends 
CodeGenerator[Seq[SortOrder], Ordering[InternalR
   return 0;
 """
   },
-  foldFunctions = { funCalls =>
-funCalls.zipWithIndex.map { case (funCall, i) =>
-  val comp = ctx.freshName("comp")
-  s"""
-int $comp = $funCall;
-if ($comp != 0) {
-  return $comp;
-}
-  """
-}.mkString
-  })
+  foldFunctions = _.map { funCall =>
--- End diff --

How about we revert this change?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16561: [SPARK-18801][SQL][FOLLOWUP] Alias the view with ...

2017-01-16 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/16561#discussion_r96313495
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala 
---
@@ -28,22 +28,60 @@ import org.apache.spark.sql.catalyst.rules.Rule
  */
 
 /**
- * Make sure that a view's child plan produces the view's output 
attributes. We wrap the child
- * with a Project and add an alias for each output attribute. The 
attributes are resolved by
- * name. This should be only done after the batch of Resolution, because 
the view attributes are
- * not completely resolved during the batch of Resolution.
+ * Make sure that a view's child plan produces the view's output 
attributes. We try to wrap the
+ * child by:
+ * 1. Generate the `queryOutput` by:
+ *1.1. If the query column names are defined, map the column names to 
attributes in the child
+ * output by name(This is mostly for handling view queries like 
SELECT * FROM ..., the
+ * schema of the referenced table/view may change after the view 
has been created, so we
+ * have to save the output of the query to `viewQueryColumnNames`, 
and restore them during
+ * view resolution, in this way, we are able to get the correct 
view column ordering and
+ * omit the extra columns that we don't require);
+ *1.2. Else set the child output attributes to `queryOutput`.
+ * 2. Map the `queryQutput` to view output by index, if the corresponding 
attributes don't match,
+ *try to up cast and alias the attribute in `queryOutput` to the 
attribute in the view output.
+ * 3. Add a Project over the child, with the new output generated by the 
previous steps.
+ * If the view output doesn't have the same number of columns neither with 
the child output, nor
+ * with the query column names, throw an AnalysisException.
+ *
+ * This should be only done after the batch of Resolution, because the 
view attributes are not
+ * completely resolved during the batch of Resolution.
  */
 case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan 
resolveOperators {
-case v @ View(_, output, child) if child.resolved =>
+case v @ View(desc, output, child) if child.resolved && output != 
child.output =>
   val resolver = conf.resolver
-  val newOutput = output.map { attr =>
-val originAttr = findAttributeByName(attr.name, child.output, 
resolver)
-// The dataType of the output attributes may be not the same with 
that of the view output,
-// so we should cast the attribute to the dataType of the view 
output attribute. If the
-// cast can't perform, will throw an AnalysisException.
-Alias(Cast(originAttr, attr.dataType), attr.name)(exprId = 
attr.exprId,
-  qualifier = attr.qualifier, explicitMetadata = 
Some(attr.metadata))
+  val queryColumnNames = desc.viewQueryColumnNames
+  val queryOutput = if (queryColumnNames.nonEmpty) {
+// If the view output doesn't have the same number of columns with 
the query column names,
+// throw an AnalysisException.
+if (output.length != queryColumnNames.length) {
+  throw new AnalysisException(
+s"The view output ${output.mkString("[", ",", "]")} doesn't 
have the same number of " +
+  s"columns with the query column names 
${queryColumnNames.mkString("[", ",", "]")}")
+}
+desc.viewQueryColumnNames.map { colName =>
+  findAttributeByName(colName, child.output, resolver)
+}
+  } else {
+// For view created before Spark 2.2.0, the view text is already 
fully qualified, the plan
+// output is the same with the view output.
+child.output
+  }
+  // Map the attributes in the query output to the attributes in the 
view output by index.
+  val newOutput = output.zip(queryOutput).map {
--- End diff --

Seems we need to check the size of `output` and `queryOutput`. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16597: [SPARK-19240][SQL] SET LOCATION is not allowed for table...

2017-01-16 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16597
  
I am not sure if it is worth breaking this behavior. If the table is a 
managed table, it is possible that existing behavior allows users to move a 
table from one managed place to another managed place (e.g. the location of a 
database is changed). It is not clear that breaking this behavior can give us 
any benefit.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16568: [SPARK-18971][Core]Upgrade Netty to 4.0.43.Final

2017-01-13 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16568
  
test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16233: [SPARK-18801][SQL] Support resolve a nested view

2017-01-12 Thread yhuai

Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16233
  
@jiangxb1987 Once jira is back, let's create jiras to address follow-up 
issues (probably you have already done that before jira went down).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 5214 matches

Mail list logo