[GitHub] spark issue #19429: [SPARK-20055] [Docs] Added documentation for loading csv...

2017-10-09 Thread jomach
Github user jomach commented on the issue:

https://github.com/apache/spark/pull/19429
  
@gatorsmile I addressed your comments. Still, I cannot run the Jekyll build:
```
SKIP_API=1 jekyll build --incremental
Configuration file: /Users/jorge/Downloads/spark/docs/_config.yml
   Deprecation: The 'gems' configuration option has been renamed to 'plugins'. Please update your config file accordingly.
Source: /Users/jorge/Downloads/spark/docs
   Destination: /Users/jorge/Downloads/spark/docs/_site
 Incremental build: enabled
  Generating...
  Liquid Exception: invalid byte sequence in US-ASCII in _layouts/redirect.html
```
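
The `invalid byte sequence in US-ASCII` error usually means Ruby is running under an ASCII locale; a common workaround (an assumption here, not verified on this machine) is to force a UTF-8 locale before building, e.g. `LC_ALL=en_US.UTF-8 LANG=en_US.UTF-8 SKIP_API=1 jekyll build`.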


---




[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...

2017-10-09 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19459#discussion_r143633890
  
--- Diff: python/pyspark/sql/session.py ---
@@ -510,9 +511,43 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=True):
 except Exception:
 has_pandas = False
 if has_pandas and isinstance(data, pandas.DataFrame):
-if schema is None:
-schema = [str(x) for x in data.columns]
-data = [r.tolist() for r in data.to_records(index=False)]
+if self.conf.get("spark.sql.execution.arrow.enable", "false").lower() == "true" \
--- End diff --

The config name was modified to `spark.sql.execution.arrow.enabled` at 
d29d1e87995e02cb57ba3026c945c3cd66bb06e2 and 
af8a34c787dc3d68f5148a7d9975b52650bb7729.
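
For reference, a minimal sketch of the same check with the renamed flag (assuming an active `SparkSession` named `spark`):

```python
# Sketch only: the renamed config, read the same way the diff above does.
use_arrow = spark.conf.get("spark.sql.execution.arrow.enabled", "false").lower() == "true"
```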


---




[GitHub] spark pull request #19462: [SPARK-22159][SQL][FOLLOW-UP] Make config names c...

2017-10-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19462


---




[GitHub] spark issue #19462: [SPARK-22159][SQL][FOLLOW-UP] Make config names consiste...

2017-10-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19462
  
Thanks! Merged to master.


---




[GitHub] spark issue #19462: [SPARK-22159][SQL][FOLLOW-UP] Make config names consiste...

2017-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19462
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19462: [SPARK-22159][SQL][FOLLOW-UP] Make config names consiste...

2017-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19462
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82575/
Test PASSed.


---




[GitHub] spark issue #19462: [SPARK-22159][SQL][FOLLOW-UP] Make config names consiste...

2017-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19462
  
**[Test build #82575 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82575/testReport)**
 for PR 19462 at commit 
[`5bef05e`](https://github.com/apache/spark/commit/5bef05e3d84805866103766f6287ecb054dcad68).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r143630635
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -3376,6 +3376,151 @@ def test_vectorized_udf_empty_partition(self):
 res = df.select(f(col('id')))
 self.assertEquals(df.collect(), res.collect())
 
+def test_vectorized_udf_varargs(self):
+from pyspark.sql.functions import pandas_udf, col
+df = self.spark.createDataFrame(self.sc.parallelize([Row(id=1)], 2))
+f = pandas_udf(lambda *v: v[0], LongType())
+res = df.select(f(col('id')))
+self.assertEquals(df.collect(), res.collect())
+
+
+@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
+class GroupbyApplyTests(ReusedPySparkTestCase):
+@classmethod
+def setUpClass(cls):
+ReusedPySparkTestCase.setUpClass()
+cls.spark = SparkSession(cls.sc)
+
+@classmethod
+def tearDownClass(cls):
+ReusedPySparkTestCase.tearDownClass()
+cls.spark.stop()
+
+def assertFramesEqual(self, expected, result):
+msg = ("DataFrames are not equal: " +
+   ("\n\nExpected:\n%s\n%s" % (expected, expected.dtypes)) +
+   ("\n\nResult:\n%s\n%s" % (result, result.dtypes)))
+self.assertTrue(expected.equals(result), msg=msg)
+
+@property
+def data(self):
+from pyspark.sql.functions import array, explode, col, lit
+return self.spark.range(10).toDF('id') \
+.withColumn("vs", array([lit(i) for i in range(20, 30)])) \
+.withColumn("v", explode(col('vs'))).drop('vs')
+
+def test_simple(self):
+from pyspark.sql.functions import pandas_udf
+df = self.data
+
+foo_udf = pandas_udf(
+lambda df: df.assign(v1=df.v * df.id * 1.0, v2=df.v + df.id),
+StructType(
+[StructField('id', LongType()),
+ StructField('v', IntegerType()),
+ StructField('v1', DoubleType()),
+ StructField('v2', LongType())]))
+
+result = df.groupby('id').apply(foo_udf).sort('id').toPandas()
+expected = df.toPandas().groupby('id').apply(foo_udf.func).reset_index(drop=True)
+self.assertFramesEqual(expected, result)
+
+def test_decorator(self):
+from pyspark.sql.functions import pandas_udf
+df = self.data
+
+@pandas_udf(StructType(
+[StructField('id', LongType()),
+ StructField('v', IntegerType()),
+ StructField('v1', DoubleType()),
+ StructField('v2', LongType())]))
+def foo(df):
+return df.assign(v1=df.v * df.id * 1.0, v2=df.v + df.id)
+
+result = df.groupby('id').apply(foo).sort('id').toPandas()
+expected = df.toPandas().groupby('id').apply(foo.func).reset_index(drop=True)
+self.assertFramesEqual(expected, result)
+
+def test_coerce(self):
+from pyspark.sql.functions import pandas_udf
+df = self.data
+
+foo = pandas_udf(
+lambda df: df,
+StructType([StructField('id', LongType()), StructField('v', DoubleType())]))
+
+result = df.groupby('id').apply(foo).sort('id').toPandas()
+expected = df.toPandas().groupby('id').apply(foo.func).reset_index(drop=True)
+expected = expected.assign(v=expected.v.astype('float64'))
+self.assertFramesEqual(expected, result)
+
+def test_complex_groupby(self):
+from pyspark.sql.functions import pandas_udf, col
+df = self.data
+
+@pandas_udf(StructType(
+[StructField('id', LongType()),
+ StructField('v', IntegerType()),
+ StructField('norm', DoubleType())]))
+def normalize(pdf):
+v = pdf.v
+return pdf.assign(norm=(v - v.mean()) / v.std())
+
+result = df.groupby(col('id') % 2 == 0).apply(normalize).sort('id', 'v').toPandas()
+pdf = df.toPandas()
+expected = pdf.groupby(pdf['id'] % 2 == 0).apply(normalize.func)
+expected = expected.sort_values(['id', 'v']).reset_index(drop=True)
+expected = expected.assign(norm=expected.norm.astype('float64'))
+self.assertFramesEqual(expected, result)
+
+def test_empty_groupby(self):
+from pyspark.sql.functions import pandas_udf, col
+df = self.data
+
+@pandas_udf(StructType(
+[StructField('id', LongType()),
+ StructField('v', IntegerType()),
+   

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r143630813
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala
 ---
@@ -519,3 +519,4 @@ case class CoGroup(
 outputObjAttr: Attribute,
 left: LogicalPlan,
 right: LogicalPlan) extends BinaryNode with ObjectProducer
+
--- End diff --

little nit: let's remove other changes here.


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r143630505
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -3376,6 +3376,151 @@ def test_vectorized_udf_empty_partition(self):
 res = df.select(f(col('id')))
 self.assertEquals(df.collect(), res.collect())
 
+def test_vectorized_udf_varargs(self):
+from pyspark.sql.functions import pandas_udf, col
+df = self.spark.createDataFrame(self.sc.parallelize([Row(id=1)], 2))
+f = pandas_udf(lambda *v: v[0], LongType())
+res = df.select(f(col('id')))
+self.assertEquals(df.collect(), res.collect())
+
+
+@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
+class GroupbyApplyTests(ReusedPySparkTestCase):
+@classmethod
+def setUpClass(cls):
+ReusedPySparkTestCase.setUpClass()
+cls.spark = SparkSession(cls.sc)
+
+@classmethod
+def tearDownClass(cls):
+ReusedPySparkTestCase.tearDownClass()
+cls.spark.stop()
+
+def assertFramesEqual(self, expected, result):
+msg = ("DataFrames are not equal: " +
+   ("\n\nExpected:\n%s\n%s" % (expected, expected.dtypes)) +
+   ("\n\nResult:\n%s\n%s" % (result, result.dtypes)))
+self.assertTrue(expected.equals(result), msg=msg)
+
+@property
+def data(self):
+from pyspark.sql.functions import array, explode, col, lit
+return self.spark.range(10).toDF('id') \
+.withColumn("vs", array([lit(i) for i in range(20, 30)])) \
+.withColumn("v", explode(col('vs'))).drop('vs')
+
+def test_simple(self):
+from pyspark.sql.functions import pandas_udf
+df = self.data
+
+foo_udf = pandas_udf(
+lambda df: df.assign(v1=df.v * df.id * 1.0, v2=df.v + df.id),
+StructType(
+[StructField('id', LongType()),
+ StructField('v', IntegerType()),
+ StructField('v1', DoubleType()),
+ StructField('v2', LongType())]))
+
+result = df.groupby('id').apply(foo_udf).sort('id').toPandas()
+expected = df.toPandas().groupby('id').apply(foo_udf.func).reset_index(drop=True)
+self.assertFramesEqual(expected, result)
+
+def test_decorator(self):
+from pyspark.sql.functions import pandas_udf
+df = self.data
+
+@pandas_udf(StructType(
+[StructField('id', LongType()),
+ StructField('v', IntegerType()),
+ StructField('v1', DoubleType()),
+ StructField('v2', LongType())]))
+def foo(df):
+return df.assign(v1=df.v * df.id * 1.0, v2=df.v + df.id)
+
+result = df.groupby('id').apply(foo).sort('id').toPandas()
+expected = df.toPandas().groupby('id').apply(foo.func).reset_index(drop=True)
+self.assertFramesEqual(expected, result)
+
+def test_coerce(self):
+from pyspark.sql.functions import pandas_udf
+df = self.data
+
+foo = pandas_udf(
+lambda df: df,
--- End diff --

ditto: `df` -> `pdf`


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r143629848
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2181,30 +2187,66 @@ def udf(f=None, returnType=StringType()):
 @since(2.3)
 def pandas_udf(f=None, returnType=StringType()):
 """
-Creates a :class:`Column` expression representing a user defined function (UDF) that accepts
-`Pandas.Series` as input arguments and outputs a `Pandas.Series` of the same length.
+Creates a vectorized user defined function (UDF).
 
-:param f: python function if used as a standalone function
+:param f: user-defined function. A python function if used as a standalone function
 :param returnType: a :class:`pyspark.sql.types.DataType` object
 
->>> from pyspark.sql.types import IntegerType, StringType
->>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())
->>> @pandas_udf(returnType=StringType())
-... def to_upper(s):
-...     return s.str.upper()
-...
->>> @pandas_udf(returnType="integer")
-... def add_one(x):
-...     return x + 1
-...
->>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age"))
->>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\
-...     .show()  # doctest: +SKIP
-+----------+--------------+------------+
-|slen(name)|to_upper(name)|add_one(age)|
-+----------+--------------+------------+
-|         8|      JOHN DOE|          22|
-+----------+--------------+------------+
+The user-defined function can define one of the following transformations:
+
+1. One or more `pandas.Series` -> A `pandas.Series`
+
+   This udf is used with :meth:`pyspark.sql.DataFrame.withColumn` and
+   :meth:`pyspark.sql.DataFrame.select`.
+   The returnType should be a primitive data type, e.g., `DoubleType()`.
+   The length of the returned `pandas.Series` must be the same as that of the input `pandas.Series`.
+
+   >>> from pyspark.sql.types import IntegerType, StringType
+   >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())
+   >>> @pandas_udf(returnType=StringType())
+   ... def to_upper(s):
+   ...     return s.str.upper()
+   ...
+   >>> @pandas_udf(returnType="integer")
+   ... def add_one(x):
+   ...     return x + 1
+   ...
+   >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age"))
+   >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\
+   ...     .show()  # doctest: +SKIP
+   +----------+--------------+------------+
+   |slen(name)|to_upper(name)|add_one(age)|
+   +----------+--------------+------------+
+   |         8|      JOHN DOE|          22|
+   +----------+--------------+------------+
+
+2. A `pandas.DataFrame` -> A `pandas.DataFrame`
+
+   This udf is used with :meth:`pyspark.sql.GroupedData.apply`.
--- End diff --

Maybe `This udf is used with` -> `This udf is only used with`, or probably we should add a `note` here. If I didn't know the context, I'd wonder why it does not work as a normal pandas udf.
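
For context, a minimal sketch of the `pandas.DataFrame` -> `pandas.DataFrame` variant, based on this PR's own tests (assumes a DataFrame `df` with columns `id` and `v`):

```python
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StructType, StructField, LongType, DoubleType

# Grouped-map udf as proposed in #18732: takes a pandas.DataFrame per
# group and returns a pandas.DataFrame matching the declared schema.
@pandas_udf(StructType([StructField('id', LongType()),
                        StructField('v', DoubleType())]))
def normalize(pdf):
    v = pdf.v
    return pdf.assign(v=(v - v.mean()) / v.std())

df.groupby('id').apply(normalize)  # only meaningful with GroupedData.apply
```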


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r143630469
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -3376,6 +3376,151 @@ def test_vectorized_udf_empty_partition(self):
 res = df.select(f(col('id')))
 self.assertEquals(df.collect(), res.collect())
 
+def test_vectorized_udf_varargs(self):
+from pyspark.sql.functions import pandas_udf, col
+df = self.spark.createDataFrame(self.sc.parallelize([Row(id=1)], 2))
+f = pandas_udf(lambda *v: v[0], LongType())
+res = df.select(f(col('id')))
+self.assertEquals(df.collect(), res.collect())
+
+
+@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
+class GroupbyApplyTests(ReusedPySparkTestCase):
+@classmethod
+def setUpClass(cls):
+ReusedPySparkTestCase.setUpClass()
+cls.spark = SparkSession(cls.sc)
+
+@classmethod
+def tearDownClass(cls):
+ReusedPySparkTestCase.tearDownClass()
+cls.spark.stop()
+
+def assertFramesEqual(self, expected, result):
+msg = ("DataFrames are not equal: " +
+   ("\n\nExpected:\n%s\n%s" % (expected, expected.dtypes)) +
+   ("\n\nResult:\n%s\n%s" % (result, result.dtypes)))
+self.assertTrue(expected.equals(result), msg=msg)
+
+@property
+def data(self):
+from pyspark.sql.functions import array, explode, col, lit
+return self.spark.range(10).toDF('id') \
+.withColumn("vs", array([lit(i) for i in range(20, 30)])) \
+.withColumn("v", explode(col('vs'))).drop('vs')
+
+def test_simple(self):
+from pyspark.sql.functions import pandas_udf
+df = self.data
+
+foo_udf = pandas_udf(
+lambda df: df.assign(v1=df.v * df.id * 1.0, v2=df.v + df.id),
+StructType(
+[StructField('id', LongType()),
+ StructField('v', IntegerType()),
+ StructField('v1', DoubleType()),
+ StructField('v2', LongType())]))
+
+result = df.groupby('id').apply(foo_udf).sort('id').toPandas()
+expected = df.toPandas().groupby('id').apply(foo_udf.func).reset_index(drop=True)
+self.assertFramesEqual(expected, result)
+
+def test_decorator(self):
+from pyspark.sql.functions import pandas_udf
+df = self.data
+
+@pandas_udf(StructType(
+[StructField('id', LongType()),
+ StructField('v', IntegerType()),
+ StructField('v1', DoubleType()),
+ StructField('v2', LongType())]))
+def foo(df):
+return df.assign(v1=df.v * df.id * 1.0, v2=df.v + df.id)
--- End diff --

little nit: I'd call it `pdf`, partly to avoid shadowing `df` and partly to indicate `pd.DataFrame`.


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r143630939
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/pythonLogicalOperators.scala
 ---
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans.logical
+
+import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeSet, 
Expression}
+
+/**
+ * Logical nodes specific to PySpark.
+ */
--- End diff --

little nit: I'd remove this comment. I think the name already implies what 
this file contains.


---




[GitHub] spark issue #19463: Cleanup comment in RDDSuite test

2017-10-09 Thread sohum2002
Github user sohum2002 commented on the issue:

https://github.com/apache/spark/pull/19463
  
I just added "Removed one comment from RDDSuite." to the PR description. 
Will this suffice?


---




[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-09 Thread akopich
Github user akopich commented on the issue:

https://github.com/apache/spark/pull/18924
  
@WeichenXu123, could you please notify @jkbradley once again?


---




[GitHub] spark pull request #18966: [SPARK-21751][SQL] CodeGeneraor.splitExpressions ...

2017-10-09 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18966#discussion_r143629417
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -769,16 +769,21 @@ class CodegenContext {
  foldFunctions: Seq[String] => String = _.mkString("", ";\n", ";")): String = {
 val blocks = new ArrayBuffer[String]()
 val blockBuilder = new StringBuilder()
+val maxLines = SQLConf.get.maxCodegenLinesPerFunction
--- End diff --

@kiszk You know, I am just afraid a new regression could be introduced by this change. Sorry for the delay; I really do not have a better solution. I kind of agree with your original solution: just exclude the characters in comments. At least it becomes better and takes less risk of hitting a regression.


---




[GitHub] spark pull request #18966: [SPARK-21751][SQL] CodeGeneraor.splitExpressions ...

2017-10-09 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/18966#discussion_r143628760
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -769,16 +769,21 @@ class CodegenContext {
  foldFunctions: Seq[String] => String = _.mkString("", ";\n", ";")): String = {
 val blocks = new ArrayBuffer[String]()
 val blockBuilder = new StringBuilder()
+val maxLines = SQLConf.get.maxCodegenLinesPerFunction
--- End diff --

@gatorsmile Since making it configurable [takes a long time](https://github.com/apache/spark/pull/19449#discussion_r143385878), can we do it with a hard-coded parameter?
Even in this case, this PR is an improvement, since the estimation no longer includes the characters in comments.
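
To illustrate the idea (plain Python for illustration, not Spark's Scala implementation):

```python
# Hypothetical helper: estimate a generated-code block's size while
# skipping comment-only lines, as discussed above.
def effective_length(code: str) -> int:
    return sum(len(line) for line in code.splitlines()
               if not line.lstrip().startswith(("//", "/*", "*")))
```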


---




[GitHub] spark pull request #19460: [SPARK-22222][core] Fix the ARRAY_MAX in BufferHo...

2017-10-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19460


---




[GitHub] spark issue #19460: [SPARK-22222][core] Fix the ARRAY_MAX in BufferHolder an...

2017-10-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19460
  
Thanks! Merged to master.


---




[GitHub] spark issue #19363: [SPARK-22224][Minor]Override toString of KeyValue/Relati...

2017-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19363
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19363: [SPARK-22224][Minor]Override toString of KeyValue/Relati...

2017-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19363
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82574/
Test PASSed.


---




[GitHub] spark issue #19363: [SPARK-22224][Minor]Override toString of KeyValue/Relati...

2017-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19363
  
**[Test build #82574 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82574/testReport)**
 for PR 19363 at commit 
[`fe0d64a`](https://github.com/apache/spark/commit/fe0d64a1d5080d10fe6743f725107221acb9dd62).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19460: [SPARK-22222][core] Fix the ARRAY_MAX in BufferHolder an...

2017-10-09 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19460
  
LGTM


---




[GitHub] spark issue #19463: Cleanup comment in RDDSuite test

2017-10-09 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19463
  
Could you please update the description why you want to apply this change?


---




[GitHub] spark issue #19460: [SPARK-22222][core] Fix the ARRAY_MAX in BufferHolder an...

2017-10-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19460
  
LGTM


---




[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...

2017-10-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18664
  
@BryanCutler, BTW, do you think it is possible to de-duplicate timezone 
handling within Python side if we go for 1.?


---




[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...

2017-10-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18664
  
I think I prefer 1. Do you maybe have a preference @ueshin? I believe you 
are more insightful in this.


---




[GitHub] spark issue #19399: [SPARK-22175][WEB-UI] Add status column to history page

2017-10-09 Thread ajbozarth
Github user ajbozarth commented on the issue:

https://github.com/apache/spark/pull/19399
  
With @jerryshao's comments, I'm going to get off the fence and come down firmly against this; we already have too many things slowing down the SHS as it is.


---




[GitHub] spark issue #19082: [SPARK-21870][SQL] Split aggregation code into small fun...

2017-10-09 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/19082
  
Aha, fair enough. Based on that insight, one possible solution is to make whole-stage codegen consider the number of calls of the generated functions, though that approach does not seem simple. So, splitting functions step-by-step is the preferred approach for now...


---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18732
  
**[Test build #82577 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82577/testReport)**
 for PR 18732 at commit 
[`a064b21`](https://github.com/apache/spark/commit/a064b21b23d2c3dee9993c3b07d771fa8c09b8ba).


---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-09 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/18732
  
Merged some last minute changes from @BryanCutler to make the wrapping a 
bit cleaner. Thanks @BryanCutler!


---




[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-10-09 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19218#discussion_r143624224
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ---
@@ -68,6 +68,26 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
 .get("mapreduce.output.fileoutputformat.compress.type"))
 }
 
+fileSinkConf.tableInfo.getOutputFileFormatClassName match {
+  case formatName if formatName.endsWith("ParquetOutputFormat") =>
+    val compressionConf = "parquet.compression"
+    val compressionCodec = getCompressionByPriority(fileSinkConf, compressionConf,
+      sparkSession.sessionState.conf.parquetCompressionCodec) match {
+      case "NONE" => "UNCOMPRESSED"
+      case _@x => x
+    }
+    hadoopConf.set(compressionConf, compressionCodec)
+  case formatName if formatName.endsWith("OrcOutputFormat") =>
+    val compressionConf = "orc.compress"
+    val compressionCodec = getCompressionByPriority(fileSinkConf, compressionConf,
+      sparkSession.sessionState.conf.orcCompressionCodec) match {
+      case "UNCOMPRESSED" => "NONE"
--- End diff --

Yes, they are different: both the parameter names and the parameter values follow different styles, which appears to be a Parquet vs. ORC discrepancy.
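
To summarize the mismatch (a hypothetical lookup table, not Spark code):

```python
# Parquet and ORC spell "no compression" differently, hence the two
# conversions in the diff above.
NO_COMPRESSION_NAME = {
    "parquet.compression": "UNCOMPRESSED",  # Parquet's spelling
    "orc.compress": "NONE",                 # ORC's spelling
}
```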


---




[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-10-09 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19218#discussion_r143624210
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ---
@@ -68,6 +68,26 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
 .get("mapreduce.output.fileoutputformat.compress.type"))
 }
 
+fileSinkConf.tableInfo.getOutputFileFormatClassName match {
+  case formatName if formatName.endsWith("ParquetOutputFormat") =>
+    val compressionConf = "parquet.compression"
+    val compressionCodec = getCompressionByPriority(fileSinkConf, compressionConf,
+      sparkSession.sessionState.conf.parquetCompressionCodec) match {
+      case "NONE" => "UNCOMPRESSED"
+      case _@x => x
+    }
+    hadoopConf.set(compressionConf, compressionCodec)
+  case formatName if formatName.endsWith("OrcOutputFormat") =>
+    val compressionConf = "orc.compress"
+    val compressionCodec = getCompressionByPriority(fileSinkConf, compressionConf,
+      sparkSession.sessionState.conf.orcCompressionCodec) match {
+      case "UNCOMPRESSED" => "NONE"
+      case _@x => x
--- End diff --

In fact, the subsequent process will check the correctness of this value, and because `OrcOptions` is not accessible here, I have to add the "UNCOMPRESSED" => "NONE" conversion.
Do you have any good advice?


---




[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-10-09 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19218#discussion_r143624196
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ---
@@ -68,6 +68,26 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
 .get("mapreduce.output.fileoutputformat.compress.type"))
 }
 
+fileSinkConf.tableInfo.getOutputFileFormatClassName match {
+  case formatName if formatName.endsWith("ParquetOutputFormat") =>
+    val compressionConf = "parquet.compression"
+    val compressionCodec = getCompressionByPriority(fileSinkConf, compressionConf,
+      sparkSession.sessionState.conf.parquetCompressionCodec) match {
--- End diff --

`compressionConf` will be used below, I've adjusted the format, thanks.


---




[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-10-09 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19218#discussion_r143624181
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ---
@@ -68,6 +68,26 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
 .get("mapreduce.output.fileoutputformat.compress.type"))
 }
 
+fileSinkConf.tableInfo.getOutputFileFormatClassName match {
+  case formatName if formatName.endsWith("ParquetOutputFormat") =>
--- End diff --

Sounds like a good idea.


---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18732
  
**[Test build #82576 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82576/testReport)**
 for PR 18732 at commit 
[`b0410a2`](https://github.com/apache/spark/commit/b0410a25f710029e93caf69d9037c843e63f0c41).


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-09 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r143622623
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -435,6 +435,35 @@ class RelationalGroupedDataset protected[sql](
   df.logicalPlan.output,
   df.logicalPlan))
   }
+
+  /**
+   * Applies a vectorized python use-defined function to each group of data.
--- End diff --

Thanks! Fixed.


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-09 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r143622617
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInPandasExec.scala
 ---
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.python
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.TaskContext
+import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, ClusteredDistribution, Distribution, Partitioning}
+import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
+import org.apache.spark.sql.types.StructType
+
+/**
+ * Physical node for [[org.apache.spark.sql.catalyst.plans.logical.FlatMapGroupsInPandas]]
+ *
+ * Rows in each group are passed to the python worker as a Arrow record batch.
+ * The python worker turns the record batch to a pandas.DataFrame, invoke the
+ * use-defined function, and passes the resulting pandas.DataFrame
--- End diff --

Thanks! Fixed.


---




[GitHub] spark issue #19463: Cleanup comment in RDDSuite test

2017-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19463
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #19463: Cleanup comment in RDDSuite test

2017-10-09 Thread sohum2002
GitHub user sohum2002 opened a pull request:

https://github.com/apache/spark/pull/19463

Cleanup comment in RDDSuite test

## What changes were proposed in this pull request?

No functional changes were proposed; this PR only removes a comment from the RDDSuite test.

## How was this patch tested?

No tests were needed; this is a comment-only change.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sohum2002/spark cleanup-RDDSuite-test

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19463.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19463


commit c83ab1e5c51311ecb293e47e9c9694a9a49cfbaa
Author: Sachathamakul, Patrachai (Agoda) 
Date:   2017-10-10T03:14:27Z

Cleanup comment in RDDSuite test




---




[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...

2017-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19459
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82573/
Test PASSed.


---




[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...

2017-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19459
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...

2017-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19459
  
**[Test build #82573 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82573/testReport)**
 for PR 19459 at commit 
[`9d667c6`](https://github.com/apache/spark/commit/9d667c6fcb7e47169a2e48ec130fbdbb42a21f41).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten fun...

2017-10-09 Thread sohum2002
Github user sohum2002 closed the pull request at:

https://github.com/apache/spark/pull/19454


---




[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten functions ...

2017-10-09 Thread sohum2002
Github user sohum2002 commented on the issue:

https://github.com/apache/spark/pull/19454
  
Thank you all for your comments. I hope to improve in my future PRs. Cheers!


---




[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten functions ...

2017-10-09 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19454
  
Honestly I don't think it is worth doing this.



---




[GitHub] spark issue #19462: [SPARK-22159][SQL][FOLLOW-UP] Make config names consiste...

2017-10-09 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/19462
  
cc @rxin @gatorsmile 


---




[GitHub] spark issue #19462: [SPARK-22159][SQL][FOLLOW-UP] Make config names consiste...

2017-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19462
  
**[Test build #82575 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82575/testReport)**
 for PR 19462 at commit 
[`5bef05e`](https://github.com/apache/spark/commit/5bef05e3d84805866103766f6287ecb054dcad68).


---




[GitHub] spark pull request #19462: [SPARK-22159][SQL][FOLLOW-UP] Make config names c...

2017-10-09 Thread ueshin
GitHub user ueshin opened a pull request:

https://github.com/apache/spark/pull/19462

[SPARK-22159][SQL][FOLLOW-UP] Make config names consistently end with 
"enabled".

## What changes were proposed in this pull request?

This is a follow-up of #19384.

In the previous PR, only the definitions of the config names were modified, but we also need to modify the names specified as string literals at runtime or in tests.

## How was this patch tested?

Existing tests, with the config names updated.
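
For example, one rename this follow-up propagates (per the commits referenced earlier in this digest):

```python
# Old vs. new flag name; this follow-up updates string literals that
# still used the old spelling.
OLD = "spark.sql.execution.arrow.enable"
NEW = "spark.sql.execution.arrow.enabled"
```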


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ueshin/apache-spark issues/SPARK-22159/fup1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19462.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19462


commit 5bef05e3d84805866103766f6287ecb054dcad68
Author: Takuya UESHIN 
Date:   2017-10-10T02:23:47Z

Fix config names.




---




[GitHub] spark issue #19082: [SPARK-21870][SQL] Split aggregation code into small fun...

2017-10-09 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19082
  
The above reasoning also explains the motivation and the effect of #18931.

The generated code of query operators is extracted into individual smaller functions, which makes it possible for the JIT to step in.



---





[GitHub] spark issue #19082: [SPARK-21870][SQL] Split aggregation code into small fun...

2017-10-09 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19082
  
@maropu The code that does the aggregation is actually wrapped in a function, `doAggregateWithKeys`/`doAggregateWithoutKey`. This is also the part of the generated code this PR improves by extracting functions.

My initial thought is: during query processing, the function `doAggregateWithKeys`/`doAggregateWithoutKey` actually runs only once to aggregate over all rows. Whether or not it is a long function, the JIT gets no chance to step in, so the length of this function doesn't matter much for the JIT issue.

The long-function issue hurts the performance of whole-stage codegen because such a function runs many times without being JIT-compiled, dragging down the speed of the rest of the generated code. Since `doAggregateWithKeys`/`doAggregateWithoutKey` runs only once, it isn't affected much, so a whole-stage codegen query is still faster than a non-whole-stage codegen one.

This PR improves the aggregation because it extracts small functions from `doAggregateWithKeys`/`doAggregateWithoutKey`. Those functions run many times inside the wrapping function, so the JIT now has room to step in.
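
A schematic sketch of that splitting idea (plain Python for illustration; the real code is generated Java):

```python
# Illustration only, not generated Spark code: per-row work moves into
# small helper functions that are invoked once per row, so the JIT sees
# hot, compilable units instead of one huge loop body.
def agg_sum(row, state):
    state["sum"] += row

def agg_max(row, state):
    state["max"] = max(state["max"], row)

def do_aggregate_without_key(rows):
    state = {"sum": 0, "max": float("-inf")}
    for row in rows:  # the helpers above run once per row -> JIT-friendly
        agg_sum(row, state)
        agg_max(row, state)
    return state
```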



---




[GitHub] spark issue #19460: [SPARK-22222][core] Fix the ARRAY_MAX in BufferHolder an...

2017-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19460
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82572/
Test PASSed.


---




[GitHub] spark issue #19460: [SPARK-22222][core] Fix the ARRAY_MAX in BufferHolder an...

2017-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19460
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19460: [SPARK-22222][core] Fix the ARRAY_MAX in BufferHolder an...

2017-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19460
  
**[Test build #82572 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82572/testReport)**
 for PR 19460 at commit 
[`0f82f2d`](https://github.com/apache/spark/commit/0f82f2d5d49fc64e7a8ac4714900417a55ba72d1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19363: [SPARK-22224][Minor]Override toString of KeyValue/Relati...

2017-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19363
  
**[Test build #82574 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82574/testReport)**
 for PR 19363 at commit 
[`fe0d64a`](https://github.com/apache/spark/commit/fe0d64a1d5080d10fe6743f725107221acb9dd62).


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-09 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r143614190
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -435,6 +435,35 @@ class RelationalGroupedDataset protected[sql](
   df.logicalPlan.output,
   df.logicalPlan))
   }
+
+  /**
+   * Applies a vectorized python use-defined function to each group of data.
--- End diff --

nit: `use-defined` -> `user-defined`


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-09 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r143614283
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInPandasExec.scala
 ---
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.python
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.TaskContext
+import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, ClusteredDistribution, Distribution, Partitioning}
+import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
+import org.apache.spark.sql.types.StructType
+
+/**
+ * Physical node for [[org.apache.spark.sql.catalyst.plans.logical.FlatMapGroupsInPandas]]
+ *
+ * Rows in each group are passed to the python worker as a Arrow record batch.
+ * The python worker turns the record batch to a pandas.DataFrame, invoke the
+ * use-defined function, and passes the resulting pandas.DataFrame
--- End diff --

nit: `use-defined` -> `user-defined`


---




[GitHub] spark issue #19399: [SPARK-22175][WEB-UI] Add status column to history page

2017-10-09 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19399
  
I agree with @squito that the criteria for defining an application's success should be well considered. In your current code, an application is marked successful only if all of its jobs are successful; isn't that too strict, since it doesn't allow any failure and retry? Besides, if an application runs all of its Spark jobs successfully but fails in its own code (e.g., saving to a DB) and exits with a non-zero code, should we mark the application as succeeded or failed?

Also, the structure used to track all the jobs, `jobToStatus`, will increase memory occupation indefinitely in a long-running application.

Besides, with your changes I can see that page loading time will increase; for applications with many jobs (like Spark Streaming) the problem will be severe.


---




[GitHub] spark issue #19082: [SPARK-21870][SQL] Split aggregation code into small fun...

2017-10-09 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/19082
  
Either way, I think we first need to know why the regression on `q66` happens when whole-stage codegen is turned off. We first thought turning off too-long functions would perform better, but that is not always true. We should also check whether this regression happens on other JVM implementations. I'll look into these issues next.


---




[GitHub] spark pull request #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten fun...

2017-10-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19454#discussion_r143612478
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2543,6 +2543,14 @@ class Dataset[T] private[sql](
 mapPartitions(_.flatMap(func))
 
   /**
* Returns a new Dataset by by flattening a traversable collection into a collection itself.
+*
--- End diff --

(and `by by` -> `by`, I guess)


---




[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...

2017-10-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19459#discussion_r143610100
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala
 ---
@@ -203,4 +205,16 @@ private[sql] object ArrowConverters {
   reader.close()
 }
   }
+
+  def toDataFrame(
--- End diff --

Yup, I think we should put it there.


---




[GitHub] spark pull request #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten fun...

2017-10-09 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19454#discussion_r143608933
  
--- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
@@ -63,6 +63,7 @@ class RDDSuite extends SparkFunSuite with SharedSparkContext {
 assert(nums.map(_.toString).collect().toList === List("1", "2", "3", "4"))
 assert(nums.filter(_ > 2).collect().toList === List(3, 4))
 assert(nums.flatMap(x => 1 to x).collect().toList === List(1, 1, 2, 1, 2, 3, 1, 2, 3, 4))
+assert(sc.makeRDD(Array(Array(1,2,3,4), Array(1,2,3,4))).flatten == List(1,2,3,4,1,2,3,4))
--- End diff --

`.flatten.collect().toList`.
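
(For comparison, PySpark has no `RDD.flatten`; the equivalent there today is `flatMap` with an identity-like function. A sketch, assuming an active SparkContext `sc`:)

```python
# Same result as the proposed flatten, via flatMap.
sc.parallelize([[1, 2, 3, 4], [1, 2, 3, 4]]).flatMap(lambda xs: xs).collect()
# -> [1, 2, 3, 4, 1, 2, 3, 4]
```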


---




[GitHub] spark pull request #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten fun...

2017-10-09 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19454#discussion_r143607680
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -382,6 +382,13 @@ abstract class RDD[T: ClassTag](
   }
 
   /**
+* Return a new RDD by flattening a traversable collection into a collection itself.
+*/
--- End diff --

Please follow existing comment style like line 392.


---




[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...

2017-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19459
  
**[Test build #82573 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82573/testReport)**
 for PR 19459 at commit 
[`9d667c6`](https://github.com/apache/spark/commit/9d667c6fcb7e47169a2e48ec130fbdbb42a21f41).


---




[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...

2017-10-09 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/19459#discussion_r143607522
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala
 ---
@@ -203,4 +205,16 @@ private[sql] object ArrowConverters {
   reader.close()
 }
   }
+
+  def toDataFrame(
--- End diff --

I had to make this public to be callable with py4j.  Alternatively, 
something could be added to `o.a.s.sql.api.python.PythonSQLUtils`?


---




[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...

2017-10-09 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/19459#discussion_r143606693
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -3147,6 +3150,14 @@ def test_filtered_frame(self):
 self.assertEqual(pdf.columns[0], "i")
 self.assertTrue(pdf.empty)
 
+def test_createDataFrame_toggle(self):
+pdf = self.createPandasDataFrameFromeData()
+self.spark.conf.set("spark.sql.execution.arrow.enable", "false")
+df_no_arrow = self.spark.createDataFrame(pdf)
+self.spark.conf.set("spark.sql.execution.arrow.enable", "true")
--- End diff --

Hmmm, I thought the `tearDownClass` was there, but it's actually in #18664.  
Maybe I should add it here instead, since that PR needs some more discussion.
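
For reference, a rough sketch of what that class-level cleanup could look like (the class name and `spark` attribute are assumptions about the surrounding test suite; the conf key is from the quoted diff):

```python
@classmethod
def tearDownClass(cls):
    # Reset the Arrow flag once for the whole suite instead of per test.
    cls.spark.conf.set("spark.sql.execution.arrow.enable", "true")
    super(ArrowTests, cls).tearDownClass()  # class name assumed
```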


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten fun...

2017-10-09 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19454#discussion_r143606572
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2543,6 +2543,14 @@ class Dataset[T] private[sql](
 mapPartitions(_.flatMap(func))
 
   /**
+* Returns a new Dataset by flattening a traversable collection into 
a collection itself.
+*
--- End diff --

@group typedrel?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19460: [SPARK-22222][core] Fix the ARRAY_MAX in BufferHolder an...

2017-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19460
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82567/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19460: [SPARK-22222][core] Fix the ARRAY_MAX in BufferHolder an...

2017-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19460
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...

2017-10-09 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/19459#discussion_r143605840
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -3147,6 +3150,14 @@ def test_filtered_frame(self):
 self.assertEqual(pdf.columns[0], "i")
 self.assertTrue(pdf.empty)
 
+def test_createDataFrame_toggle(self):
+pdf = self.createPandasDataFrameFromeData()
+self.spark.conf.set("spark.sql.execution.arrow.enable", "false")
+df_no_arrow = self.spark.createDataFrame(pdf)
+self.spark.conf.set("spark.sql.execution.arrow.enable", "true")
--- End diff --

done. I guess this would make the failure easier to see?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19460: [SPARK-22222][core] Fix the ARRAY_MAX in BufferHolder an...

2017-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19460
  
**[Test build #82567 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82567/testReport)**
 for PR 19460 at commit 
[`90ecbcc`](https://github.com/apache/spark/commit/90ecbcc4f6909d7243a69014d5f76753fb451452).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19442: [SPARK-8515][ML][WIP] Improve ML Attribute API

2017-10-09 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19442
  
@VDuda Thanks for asking. This is a big change. I hope this PR can resolve 
SPARK-8515.

Most APIs are ready. I'm working on compatibility with the current 
attribute APIs. When that is ready, I'll re-open this for review.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19460: [SPARK-22222][core] Fix the ARRAY_MAX in BufferHolder an...

2017-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19460
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82564/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19460: [SPARK-22222][core] Fix the ARRAY_MAX in BufferHolder an...

2017-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19460
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19460: [SPARK-22222][core] Fix the ARRAY_MAX in BufferHolder an...

2017-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19460
  
**[Test build #82564 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82564/testReport)**
 for PR 19460 at commit 
[`09a1d5f`](https://github.com/apache/spark/commit/09a1d5fb689a979e5a48de7f90dfbc1f066bea86).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19461: [SPARK-22230] Swap per-row order in state store restore.

2017-10-09 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/19461
  
Discussed offline. We don't need to backport to branch-2.2.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...

2017-10-09 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/18664
  
Ok sounds good.  Could I get some opinions on the best way to convert 
internal Spark timestamps since they are stored as UTC time?  I think we have 
the following options:

1. Write Arrow data with a SESSION_LOCAL timestamp type (as is currently done 
in this PR), then convert to the local time zone and drop the timezone info in 
Python after the data is loaded into Pandas. This would happen at the end of 
`toPandas()` or just before the user function is called in `pandas_udf`s, 
converting back to UTC again just after.

2. Convert Spark internal data to the local time zone in Scala and write the 
Arrow data as timezone-naive.

With (1) it's easy to do the conversion with Pandas, but we have to make sure 
it gets done in multiple places. With (2) it's just in one spot, but I'm not 
sure whether we could end up doing the conversion more than once.
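
To make (1) concrete, a minimal Pandas sketch, assuming the values land in Pandas as naive UTC timestamps and "America/Los_Angeles" stands in for the session time zone:

```python
import pandas as pd

# Naive values that are really UTC, as Spark stores timestamps internally:
ts = pd.Series(pd.to_datetime(["2017-10-09 12:00:00"]))

# Mark them as UTC, shift to the local zone, then drop the tz info so the
# user sees naive local timestamps:
local_naive = (ts.dt.tz_localize("UTC")
                 .dt.tz_convert("America/Los_Angeles")
                 .dt.tz_localize(None))

# Inverse conversion, e.g. just after a pandas_udf returns:
back_to_utc = (local_naive.dt.tz_localize("America/Los_Angeles")
                          .dt.tz_convert("UTC")
                          .dt.tz_localize(None))
```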


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19269: [SPARK-22026][SQL][WIP] data source v2 write path

2017-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19269
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19269: [SPARK-22026][SQL][WIP] data source v2 write path

2017-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19269
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82571/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19269: [SPARK-22026][SQL][WIP] data source v2 write path

2017-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19269
  
**[Test build #82571 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82571/testReport)**
 for PR 19269 at commit 
[`2d41e44`](https://github.com/apache/spark/commit/2d41e44a1ae4067e55d19cf0425a8eb2e7d97b2a).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class WriteToDataSourceV2Command(writer: DataSourceV2Writer, 
query: LogicalPlan)`
  * `class RowToInternalRowDataWriteFactory(rowWriterFactory: 
DataWriteFactory[Row], schema: StructType)`
  * `class RowToInternalRowDataWriter(rowWriter: DataWriter[Row], encoder: 
ExpressionEncoder[Row])`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19270: [SPARK-21809] : Change Stage Page to use datatables to s...

2017-10-09 Thread ajbozarth
Github user ajbozarth commented on the issue:

https://github.com/apache/spark/pull/19270
  
So I think I know why the appId was handled the way it was: the live app UI 
no longer works because the appId var is "undefined" in all the API calls.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19270: [SPARK-21809] : Change Stage Page to use datatables to s...

2017-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19270
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19270: [SPARK-21809] : Change Stage Page to use datatables to s...

2017-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19270
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82563/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19270: [SPARK-21809] : Change Stage Page to use datatables to s...

2017-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19270
  
**[Test build #82563 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82563/testReport)**
 for PR 19270 at commit 
[`0b2a8cf`](https://github.com/apache/spark/commit/0b2a8cfaab8fa6bcb92176f74dce2f47ba65454d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19461: [SPARK-22230] Swap per-row order in state store restore.

2017-10-09 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/19461
  
Oh, there are some conflicts with 2.2. @joseph-torres could you submit a 
backport PR, please?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19461: [SPARK-22230] Swap per-row order in state store r...

2017-10-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19461


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19461: [SPARK-22230] Swap per-row order in state store restore.

2017-10-09 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/19461
  
Thanks! Merging to master and 2.2.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19461: [SPARK-22230] Swap per-row order in state store restore.

2017-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19461
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19461: [SPARK-22230] Swap per-row order in state store restore.

2017-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19461
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82566/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19461: [SPARK-22230] Swap per-row order in state store restore.

2017-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19461
  
**[Test build #82566 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82566/testReport)**
 for PR 19461 at commit 
[`17ef8a8`](https://github.com/apache/spark/commit/17ef8a843e7dec8da0625caeda213cb1f5c64a4a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...

2017-10-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19459#discussion_r143600411
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -3147,6 +3150,14 @@ def test_filtered_frame(self):
 self.assertEqual(pdf.columns[0], "i")
 self.assertTrue(pdf.empty)
 
+def test_createDataFrame_toggle(self):
+pdf = self.createPandasDataFrameFromeData()
+self.spark.conf.set("spark.sql.execution.arrow.enable", "false")
+df_no_arrow = self.spark.createDataFrame(pdf)
+self.spark.conf.set("spark.sql.execution.arrow.enable", "true")
--- End diff --

I'd set this to `true` in a `finally` just in case the test fails in 
`df_no_arrow = self.spark.createDataFrame(pdf)` and 
`spark.sql.execution.arrow.enable` remains `false`, affecting other test cases, 
if I didn't miss something.
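
Something like this rough sketch (the helper name and conf key are from the quoted diff; the final comparison is my assumption about how the test continues):

```python
def test_createDataFrame_toggle(self):
    pdf = self.createPandasDataFrameFromeData()
    try:
        self.spark.conf.set("spark.sql.execution.arrow.enable", "false")
        df_no_arrow = self.spark.createDataFrame(pdf)
    finally:
        # Restore the default even if createDataFrame above throws, so a
        # disabled Arrow path cannot leak into later test cases.
        self.spark.conf.set("spark.sql.execution.arrow.enable", "true")
    df_arrow = self.spark.createDataFrame(pdf)
    self.assertEqual(df_no_arrow.collect(), df_arrow.collect())
```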


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

2017-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19433
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82570/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

2017-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19433
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

2017-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19433
  
**[Test build #82570 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82570/testReport)**
 for PR 19433 at commit 
[`abc86b2`](https://github.com/apache/spark/commit/abc86b2042e0fd42cc0e9fe20cf79967b16e9779).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19460: [SPARK-22222][core] Fix the ARRAY_MAX in BufferHolder an...

2017-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19460
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19460: [SPARK-22222][core] Fix the ARRAY_MAX in BufferHolder an...

2017-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19460
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82562/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19460: [SPARK-22222][core] Fix the ARRAY_MAX in BufferHolder an...

2017-10-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19460
  
**[Test build #82562 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82562/testReport)**
 for PR 19460 at commit 
[`92a6d2d`](https://github.com/apache/spark/commit/92a6d2d53aea02042d47888e99df5a4f2167cd1f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...

2017-10-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18664
  
Yup, I think we already don't have timezone handling in `udf` either? I think 
we are fine as long as this keeps the existing behaviour. Let's not forget to 
handle all those cases when we deal with timezones in a separate PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19250: [SPARK-12297] Table timezone correction for Timestamps

2017-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19250
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82561/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19250: [SPARK-12297] Table timezone correction for Timestamps

2017-10-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19250
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


