[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...

2017-10-01 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19083#discussion_r142059729
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -432,25 +432,31 @@ class DataFrameAggregateSuite extends QueryTest with 
SharedSQLContext {
   }
 
   test("zero moments") {
-val input = Seq((1, 2)).toDF("a", "b")
-checkAnswer(
-  input.agg(stddev('a), stddev_samp('a), stddev_pop('a), variance('a),
-var_samp('a), var_pop('a), skewness('a), kurtosis('a)),
-  Row(Double.NaN, Double.NaN, 0.0, Double.NaN, Double.NaN, 0.0,
-Double.NaN, Double.NaN))
+// In SPARK-21871, we added code to check the actual bytecode size of gen'd methods. If the size
+// goes over `hugeMethodLimit`, Spark fails to compile the methods and the execution also fails
+// in test mode. So, we explicitly made this threshold higher here.
+// This workaround can be removed once this issue is fixed.
+withSQLConf(SQLConf.CODEGEN_HUGE_METHOD_LIMIT.key -> "16000") {
--- End diff --

sorry, my bad. I'll do so. Thanks!


---




[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...

2017-10-01 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19083#discussion_r142059351
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -432,25 +432,31 @@ class DataFrameAggregateSuite extends QueryTest with 
SharedSQLContext {
   }
 
   test("zero moments") {
-val input = Seq((1, 2)).toDF("a", "b")
-checkAnswer(
-  input.agg(stddev('a), stddev_samp('a), stddev_pop('a), variance('a),
-var_samp('a), var_pop('a), skewness('a), kurtosis('a)),
-  Row(Double.NaN, Double.NaN, 0.0, Double.NaN, Double.NaN, 0.0,
-Double.NaN, Double.NaN))
+// In SPARK-21871, we added code to check the actual bytecode size of gen'd methods. If the size
+// goes over `hugeMethodLimit`, Spark fails to compile the methods and the execution also fails
+// in test mode. So, we explicitly made this threshold higher here.
+// This workaround can be removed once this issue is fixed.
+withSQLConf(SQLConf.CODEGEN_HUGE_METHOD_LIMIT.key -> "16000") {
--- End diff --

Do you think it is ok to silently fall back to non-codegen cases in tests? I thought that the too-long function cases should also explicitly throw exceptions in tests, guarded by a check like `!Utils.isTesting && sqlContext.conf.codegenFallback`.
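
For illustration, here is a minimal sketch of that test-mode behavior (the helper and its parameters are hypothetical stand-ins, not the actual Spark internals):

```scala
// Hypothetical sketch: fail loudly in tests, fall back in production.
def handleHugeMethod(
    isTesting: Boolean,       // stand-in for Utils.isTesting
    codegenFallback: Boolean, // stand-in for sqlContext.conf.codegenFallback
    msg: String): Unit = {
  if (!isTesting && codegenFallback) {
    // Production path: warn and let the caller use the non-codegen path.
    println(s"WARN: $msg; falling back to non-codegen execution")
  } else {
    // Test path: surface the oversized method so the suite catches it.
    throw new IllegalStateException(msg)
  }
}
```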


---




[GitHub] spark pull request #19405: [SPARK-22178] [SQL] Refresh Persistent Views by R...

2017-10-01 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19405#discussion_r142059110
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala ---
@@ -474,13 +474,20 @@ class CatalogImpl(sparkSession: SparkSession) extends 
Catalog {
*/
   override def refreshTable(tableName: String): Unit = {
 val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
-// Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
-// Non-temp tables: refresh the metadata cache.
-sessionCatalog.refreshTable(tableIdent)
-
+val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent)
 // If this table is cached as an InMemoryRelation, drop the original
--- End diff --

This comment should be moved to line 491.


---




[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...

2017-10-01 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19083#discussion_r142058947
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -1092,13 +1082,30 @@ object CodeGenerator extends Logging {
 logInfo(s"\n${CodeFormatter.format(code, maxLines)}")
 throw new CompileException(msg, e.getLocation)
 }
+
+// Check if compiled code has a too large function
+methodsToByteCodeSize.foreach { case (name, byteCodeSize) =>
+  if (byteCodeSize > SQLConf.get.hugeMethodLimit) {
+val clazzName = evaluator.getClazz.getSimpleName
+val methodName = name.replace("$", "")
+val msg = s"failed to compile: the size of $clazzName.$methodName 
was $byteCodeSize and " +
+  "this value went over the limit 
`spark.sql.codegen.hugeMethodLimit`" +
+  s"(${SQLConf.get.hugeMethodLimit}). To avoid this error, you can 
make this limit higher."
+logError(msg)
+val maxLines = SQLConf.get.loggingMaxLinesForCodegen
+logInfo(s"\n${CodeFormatter.format(code, maxLines)}")
--- End diff --

aha, ok. I'll drop.


---




[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...

2017-10-01 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19083#discussion_r142058941
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -432,25 +432,31 @@ class DataFrameAggregateSuite extends QueryTest with 
SharedSQLContext {
   }
 
   test("zero moments") {
-val input = Seq((1, 2)).toDF("a", "b")
-checkAnswer(
-  input.agg(stddev('a), stddev_samp('a), stddev_pop('a), variance('a),
-var_samp('a), var_pop('a), skewness('a), kurtosis('a)),
-  Row(Double.NaN, Double.NaN, 0.0, Double.NaN, Double.NaN, 0.0,
-Double.NaN, Double.NaN))
+// In SPARK-21871, we added code to check the actual bytecode size of gen'd methods. If the size
+// goes over `hugeMethodLimit`, Spark fails to compile the methods and the execution also fails
+// in test mode. So, we explicitly made this threshold higher here.
+// This workaround can be removed once this issue is fixed.
+withSQLConf(SQLConf.CODEGEN_HUGE_METHOD_LIMIT.key -> "16000") {
--- End diff --

Once we can fall back to non-codegen execution later, we can revert those 
changes.


---




[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...

2017-10-01 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19083#discussion_r142058817
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -1092,13 +1082,30 @@ object CodeGenerator extends Logging {
 logInfo(s"\n${CodeFormatter.format(code, maxLines)}")
 throw new CompileException(msg, e.getLocation)
 }
+
+// Check if compiled code has a too large function
+methodsToByteCodeSize.foreach { case (name, byteCodeSize) =>
+  if (byteCodeSize > SQLConf.get.hugeMethodLimit) {
+val clazzName = evaluator.getClazz.getSimpleName
+val methodName = name.replace("$", "")
+val msg = s"failed to compile: the size of $clazzName.$methodName 
was $byteCodeSize and " +
+  "this value went over the limit 
`spark.sql.codegen.hugeMethodLimit`" +
+  s"(${SQLConf.get.hugeMethodLimit}). To avoid this error, you can 
make this limit higher."
+logError(msg)
+val maxLines = SQLConf.get.loggingMaxLinesForCodegen
+logInfo(s"\n${CodeFormatter.format(code, maxLines)}")
--- End diff --

We already have `logDebug` at line 1072. We don't need to log it again.


---




[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...

2017-10-01 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19083#discussion_r142058550
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -1020,6 +1006,10 @@ abstract class CodeGenerator[InType <: AnyRef, 
OutType <: AnyRef] extends Loggin
 }
 
 object CodeGenerator extends Logging {
+
+  // This is the value of HugeMethodLimit in the OpenJDK JVM settings
+  val DEFAULT_OPENJDK_JVM_HUGE_METHOD_LIMIT = 8000
--- End diff --

ok, I'll update


---




[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...

2017-10-01 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19083#discussion_r142058540
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -1092,13 +1082,30 @@ object CodeGenerator extends Logging {
 logInfo(s"\n${CodeFormatter.format(code, maxLines)}")
 throw new CompileException(msg, e.getLocation)
 }
+
+// Check if compiled code has a too large function
+methodsToByteCodeSize.foreach { case (name, byteCodeSize) =>
+  if (byteCodeSize > SQLConf.get.hugeMethodLimit) {
+val clazzName = evaluator.getClazz.getSimpleName
+val methodName = name.replace("$", "")
+val msg = s"failed to compile: the size of $clazzName.$methodName 
was $byteCodeSize and " +
+  "this value went over the limit 
`spark.sql.codegen.hugeMethodLimit`" +
+  s"(${SQLConf.get.hugeMethodLimit}). To avoid this error, you can 
make this limit higher."
+logError(msg)
+val maxLines = SQLConf.get.loggingMaxLinesForCodegen
+logInfo(s"\n${CodeFormatter.format(code, maxLines)}")
--- End diff --

aha, I see. I also think this info is less meaningful for users, but somewhat meaningful for developers. So, how about changing it to `logDebug`?


---




[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...

2017-10-01 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19083#discussion_r142058489
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -1020,6 +1006,10 @@ abstract class CodeGenerator[InType <: AnyRef, 
OutType <: AnyRef] extends Loggin
 }
 
 object CodeGenerator extends Logging {
+
+  // This is the value of HugeMethodLimit in the OpenJDK JVM settings
+  val DEFAULT_OPENJDK_JVM_HUGE_METHOD_LIMIT = 8000
--- End diff --

I think just `DEFAULT_JVM_HUGE_METHOD_LIMIT` is ok. The comment already 
explains it is from OpenJDK.


---




[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...

2017-10-01 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19083#discussion_r142058408
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -1092,13 +1082,30 @@ object CodeGenerator extends Logging {
 logInfo(s"\n${CodeFormatter.format(code, maxLines)}")
 throw new CompileException(msg, e.getLocation)
 }
+
+// Check if compiled code has a too large function
+methodsToByteCodeSize.foreach { case (name, byteCodeSize) =>
+  if (byteCodeSize > SQLConf.get.hugeMethodLimit) {
+val clazzName = evaluator.getClazz.getSimpleName
+val methodName = name.replace("$", "")
+val msg = s"failed to compile: the size of $clazzName.$methodName 
was $byteCodeSize and " +
+  "this value went over the limit 
`spark.sql.codegen.hugeMethodLimit`" +
+  s"(${SQLConf.get.hugeMethodLimit}). To avoid this error, you can 
make this limit higher."
+logError(msg)
+val maxLines = SQLConf.get.loggingMaxLinesForCodegen
+logInfo(s"\n${CodeFormatter.format(code, maxLines)}")
+throw new CompileException(msg, null)
--- End diff --

yea, I'll do so. Thanks!


---




[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...

2017-10-01 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19083#discussion_r142058372
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -1092,13 +1082,30 @@ object CodeGenerator extends Logging {
 logInfo(s"\n${CodeFormatter.format(code, maxLines)}")
 throw new CompileException(msg, e.getLocation)
 }
+
+// Check if compiled code has a too large function
+methodsToByteCodeSize.foreach { case (name, byteCodeSize) =>
+  if (byteCodeSize > SQLConf.get.hugeMethodLimit) {
+val clazzName = evaluator.getClazz.getSimpleName
+val methodName = name.replace("$", "")
+val msg = s"failed to compile: the size of $clazzName.$methodName 
was $byteCodeSize and " +
+  "this value went over the limit 
`spark.sql.codegen.hugeMethodLimit`" +
+  s"(${SQLConf.get.hugeMethodLimit}). To avoid this error, you can 
make this limit higher."
+logError(msg)
+val maxLines = SQLConf.get.loggingMaxLinesForCodegen
+logInfo(s"\n${CodeFormatter.format(code, maxLines)}")
+throw new CompileException(msg, null)
--- End diff --

We should fall back to non-codegen execution for too-long functions.


---




[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...

2017-10-01 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19083#discussion_r142058236
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -1092,13 +1082,30 @@ object CodeGenerator extends Logging {
 logInfo(s"\n${CodeFormatter.format(code, maxLines)}")
 throw new CompileException(msg, e.getLocation)
 }
+
+// Check if compiled code has a too large function
+methodsToByteCodeSize.foreach { case (name, byteCodeSize) =>
+  if (byteCodeSize > SQLConf.get.hugeMethodLimit) {
+val clazzName = evaluator.getClazz.getSimpleName
+val methodName = name.replace("$", "")
+val msg = s"failed to compile: the size of $clazzName.$methodName 
was $byteCodeSize and " +
+  "this value went over the limit 
`spark.sql.codegen.hugeMethodLimit`" +
+  s"(${SQLConf.get.hugeMethodLimit}). To avoid this error, you can 
make this limit higher."
+logError(msg)
+val maxLines = SQLConf.get.loggingMaxLinesForCodegen
+logInfo(s"\n${CodeFormatter.format(code, maxLines)}")
--- End diff --

Is it useful to log the gen'd code for this case? The code compiles successfully, so it might not be helpful to print it out.


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-01 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142057939
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala
 ---
@@ -44,14 +66,24 @@ case class ArrowEvalPythonExec(udfs: Seq[PythonUDF], 
output: Seq[Attribute], chi
 val schemaOut = StructType.fromAttributes(output.drop(child.output.length).zipWithIndex
   .map { case (attr, i) => attr.withName(s"_$i") })
 
+val batchSize = conf.arrowMaxRecordsPerBatch
+
+val batchIter = if (batchSize > 0) {
+  new BatchIterator(iter, batchSize)
+} else if (batchSize == 0) {
+  Iterator(iter)
+} else {
+  throw new IllegalArgumentException(s"MaxRecordsPerBatch must be >= 0, but is $batchSize")
--- End diff --

We should handle a negative value of `conf.arrowMaxRecordsPerBatch` the same as `0`, according to the doc of the conf:

> When using Apache Arrow, limit the maximum number of records that can be written to a single ArrowRecordBatch in memory. If set to zero or negative there is no limit.
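
A minimal sketch of the suggested branching, treating any non-positive size as "no limit" (`Iterator.grouped` stands in for the `BatchIterator` from the diff; the helper itself is illustrative):

```scala
// Sketch: zero or negative batchSize means a single unbounded batch.
def makeBatches[T](iter: Iterator[T], batchSize: Int): Iterator[Iterator[T]] = {
  if (batchSize > 0) {
    iter.grouped(batchSize).map(_.iterator) // stand-in for BatchIterator
  } else {
    Iterator(iter) // no limit: hand over the whole input as one batch
  }
}

// e.g. makeBatches(Iterator(1, 2, 3), -1) yields one batch with all three rows
```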


---




[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17702
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17702
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82378/
Test PASSed.


---




[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2017-10-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17702
  
**[Test build #82378 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82378/testReport)** for PR 17702 at commit [`d019656`](https://github.com/apache/spark/commit/d019656971069ebace05757dee56c214d143245c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-01 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142055482
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2181,31 +2186,69 @@ def udf(f=None, returnType=StringType()):
 @since(2.3)
 def pandas_udf(f=None, returnType=StringType()):
 """
-Creates a :class:`Column` expression representing a user defined function (UDF) that accepts
-`Pandas.Series` as input arguments and outputs a `Pandas.Series` of the same length.
+Creates a :class:`Column` expression representing a vectorized user defined function (UDF).
+
+The user-defined function can define one of the following transformations:
+1. One or more `pandas.Series` -> A `pandas.Series`
+
+   This udf is used with `DataFrame.withColumn` and `DataFrame.select`.
+   The returnType should be a primitive data type, e.g., DoubleType()
+
+   Example:
+
+   >>> from pyspark.sql.types import IntegerType, StringType
+   >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())
+   >>> @pandas_udf(returnType=StringType())
+   ... def to_upper(s):
+   ... return s.str.upper()
+   ...
+   >>> @pandas_udf(returnType="integer")
+   ... def add_one(x):
+   ... return x + 1
+   ...
+   >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age"))
+   >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\
+   ... .show()  # doctest: +SKIP
+   +----------+--------------+------------+
+   |slen(name)|to_upper(name)|add_one(age)|
+   +----------+--------------+------------+
+   |         8|      JOHN DOE|          22|
+   +----------+--------------+------------+
+
+2. A `pandas.DataFrame` -> A `pandas.DataFrame`
+
+   This udf is used with `GroupedData.apply`
+   The returnType should be a StructType describing the schema of the returned
+   `pandas.DataFrame`.
+
+   Example:
+
+   >>> df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 4.0)], ("id", "v"))
+   >>> @pandas_udf(returnType=df.schema)
+   ... def normalize(df):
+   ... v = df.v
+   ... ret = df.assign(v=(v - v.mean()) / v.std())
--- End diff --

We need `return ret`?


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-01 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142055226
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInPandasExec.scala
 ---
@@ -0,0 +1,95 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.python
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.TaskContext
+import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, AttributeSet, Expression, NamedExpression, SortOrder, UnsafeProjection}
+import org.apache.spark.sql.catalyst.plans.physical.{ClusteredDistribution, Distribution, Partitioning}
+import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
+
+case class FlatMapGroupsInPandasExec(
+grouping: Seq[Expression],
+func: Expression,
+override val output: Seq[Attribute],
+override val child: SparkPlan
+) extends UnaryExecNode {
+
+  val groupingAttributes: Seq[Attribute] = grouping.map {
+case ne: NamedExpression => ne.toAttribute
+  }
+
+  private val pandasFunction = func.asInstanceOf[PythonUDF].func
+
+  override def outputPartitioning: Partitioning = child.outputPartitioning
+
+  override def producedAttributes: AttributeSet = AttributeSet(output)
+
+  override def requiredChildDistribution: Seq[Distribution] =
+ClusteredDistribution(groupingAttributes) :: Nil
+
+  override def requiredChildOrdering: Seq[Seq[SortOrder]] =
+Seq(groupingAttributes.map(SortOrder(_, Ascending)))
+
+  override protected def doExecute(): RDD[InternalRow] = {
+val inputRDD = child.execute()
+
+val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536)
+val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true)
+val chainedFunc = Seq(ChainedPythonFunctions(Seq(pandasFunction)))
+val argOffsets = Array((0 until child.schema.length).toArray)
+
+inputRDD.mapPartitionsInternal { iter =>
+  val grouped = GroupedIterator(iter, groupingAttributes, child.output)
--- End diff --

I was thinking that the implementation at that time didn't support grouping like:

```python
df.groupby(col('id') % 2 == 0).apply(...)
```

but the change I proposed doesn't work either.
The current implementation doesn't seem to support the grouping above, though.


---




[GitHub] spark pull request #19392: [SPARK-22169][SQL] table name with numbers and ch...

2017-10-01 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19392#discussion_r142054949
  
--- Diff: 
sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 ---
@@ -510,11 +510,15 @@ rowFormat
 ;
 
 tableIdentifier
-: (db=identifier '.')? table=identifier
+: (db=identifierPart '.')? table=identifierPart
 ;
 
 functionIdentifier
-: (db=identifier '.')? function=identifier
+: (db=identifierPart '.')? function=identifierPart
+;
+
+identifierPart
+: identifier | BIGINT_LITERAL | SMALLINT_LITERAL | TINYINT_LITERAL | BYTELENGTH_LITERAL
--- End diff --

`INTEGER_VALUE`?


---




[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...

2017-10-01 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/18931
  
ping @cloud-fan @gatorsmile Please take a look and review. Thanks.


---




[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...

2017-10-01 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19083#discussion_r142053683
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -1092,13 +1082,30 @@ object CodeGenerator extends Logging {
 logInfo(s"\n${CodeFormatter.format(code, maxLines)}")
 throw new CompileException(msg, e.getLocation)
 }
+
+// Check if compiled code has a too large function
+methodsToByteCodeSize.foreach { case (name, byteCodeSize) =>
+  if (byteCodeSize > SQLConf.get.hugeMethodLimit) {
+val clazzName = evaluator.getClazz.getSimpleName
+val methodName = name.replace("$", "")
+val msg = s"failed to compile: the size of $clazzName.$methodName 
was $byteCodeSize and " +
+  "this value went over the limit 
`spark.sql.codegen.hugeMethodLimit`" +
+  s"(${SQLConf.get.hugeMethodLimit}). To avoid this error, you can 
make this limit higher."
+logError(msg)
+val maxLines = SQLConf.get.loggingMaxLinesForCodegen
+logInfo(s"\n${CodeFormatter.format(code, maxLines)}")
+throw new CompileException(msg, null)
--- End diff --

ok, I'll try to refactor this.


---




[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...

2017-10-01 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19083#discussion_r142053181
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -1092,13 +1082,30 @@ object CodeGenerator extends Logging {
 logInfo(s"\n${CodeFormatter.format(code, maxLines)}")
 throw new CompileException(msg, e.getLocation)
 }
+
+// Check if compiled code has a too large function
+methodsToByteCodeSize.foreach { case (name, byteCodeSize) =>
+  if (byteCodeSize > SQLConf.get.hugeMethodLimit) {
+val clazzName = evaluator.getClazz.getSimpleName
+val methodName = name.replace("$", "")
+val msg = s"failed to compile: the size of $clazzName.$methodName 
was $byteCodeSize and " +
+  "this value went over the limit 
`spark.sql.codegen.hugeMethodLimit`" +
+  s"(${SQLConf.get.hugeMethodLimit}). To avoid this error, you can 
make this limit higher."
+logError(msg)
+val maxLines = SQLConf.get.loggingMaxLinesForCodegen
+logInfo(s"\n${CodeFormatter.format(code, maxLines)}")
+throw new CompileException(msg, null)
--- End diff --

Previously, it worked although it did not perform well. Now, after this change, the query does not work at all. This could cause regressions. I think we should make a change to automatically fall back to the slow mode.
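
For illustration, a rough sketch of the automatic-fallback shape being asked for (the helper and its names are hypothetical; the real change would live in the codegen layer):

```scala
// Hypothetical sketch: try the compiled path, and fall back to the slow
// (interpreted) path instead of failing the whole query.
def executeWithFallback[A](compiled: () => A, interpreted: () => A): A = {
  try {
    compiled()
  } catch {
    case e: Exception =>
      // Log and degrade gracefully rather than surfacing a regression.
      println(s"WARN: codegen failed (${e.getMessage}); using interpreted mode")
      interpreted()
  }
}
```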


---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-01 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/18732
  
@HyukjinKwon Thanks for the feedback. I will address those and update 
tomorrow.


---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-01 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/18732
  
@rxin, `transform` takes a function: pd.Series -> pd.Series and applies the function to all columns:

```
df.show()

 id   v1   v2  v3
 a  1.0  4.0  0.0
 a  2.0  5.0  1.0
 a  3.0  6.0  1.0

df.groupby('id').transform(pandas_udf(lambda v: v - v.mean(), 
DoubleType())).show()

 id   v1   v2v3
 a  -1.0 -1.0-0.67
 a   0.0  0.0 0.33
 a   1.0  1.0 0.33
```

This is mimicking `pd.DataFrame.groupby().transform`

`apply` takes a function: pd.DataFrame -> pd.DataFrame and is similar to 
`flatMapGroups`

The name `apply` originates from the R paper "The Split-Apply-Combine Strategy for Data Analysis" and is used in both pandas and R to describe this function, so the name `apply` should be pretty straightforward to pandas/Python users.


---




[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-01 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/19406
  
@srowen For example, given input data 1 to 10, if a user queries the 10% (or even lower) percentile, it should return 1, because the first value 1 already reaches the 10% percentage. Without this fix, it returned 2. I've updated the description by adding this example to the JIRA and PR, thanks.
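
To make this concrete, a small illustrative sketch (assuming a `SparkSession` in scope as `spark`; the expected result is the one described above):

```scala
// Values 1..10: the 10% percentile should be 1, because the first value
// alone already covers 10% of the data.
val df = spark.range(1, 11).toDF("v")
df.createOrReplaceTempView("t")
spark.sql("SELECT percentile_approx(v, 0.1) FROM t").show()
// expected after the fix: 1 (it returned 2 before the fix)
```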


---




[GitHub] spark issue #19083: [SPARK-21871][SQL] Check actual bytecode size when compi...

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19083
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82375/
Test PASSed.


---




[GitHub] spark issue #19083: [SPARK-21871][SQL] Check actual bytecode size when compi...

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19083
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19083: [SPARK-21871][SQL] Check actual bytecode size when compi...

2017-10-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19083
  
**[Test build #82375 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82375/testReport)** for PR 19083 at commit [`1fa7c1c`](https://github.com/apache/spark/commit/1fa7c1cac9e4711818df6957260acb751b6dd8b2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17862
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17862
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82377/
Test PASSed.


---




[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...

2017-10-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17862
  
**[Test build #82377 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82377/testReport)** for PR 17862 at commit [`1f8e984`](https://github.com/apache/spark/commit/1f8e98411401add53dddbb13e94f450c27b7c0fd).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2017-10-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17702
  
**[Test build #82378 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82378/testReport)** for PR 17702 at commit [`d019656`](https://github.com/apache/spark/commit/d019656971069ebace05757dee56c214d143245c).


---




[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2017-10-01 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/17702
  
retest this please


---




[GitHub] spark pull request #19408: [SPARK-22180][CORE] Allow IPv6 address in org.apa...

2017-10-01 Thread obermeier
Github user obermeier commented on a diff in the pull request:

https://github.com/apache/spark/pull/19408#discussion_r142047496
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -981,7 +981,13 @@ private[spark] object Utils extends Logging {
   return cached
 }
 
-val indx: Int = hostPort.lastIndexOf(':')
+val indx: Int =
+  // Interpret hostPort as literal IPv6 address if it contains two ore more colons
+  // scalastyle:off SingleSpaceBetweenRParenAndLCurlyBrace
+  if (hostPort.matches("(([0-9a-f]*):([0-9a-f]*)){2,}")) -1
--- End diff --

Yes, a real parser would be much better!!
I hope the methods will check the input, like the name resolver...

At this point I thought more about the separation of the port.
I think it is important to check that two colons exist; otherwise this expression accepts hostnames like abc:123
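
To make the two-colon point concrete, here is a quick illustrative check of the two candidate patterns from this thread (neither is a full IPv6 validator; the snippet is only a sketch):

```scala
// "abc:123" is a plain host:port, yet it matches the one-colon shape;
// requiring two or more colons rejects it.
val twoColons = "(([0-9a-f]*):([0-9a-f]*)){2,}".r // pattern from the diff
val oneColon  = "[0-9a-f]*(:[0-9a-f]*)+".r        // srowen's suggested shape

for (s <- Seq("::1", "fe80::1", "abc:123", "host:123")) {
  val two = twoColons.pattern.matcher(s).matches()
  val one = oneColon.pattern.matcher(s).matches()
  println(s"$s -> twoColons=$two, oneColon=$one")
}
// ::1 and fe80::1 match both; abc:123 matches only the one-colon shape;
// host:123 matches neither (h, o, s, t are not hex digits).
```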


---




[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...

2017-10-01 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19222
  
@hvanhovell could you please review this?


---




[GitHub] spark issue #18704: [SPARK-20783][SQL] Create ColumnVector to abstract exist...

2017-10-01 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/18704
  
@cloud-fan could you please review this?


---




[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...

2017-10-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17862
  
**[Test build #82377 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82377/testReport)** for PR 17862 at commit [`1f8e984`](https://github.com/apache/spark/commit/1f8e98411401add53dddbb13e94f450c27b7c0fd).


---




[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17862
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17862
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82376/
Test FAILed.


---




[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...

2017-10-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17862
  
**[Test build #82376 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82376/testReport)** for PR 17862 at commit [`bf4d955`](https://github.com/apache/spark/commit/bf4d955082ce94028965eb034efab74f6ff8d0b8).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...

2017-10-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17862
  
**[Test build #82376 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82376/testReport)** for PR 17862 at commit [`bf4d955`](https://github.com/apache/spark/commit/bf4d955082ce94028965eb034efab74f6ff8d0b8).


---




[GitHub] spark issue #19083: [SPARK-21871][SQL] Check actual bytecode size when compi...

2017-10-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19083
  
**[Test build #82375 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82375/testReport)** for PR 19083 at commit [`1fa7c1c`](https://github.com/apache/spark/commit/1fa7c1cac9e4711818df6957260acb751b6dd8b2).


---




[GitHub] spark issue #17359: [SPARK-20028][SQL] Add aggreagate expression nGrams

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17359
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #19403: [R][BUILD][WIP] test

2017-10-01 Thread felixcheung
Github user felixcheung closed the pull request at:

https://github.com/apache/spark/pull/19403


---




[GitHub] spark pull request #19408: [SPARK-22180][CORE] Allow IPv6 address in org.apa...

2017-10-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19408#discussion_r142038575
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -981,7 +981,13 @@ private[spark] object Utils extends Logging {
   return cached
 }
 
-val indx: Int = hostPort.lastIndexOf(':')
+val indx: Int =
+  // Interpret hostPort as literal IPv6 address if it contains two ore more colons
+  // scalastyle:off SingleSpaceBetweenRParenAndLCurlyBrace
+  if (hostPort.matches("(([0-9a-f]*):([0-9a-f]*)){2,}")) -1
+  // Else last colon defines start of port definition
+  else hostPort.lastIndexOf(':')
--- End diff --

Final nit, use braces for both parts of the if-else


---




[GitHub] spark pull request #19408: [SPARK-22180][CORE] Allow IPv6 address in org.apa...

2017-10-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19408#discussion_r142038567
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -981,7 +981,13 @@ private[spark] object Utils extends Logging {
   return cached
 }
 
-val indx: Int = hostPort.lastIndexOf(':')
+val indx: Int =
+  // Interpret hostPort as literal IPv6 address if it contains two ore more colons
+  // scalastyle:off SingleSpaceBetweenRParenAndLCurlyBrace
+  if (hostPort.matches("(([0-9a-f]*):([0-9a-f]*)){2,}")) -1
--- End diff --

Hex digits can be uppercase, right?
Should the pattern not be more like `[0-9a-f]*(:[0-9a-f]*)+`, matching a number, then colon-number pairs, rather than number-colon-number number-colon-number sequences?
It might end up being equivalent because the match is for 0 or more digits.

This allows some strings that it shouldn't, like "", but the purpose isn't to catch every possible case, I guess. It would fail name resolution.

I thought Inet6Address would just provide parsing for this, but I guess not.


---




[GitHub] spark pull request #19408: [SPARK-22180][CORE] Allow IPv6 address in org.apa...

2017-10-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19408#discussion_r142038556
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -981,7 +981,13 @@ private[spark] object Utils extends Logging {
   return cached
 }
 
-val indx: Int = hostPort.lastIndexOf(':')
+val indx: Int =
+  // Interpret hostPort as literal IPv6 address if it contains two ore more colons
+  // scalastyle:off SingleSpaceBetweenRParenAndLCurlyBrace
--- End diff --

Turn the rule back on after the block?


---




[GitHub] spark pull request #19408: [SPARK-22180][CORE] Allow IPv6 address in org.apa...

2017-10-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19408#discussion_r142038519
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -981,7 +981,13 @@ private[spark] object Utils extends Logging {
   return cached
 }
 
-val indx: Int = hostPort.lastIndexOf(':')
+val indx: Int =
+  // Interpret hostPort as literal IPv6 address if it contains two ore more colons
--- End diff --

Nit: ore
You might note here that you're checking that you _don't_ have a `[::1]:123` IPv6 address -- the brackets are key.


---




[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...

2017-10-01 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19083#discussion_r142037322
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala
 ---
@@ -301,10 +301,10 @@ class AggregateBenchmark extends BenchmarkBase {
 */
   }
 
-  ignore("max function length of wholestagecodegen") {
+  test("max function length of codegen") {
--- End diff --

This should be "ignore"


---




[GitHub] spark issue #19083: [SPARK-21871][SQL] Check actual bytecode size when compi...

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19083
  
Build finished. Test FAILed.


---




[GitHub] spark issue #19083: [SPARK-21871][SQL] Check actual bytecode size when compi...

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19083
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82373/
Test FAILed.


---




[GitHub] spark issue #19083: [SPARK-21871][SQL] Check actual bytecode size when compi...

2017-10-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19083
  
**[Test build #82373 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82373/testReport)** for PR 19083 at commit [`8744566`](https://github.com/apache/spark/commit/874456648732309b35cefe36d45783eed77ee7b0).
 * This patch **fails Spark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19229: [SPARK-22001][ML][SQL] ImputerModel can do withCo...

2017-10-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19229


---




[GitHub] spark issue #19405: [SPARK-22178] [SQL] Refresh Persistent Views by REFRESH ...

2017-10-01 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19405
  
cc @cloud-fan 


---




[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-10-01 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19229
  
LGTM


---




[GitHub] spark pull request #19401: [SPARK-22176][SQL] Fix overflow issue in Dataset....

2017-10-01 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19401#discussion_r142035876
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -237,7 +237,7 @@ class Dataset[T] private[sql](
*/
   private[sql] def showString(
   _numRows: Int, truncate: Int = 20, vertical: Boolean = false): String = {
-val numRows = _numRows.max(0)
+val numRows = _numRows.max(0).min(Int.MaxValue - 1)
--- End diff --

Spark SQL does not work when the number of rows is close to `Int.MaxValue`; the driver will OOM before finishing the command. Thus, I do not think we can hit this extreme case.
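
For reference, a tiny illustration of the overflow the clamp to `Int.MaxValue - 1` guards against (assuming, as the diff context suggests, that `showString` fetches `numRows + 1` rows to detect truncation):

```scala
// Int arithmetic wraps around silently at the maximum value, so a
// request for Int.MaxValue rows would ask for a negative row count.
val numRows = Int.MaxValue
println(numRows + 1) // prints -2147483648
```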


---




[GitHub] spark issue #19083: [SPARK-21871][SQL] Check actual bytecode size when compi...

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19083
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19083: [SPARK-21871][SQL] Check actual bytecode size when compi...

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19083
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82374/
Test FAILed.


---




[GitHub] spark issue #19083: [SPARK-21871][SQL] Check actual bytecode size when compi...

2017-10-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19083
  
**[Test build #82374 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82374/testReport)** for PR 19083 at commit [`08d31f3`](https://github.com/apache/spark/commit/08d31f3920edf32353298b5067e083c2643917db).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19083: [SPARK-21871][SQL] Check actual bytecode size when compi...

2017-10-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19083
  
**[Test build #82374 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82374/testReport)** for PR 19083 at commit [`08d31f3`](https://github.com/apache/spark/commit/08d31f3920edf32353298b5067e083c2643917db).


---




[GitHub] spark issue #19083: [SPARK-21871][SQL] Check actual bytecode size when compi...

2017-10-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19083
  
**[Test build #82373 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82373/testReport)** for PR 19083 at commit [`8744566`](https://github.com/apache/spark/commit/874456648732309b35cefe36d45783eed77ee7b0).


---




[GitHub] spark issue #19181: [SPARK-21907][CORE] oom during spill

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19181
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82372/
Test PASSed.


---




[GitHub] spark issue #19181: [SPARK-21907][CORE] oom during spill

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19181
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19181: [SPARK-21907][CORE] oom during spill

2017-10-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19181
  
**[Test build #82372 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82372/testReport)** for PR 19181 at commit [`e3da029`](https://github.com/apache/spark/commit/e3da02911ec67ba586b9d7db5a87251f39f06cb5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19408: [SPARK-22180][CORE] Allow IPv6

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19408
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #19408: [SPARK-22180][CORE] Allow IPv6

2017-10-01 Thread obermeier
GitHub user obermeier opened a pull request:

https://github.com/apache/spark/pull/19408

[SPARK-22180][CORE] Allow IPv6

External applications like Apache Cassandra are able to deal with IPv6 addresses. Libraries like spark-cassandra-connector combine Apache Cassandra with Apache Spark.
This combination is very useful IMHO.

One problem is that `org.apache.spark.util.Utils.parseHostPort(hostPort: String)` takes the last colon to separate the port from the host part. This conflicts with literal IPv6 addresses.

I think we can take `hostPort` as a literal IPv6 address if it contains two or more colons. If IPv6 addresses are enclosed in square brackets, port definition is still possible.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/obermeier/spark issue/SPARK-22180

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19408.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19408


commit 3783f85b18540ed8746c078ad9c2f12d7167be9d
Author: Stefan Obermeier 
Date:   2017-10-01T14:28:58Z

[SPARK-22180][CORE] Allow IPv6




---




[GitHub] spark issue #19399: [SPARK-22175][WEB-UI] Add status column to history page

2017-10-01 Thread caneGuy
Github user caneGuy commented on the issue:

https://github.com/apache/spark/pull/19399
  
@jerryshao @squito Could you help review this? Thanks


---




[GitHub] spark pull request #19401: [SPARK-22176][SQL] Fix overflow issue in Dataset....

2017-10-01 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19401#discussion_r142031417
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -237,7 +237,7 @@ class Dataset[T] private[sql](
*/
   private[sql] def showString(
   _numRows: Int, truncate: Int = 20, vertical: Boolean = false): String = {
-val numRows = _numRows.max(0)
+val numRows = _numRows.max(0).min(Int.MaxValue - 1)
--- End diff --

hmm, I see. Both are okay to me, WDYT? cc: @gatorsmile
IMHO it might still be okay to set `[0, Int.MaxValue)` as the valid range for `show`, since this is a super extreme corner case.


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142029595
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2181,31 +2186,69 @@ def udf(f=None, returnType=StringType()):
 @since(2.3)
 def pandas_udf(f=None, returnType=StringType()):
 """
-Creates a :class:`Column` expression representing a user defined function (UDF) that accepts
-`Pandas.Series` as input arguments and outputs a `Pandas.Series` of the same length.
+Creates a :class:`Column` expression representing a vectorized user defined function (UDF).
+
+The user-defined function can define one of the following transformations:
+1. One or more `pandas.Series` -> A `pandas.Series`
--- End diff --

Looks like we should leave a newline for the Python doc.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142029720
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2181,31 +2186,69 @@ def udf(f=None, returnType=StringType()):
 @since(2.3)
 def pandas_udf(f=None, returnType=StringType()):
 """
-Creates a :class:`Column` expression representing a user defined 
function (UDF) that accepts
-`Pandas.Series` as input arguments and outputs a `Pandas.Series` of 
the same length.
+Creates a :class:`Column` expression representing a vectorized user 
defined function (UDF).
+
+The user-defined function can define one of the following 
transformations:
+1. One or more `pandas.Series` -> A `pandas.Series`
+
+   This udf is used with `DataFrame.withColumn` and `DataFrame.select`.
+   The returnType should be a primitive data type, e.g., DoubleType()
--- End diff --

little nit: `` `DoubleType()` ``


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142029736
  
--- Diff: python/pyspark/sql/group.py ---
@@ -194,6 +194,37 @@ def pivot(self, pivot_col, values=None):
 jgd = self._jgd.pivot(pivot_col, values)
 return GroupedData(jgd, self.sql_ctx)
 
+def apply(self, udf_obj):
+"""
+Maps each group of the current [[DataFrame]] using a pandas udf 
and returns the result
--- End diff --

little nit: `[[DataFrame]]` to  ``:class:`DataFrame` ``


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142029696
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2181,31 +2186,69 @@ def udf(f=None, returnType=StringType()):
 @since(2.3)
 def pandas_udf(f=None, returnType=StringType()):
 """
-Creates a :class:`Column` expression representing a user defined 
function (UDF) that accepts
-`Pandas.Series` as input arguments and outputs a `Pandas.Series` of 
the same length.
+Creates a :class:`Column` expression representing a vectorized user 
defined function (UDF).
+
+The user-defined function can define one of the following 
transformations:
+1. One or more `pandas.Series` -> A `pandas.Series`
+
+   This udf is used with `DataFrame.withColumn` and `DataFrame.select`.
+   The returnType should be a primitive data type, e.g., DoubleType()
+
+   Example:
+
+   >>> from pyspark.sql.types import IntegerType, StringType
+   >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())
+   >>> @pandas_udf(returnType=StringType())
+   ... def to_upper(s):
+   ... return s.str.upper()
+   ...
+   >>> @pandas_udf(returnType="integer")
+   ... def add_one(x):
+   ... return x + 1
+   ...
+   >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", 
"name", "age"))
+   >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), 
add_one("age")) \\
+   ... .show()  # doctest: +SKIP
+   +----------+--------------+------------+
+   |slen(name)|to_upper(name)|add_one(age)|
+   +----------+--------------+------------+
+   |         8|      JOHN DOE|          22|
+   +----------+--------------+------------+
+
+2. A `pandas.DataFrame` -> A `pandas.DataFrame`
--- End diff --

This looks like it produces a warning here:

```
spark/python/pyspark/sql/functions.py:docstring of 
pyspark.sql.functions.pandas_udf:46: WARNING: Enumerated list ends without a 
blank line; unexpected unindent.
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142029802
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInPandasExec.scala
 ---
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.python
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.TaskContext
+import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, 
AttributeSet, Expression, NamedExpression, SortOrder, UnsafeProjection}
+import 
org.apache.spark.sql.catalyst.plans.physical.{ClusteredDistribution, 
Distribution, Partitioning}
+import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, 
UnaryExecNode}
+
+case class FlatMapGroupsInPandasExec(
+groupingAttributes: Seq[Attribute],
+func: Expression,
+output: Seq[Attribute],
+child: SparkPlan)
+  extends UnaryExecNode {
+
+  private val pandasFunction = func.asInstanceOf[PythonUDF].func
+
+  override def outputPartitioning: Partitioning = child.outputPartitioning
+
+  override def producedAttributes: AttributeSet = AttributeSet(output)
+
+  override def requiredChildDistribution: Seq[Distribution] =
+ClusteredDistribution(groupingAttributes) :: Nil
+
+  override def requiredChildOrdering: Seq[Seq[SortOrder]] =
+Seq(groupingAttributes.map(SortOrder(_, Ascending)))
+
+  override protected def doExecute(): RDD[InternalRow] = {
+val inputRDD = child.execute()
+
+val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536)
+val reuseWorker = 
inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true)
+val chainedFunc = Seq(ChainedPythonFunctions(Seq(pandasFunction)))
+val argOffsets = Array((0 until child.schema.length).toArray)
+
+inputRDD.mapPartitionsInternal { iter =>
+  val grouped = GroupedIterator(iter, groupingAttributes, child.output)
+  val context = TaskContext.get()
+
+  val columnarBatchIter = new ArrowPythonRunner(
+chainedFunc, bufferSize, reuseWorker,
+PythonEvalType.SQL_PANDAS_UDF, argOffsets, child.schema)
+.compute(grouped.map(_._2), context.partitionId(), context)
+
+  val vectorRowIter = new Iterator[InternalRow] {
+private var currentIter = if (columnarBatchIter.hasNext) {
+  val batch = columnarBatchIter.next()
+  // assert(schemaOut.equals(batch.schema),
+  //  s"Invalid schema from pandas_udf: expected $schemaOut, got 
${batch.schema}")
--- End diff --

Let's avoid commented-out code.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142029714
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2181,31 +2186,69 @@ def udf(f=None, returnType=StringType()):
 @since(2.3)
 def pandas_udf(f=None, returnType=StringType()):
 """
-Creates a :class:`Column` expression representing a user defined 
function (UDF) that accepts
-`Pandas.Series` as input arguments and outputs a `Pandas.Series` of 
the same length.
+Creates a :class:`Column` expression representing a vectorized user 
defined function (UDF).
+
+The user-defined function can define one of the following 
transformations:
+1. One or more `pandas.Series` -> A `pandas.Series`
+
+   This udf is used with `DataFrame.withColumn` and `DataFrame.select`.
+   The returnType should be a primitive data type, e.g., DoubleType()
+
+   Example:
+
+   >>> from pyspark.sql.types import IntegerType, StringType
+   >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())
+   >>> @pandas_udf(returnType=StringType())
+   ... def to_upper(s):
+   ... return s.str.upper()
+   ...
+   >>> @pandas_udf(returnType="integer")
+   ... def add_one(x):
+   ... return x + 1
+   ...
+   >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", 
"name", "age"))
+   >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), 
add_one("age")) \\
+   ... .show()  # doctest: +SKIP
+   +----------+--------------+------------+
+   |slen(name)|to_upper(name)|add_one(age)|
+   +----------+--------------+------------+
+   |         8|      JOHN DOE|          22|
+   +----------+--------------+------------+
+
+2. A `pandas.DataFrame` -> A `pandas.DataFrame`
+
+   This udf is used with `GroupedData.apply`
+   The returnType should be a StructType describing the schema of the 
returned
+   `pandas.DataFrame`.
+
+   Example:
+
+   >>> df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 
4.0)], ("id", "v"))
+   >>> @pandas_udf(returnType=df.schema)
+   ... def normalize(df):
+   ... v = df.v
+   ... ret = df.assign(v=(v - v.mean()) / v.std())
+   >>> df.groupby('id').apply(normalize).show() # doctest: + SKIP
+  +---+-------------------+
+  | id|                  v|
+  +---+-------------------+
+  |  1|-0.7071067811865475|
+  |  1| 0.7071067811865475|
+  |  2|-0.7071067811865475|
+  |  2| 0.7071067811865475|
+  +---+-------------------+
--- End diff --

This produces the Python doc output below:

https://user-images.githubusercontent.com/6477701/31054996-b6f5737a-a6f8-11e7-9559-239ab74b1cb4.png

Sounds related to this warning:

```
spark/python/pyspark/sql/functions.py:docstring of 
pyspark.sql.functions.pandas_udf:46: WARNING: Enumerated list ends without a 
blank line; unexpected unindent.
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142029786
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInPandasExec.scala
 ---
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.python
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.TaskContext
+import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, 
AttributeSet, Expression, NamedExpression, SortOrder, UnsafeProjection}
--- End diff --

FYI, we could just use a wildcard import if it exceeds 6 instances: 
https://github.com/databricks/scala-style-guide#imports


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r142029655
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2181,31 +2186,69 @@ def udf(f=None, returnType=StringType()):
 @since(2.3)
 def pandas_udf(f=None, returnType=StringType()):
 """
-Creates a :class:`Column` expression representing a user defined 
function (UDF) that accepts
-`Pandas.Series` as input arguments and outputs a `Pandas.Series` of 
the same length.
+Creates a :class:`Column` expression representing a vectorized user 
defined function (UDF).
+
+The user-defined function can define one of the following 
transformations:
+1. One or more `pandas.Series` -> A `pandas.Series`
+
+   This udf is used with `DataFrame.withColumn` and `DataFrame.select`.
+   The returnType should be a primitive data type, e.g., DoubleType()
+
+   Example:
+
+   >>> from pyspark.sql.types import IntegerType, StringType
+   >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())
+   >>> @pandas_udf(returnType=StringType())
+   ... def to_upper(s):
+   ... return s.str.upper()
+   ...
+   >>> @pandas_udf(returnType="integer")
+   ... def add_one(x):
+   ... return x + 1
+   ...
+   >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", 
"name", "age"))
+   >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), 
add_one("age")) \\
+   ... .show()  # doctest: +SKIP
+   +----------+--------------+------------+
+   |slen(name)|to_upper(name)|add_one(age)|
+   +----------+--------------+------------+
+   |         8|      JOHN DOE|          22|
+   +----------+--------------+------------+
+
+2. A `pandas.DataFrame` -> A `pandas.DataFrame`
+
+   This udf is used with `GroupedData.apply`
+   The returnType should be a StructType describing the schema of the 
returned
+   `pandas.DataFrame`.
+
+   Example:
+
+   >>> df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 
4.0)], ("id", "v"))
+   >>> @pandas_udf(returnType=df.schema)
+   ... def normalize(df):
+   ... v = df.v
+   ... ret = df.assign(v=(v - v.mean()) / v.std())
--- End diff --

little nit:

```python
_ = df.assign(v=(v - v.mean()) / v.std())
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19181: [SPARK-21907][CORE] oom during spill

2017-10-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19181
  
**[Test build #82372 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82372/testReport)**
 for PR 19181 at commit 
[`e3da029`](https://github.com/apache/spark/commit/e3da02911ec67ba586b9d7db5a87251f39f06cb5).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19407: [SPARK-21667][Streaming] ConsoleSink should not fail str...

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19407
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82371/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19407: [SPARK-21667][Streaming] ConsoleSink should not fail str...

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19407
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19407: [SPARK-21667][Streaming] ConsoleSink should not fail str...

2017-10-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19407
  
**[Test build #82371 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82371/testReport)**
 for PR 19407 at commit 
[`44b6ce2`](https://github.com/apache/spark/commit/44b6ce23dfc32c19928533ab7d3f6916a4268562).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

2017-10-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19404
  
**[Test build #3940 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3940/testReport)**
 for PR 19404 at commit 
[`89cdb3b`](https://github.com/apache/spark/commit/89cdb3bdf70ec39a09b4e598935bb20a8f64f0cb).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19290: [SPARK-22063][R] Fixes lint check failures in R by lates...

2017-10-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19290
  
So, @felixcheung, @shaneknapp and @shivaram, looks like we have comments, 
https://github.com/apache/spark/pull/19290#issuecomment-21991 and 
https://github.com/apache/spark/pull/19290#issuecomment-22653, to discuss 
further.

The problem is that upgrading `lintr` to `jimhester/lintr@5431140` could 
break builds on other branches.

Should I instead try to install this for each build, for example by 
checking the environment variables, as a workaround? I checked those 
before: https://github.com/apache/spark/pull/17669#issue-222376466


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19290: [SPARK-22063][R] Fixes lint check failures in R b...

2017-10-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19290


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19290: [SPARK-22063][R] Fixes lint check failures in R by lates...

2017-10-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19290
  
Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19290: [SPARK-22063][R] Fixes lint check failures in R by lates...

2017-10-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19290
  
Let me merge this one first. This shouldn't cause any problem to the build 
system for now.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19402: [SPARK-22167][R][BUILD] sparkr packaging issue al...

2017-10-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19402#discussion_r142024980
  
--- Diff: core/pom.xml ---
@@ -499,7 +499,7 @@
   
 
 
-  ..${file.separator}R${file.separator}install-dev${script.extension}
+  ${project.basedir}${file.separator}..${file.separator}R${file.separator}install-dev${script.extension}
--- End diff --

Looks like this tab was inserted by mistake, BTW.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19229
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82369/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19229
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-01 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19406
  
The JIRA doesn't explain what this is meant to fix. What case does this 
help get more correct?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-10-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19229
  
**[Test build #82369 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82369/testReport)**
 for PR 19229 at commit 
[`1292ce0`](https://github.com/apache/spark/commit/1292ce01ccf24eb40638e748a9cb0b0dbabbb72c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

2017-10-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19404
  
**[Test build #3940 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3940/testReport)**
 for PR 19404 at commit 
[`89cdb3b`](https://github.com/apache/spark/commit/89cdb3bdf70ec39a09b4e598935bb20a8f64f0cb).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19401: [SPARK-22176][SQL] Fix overflow issue in Dataset....

2017-10-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19401#discussion_r142023186
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -237,7 +237,7 @@ class Dataset[T] private[sql](
*/
   private[sql] def showString(
   _numRows: Int, truncate: Int = 20, vertical: Boolean = false): 
String = {
-val numRows = _numRows.max(0)
+val numRows = _numRows.max(0).min(Int.MaxValue - 1)
--- End diff --

OK, but now you return one fewer row than expected when it's possible to 
return Int.MaxValue. Granted this is an extreme corner case, but that seems 
less compelling than just skipping the display of "more elements" in this case.
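
For context, a minimal sketch of the overflow this clamp guards against, 
assuming (as the discussion implies) that `showString` fetches `numRows + 1` 
rows to detect whether more rows exist:

```scala
// numRows + 1 wraps around when numRows == Int.MaxValue:
val numRows = Int.MaxValue
println(numRows + 1)   // -2147483648

// Clamping keeps the addition positive, at the cost of one row in this
// extreme corner case:
val clamped = numRows.max(0).min(Int.MaxValue - 1)
println(clamped + 1)   // 2147483647
```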


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19407: [SPARK-21667][Streaming] ConsoleSink should not f...

2017-10-01 Thread rekhajoshm
Github user rekhajoshm commented on a diff in the pull request:

https://github.com/apache/spark/pull/19407#discussion_r142023140
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala 
---
@@ -269,7 +269,7 @@ final class DataStreamWriter[T] private[sql](ds: Dataset[T]) {
     } else {
       val (useTempCheckpointLocation, recoverFromCheckpointLocation) =
         if (source == "console") {
-          (true, false)
+          (true, true)
--- End diff --

Good point @jaceklaskowski, updated. Thanks


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19407: [SPARK-21667][Streaming] ConsoleSink should not fail str...

2017-10-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19407
  
**[Test build #82371 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82371/testReport)**
 for PR 19407 at commit 
[`44b6ce2`](https://github.com/apache/spark/commit/44b6ce23dfc32c19928533ab7d3f6916a4268562).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19407: [SPARK-21667][Streaming] ConsoleSink should not fail str...

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19407
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82370/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19407: [SPARK-21667][Streaming] ConsoleSink should not fail str...

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19407
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19407: [SPARK-21667][Streaming] ConsoleSink should not fail str...

2017-10-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19407
  
**[Test build #82370 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82370/testReport)**
 for PR 19407 at commit 
[`57e0e26`](https://github.com/apache/spark/commit/57e0e26474b66afd3bd54be061a5982836e28792).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19407: [SPARK-21667][Streaming] ConsoleSink should not f...

2017-10-01 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/19407#discussion_r142022819
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala 
---
@@ -269,7 +269,7 @@ final class DataStreamWriter[T] private[sql](ds: Dataset[T]) {
     } else {
       val (useTempCheckpointLocation, recoverFromCheckpointLocation) =
         if (source == "console") {
-          (true, false)
+          (true, true)
--- End diff --

Is there any source that uses `recoverFromCheckpointLocation` disabled? 
What's the use case, if any?

Remove `recoverFromCheckpointLocation` here, as it's always `true`, and make 
that explicit.

The JIRA issue is to fix the exception, followed by cleaning up code that 
was only needed in the past.
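
For illustration only, a minimal sketch of the simplification being 
suggested (hypothetical; it assumes `recoverFromCheckpointLocation` is 
`true` on every code path after this change):

```scala
// Hypothetical cleanup: once recoverFromCheckpointLocation is always true,
// the tuple carries no information beyond its first element, so only the
// temp-checkpoint flag needs to be computed.
def useTempCheckpointLocation(source: String): Boolean =
  source == "console"
```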


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19407: [SPARK-21667][Streaming] ConsoleSink should not fail str...

2017-10-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19407
  
**[Test build #82370 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82370/testReport)**
 for PR 19407 at commit 
[`57e0e26`](https://github.com/apache/spark/commit/57e0e26474b66afd3bd54be061a5982836e28792).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19407: [SPARK-21667][Streaming] ConsoleSink should not f...

2017-10-01 Thread rekhajoshm
GitHub user rekhajoshm opened a pull request:

https://github.com/apache/spark/pull/19407

[SPARK-21667][Streaming] ConsoleSink should not fail streaming query with 
checkpointLocation option

## What changes were proposed in this pull request?
Fix to allow recovery on the console sink and avoid the checkpoint exception.

## How was this patch tested?
existing tests
manual tests [replicated the error and saw no checkpoint error after the fix]

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rekhajoshm/spark SPARK-21667

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19407.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19407


commit e3677c9fa9697e0d34f9df52442085a6a481c9e9
Author: Rekha Joshi 
Date:   2015-05-05T23:10:08Z

Merge pull request #1 from apache/master

Pulling functionality from apache spark

commit 106fd8eee8f6a6f7c67cfc64f57c1161f76d8f75
Author: Rekha Joshi 
Date:   2015-05-08T21:49:09Z

Merge pull request #2 from apache/master

pull latest from apache spark

commit 0be142d6becba7c09c6eba0b8ea1efe83d649e8c
Author: Rekha Joshi 
Date:   2015-06-22T00:08:08Z

Merge pull request #3 from apache/master

Pulling functionality from apache spark

commit 6c6ee12fd733e3f9902e10faf92ccb78211245e3
Author: Rekha Joshi 
Date:   2015-09-17T01:03:09Z

Merge pull request #4 from apache/master

Pulling functionality from apache spark

commit b123c601e459d1ad17511fd91dd304032154882a
Author: Rekha Joshi 
Date:   2015-11-25T18:50:32Z

Merge pull request #5 from apache/master

pull request from apache/master

commit c73c32aadd6066e631956923725a48d98a18777e
Author: Rekha Joshi 
Date:   2016-03-18T19:13:51Z

Merge pull request #6 from apache/master

pull latest from apache spark

commit 7dbf7320057978526635bed09dabc8cf8657a28a
Author: Rekha Joshi 
Date:   2016-04-05T20:26:40Z

Merge pull request #8 from apache/master

pull latest from apache spark

commit 5e9d71827f8e2e4d07027281b80e4e073e7fecd1
Author: Rekha Joshi 
Date:   2017-05-01T23:00:30Z

Merge pull request #9 from apache/master

Pull apache spark

commit 63d99b3ce5f222d7126133170a373591f0ac67dd
Author: Rekha Joshi 
Date:   2017-09-30T22:26:44Z

Merge pull request #10 from apache/master

pull latest apache spark

commit 57e0e26474b66afd3bd54be061a5982836e28792
Author: rjoshi2 
Date:   2017-10-01T06:57:12Z

[SPARK-21667][Streaming] ConsoleSink should not fail streaming query with 
checkpointLocation option




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



  1   2   >