[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/19083#discussion_r142059729 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala --- @@ -432,25 +432,31 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext { } test("zero moments") { -val input = Seq((1, 2)).toDF("a", "b") -checkAnswer( - input.agg(stddev('a), stddev_samp('a), stddev_pop('a), variance('a), -var_samp('a), var_pop('a), skewness('a), kurtosis('a)), - Row(Double.NaN, Double.NaN, 0.0, Double.NaN, Double.NaN, 0.0, -Double.NaN, Double.NaN)) +// In SPARK-21871, we added code to check the actual bytecode size of gen'd methods. If the size +// goes over `hugeMethodLimit`, Spark fails to compile the methods and the execution also fails +// in a test mode. So, we explicitly made this threshold higher here. +// This workaround can be removed if this issue fixed. +withSQLConf(SQLConf.CODEGEN_HUGE_METHOD_LIMIT.key -> "16000") { --- End diff -- sorry, my bad. I'll do so. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/19083#discussion_r142059351 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala --- @@ -432,25 +432,31 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext { } test("zero moments") { -val input = Seq((1, 2)).toDF("a", "b") -checkAnswer( - input.agg(stddev('a), stddev_samp('a), stddev_pop('a), variance('a), -var_samp('a), var_pop('a), skewness('a), kurtosis('a)), - Row(Double.NaN, Double.NaN, 0.0, Double.NaN, Double.NaN, 0.0, -Double.NaN, Double.NaN)) +// In SPARK-21871, we added code to check the actual bytecode size of gen'd methods. If the size +// goes over `hugeMethodLimit`, Spark fails to compile the methods and the execution also fails +// in a test mode. So, we explicitly made this threshold higher here. +// This workaround can be removed if this issue fixed. +withSQLConf(SQLConf.CODEGEN_HUGE_METHOD_LIMIT.key -> "16000") { --- End diff -- Do you think is it ok to silently fall back to non-codegen cases in tests? I though that the too-long function cases also should explicitly throws exceptions in tests by checking `!Utils.isTesting && sqlContext.conf.codegenFallback`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19405: [SPARK-22178] [SQL] Refresh Persistent Views by R...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19405#discussion_r142059110 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala --- @@ -474,13 +474,20 @@ class CatalogImpl(sparkSession: SparkSession) extends Catalog { */ override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) -// Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively. -// Non-temp tables: refresh the metadata cache. -sessionCatalog.refreshTable(tableIdent) - +val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent) // If this table is cached as an InMemoryRelation, drop the original --- End diff -- This comment should be moved to line 491. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/19083#discussion_r142058947 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -1092,13 +1082,30 @@ object CodeGenerator extends Logging { logInfo(s"\n${CodeFormatter.format(code, maxLines)}") throw new CompileException(msg, e.getLocation) } + +// Check if compiled code has a too large function +methodsToByteCodeSize.foreach { case (name, byteCodeSize) => + if (byteCodeSize > SQLConf.get.hugeMethodLimit) { +val clazzName = evaluator.getClazz.getSimpleName +val methodName = name.replace("$", "") +val msg = s"failed to compile: the size of $clazzName.$methodName was $byteCodeSize and " + + "this value went over the limit `spark.sql.codegen.hugeMethodLimit`" + + s"(${SQLConf.get.hugeMethodLimit}). To avoid this error, you can make this limit higher." +logError(msg) +val maxLines = SQLConf.get.loggingMaxLinesForCodegen +logInfo(s"\n${CodeFormatter.format(code, maxLines)}") --- End diff -- aha, ok. I'll drop. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19083#discussion_r142058941 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala --- @@ -432,25 +432,31 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext { } test("zero moments") { -val input = Seq((1, 2)).toDF("a", "b") -checkAnswer( - input.agg(stddev('a), stddev_samp('a), stddev_pop('a), variance('a), -var_samp('a), var_pop('a), skewness('a), kurtosis('a)), - Row(Double.NaN, Double.NaN, 0.0, Double.NaN, Double.NaN, 0.0, -Double.NaN, Double.NaN)) +// In SPARK-21871, we added code to check the actual bytecode size of gen'd methods. If the size +// goes over `hugeMethodLimit`, Spark fails to compile the methods and the execution also fails +// in a test mode. So, we explicitly made this threshold higher here. +// This workaround can be removed if this issue fixed. +withSQLConf(SQLConf.CODEGEN_HUGE_METHOD_LIMIT.key -> "16000") { --- End diff -- Once we can fall back to non-codegen execution later, we can revert those changes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19083#discussion_r142058817 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -1092,13 +1082,30 @@ object CodeGenerator extends Logging { logInfo(s"\n${CodeFormatter.format(code, maxLines)}") throw new CompileException(msg, e.getLocation) } + +// Check if compiled code has a too large function +methodsToByteCodeSize.foreach { case (name, byteCodeSize) => + if (byteCodeSize > SQLConf.get.hugeMethodLimit) { +val clazzName = evaluator.getClazz.getSimpleName +val methodName = name.replace("$", "") +val msg = s"failed to compile: the size of $clazzName.$methodName was $byteCodeSize and " + + "this value went over the limit `spark.sql.codegen.hugeMethodLimit`" + + s"(${SQLConf.get.hugeMethodLimit}). To avoid this error, you can make this limit higher." +logError(msg) +val maxLines = SQLConf.get.loggingMaxLinesForCodegen +logInfo(s"\n${CodeFormatter.format(code, maxLines)}") --- End diff -- We already have `logDebug` at line 1072. We don't need to log it again. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/19083#discussion_r142058550 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -1020,6 +1006,10 @@ abstract class CodeGenerator[InType <: AnyRef, OutType <: AnyRef] extends Loggin } object CodeGenerator extends Logging { + + // This is the value of HugeMethodLimit in the OpenJDK JVM settings + val DEFAULT_OPENJDK_JVM_HUGE_METHOD_LIMIT = 8000 --- End diff -- ok, I'll update --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/19083#discussion_r142058540 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -1092,13 +1082,30 @@ object CodeGenerator extends Logging { logInfo(s"\n${CodeFormatter.format(code, maxLines)}") throw new CompileException(msg, e.getLocation) } + +// Check if compiled code has a too large function +methodsToByteCodeSize.foreach { case (name, byteCodeSize) => + if (byteCodeSize > SQLConf.get.hugeMethodLimit) { +val clazzName = evaluator.getClazz.getSimpleName +val methodName = name.replace("$", "") +val msg = s"failed to compile: the size of $clazzName.$methodName was $byteCodeSize and " + + "this value went over the limit `spark.sql.codegen.hugeMethodLimit`" + + s"(${SQLConf.get.hugeMethodLimit}). To avoid this error, you can make this limit higher." +logError(msg) +val maxLines = SQLConf.get.loggingMaxLinesForCodegen +logInfo(s"\n${CodeFormatter.format(code, maxLines)}") --- End diff -- aha, I see. I also think this info is less meaning for users, but some meaningful for developers? So, how about changing to `logDebug`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19083#discussion_r142058489 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -1020,6 +1006,10 @@ abstract class CodeGenerator[InType <: AnyRef, OutType <: AnyRef] extends Loggin } object CodeGenerator extends Logging { + + // This is the value of HugeMethodLimit in the OpenJDK JVM settings + val DEFAULT_OPENJDK_JVM_HUGE_METHOD_LIMIT = 8000 --- End diff -- I think just `DEFAULT_JVM_HUGE_METHOD_LIMIT` is ok. The comment already explains it is from OpenJDK. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/19083#discussion_r142058408 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -1092,13 +1082,30 @@ object CodeGenerator extends Logging { logInfo(s"\n${CodeFormatter.format(code, maxLines)}") throw new CompileException(msg, e.getLocation) } + +// Check if compiled code has a too large function +methodsToByteCodeSize.foreach { case (name, byteCodeSize) => + if (byteCodeSize > SQLConf.get.hugeMethodLimit) { +val clazzName = evaluator.getClazz.getSimpleName +val methodName = name.replace("$", "") +val msg = s"failed to compile: the size of $clazzName.$methodName was $byteCodeSize and " + + "this value went over the limit `spark.sql.codegen.hugeMethodLimit`" + + s"(${SQLConf.get.hugeMethodLimit}). To avoid this error, you can make this limit higher." +logError(msg) +val maxLines = SQLConf.get.loggingMaxLinesForCodegen +logInfo(s"\n${CodeFormatter.format(code, maxLines)}") +throw new CompileException(msg, null) --- End diff -- yea, I'll do so. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19083#discussion_r142058372 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -1092,13 +1082,30 @@ object CodeGenerator extends Logging { logInfo(s"\n${CodeFormatter.format(code, maxLines)}") throw new CompileException(msg, e.getLocation) } + +// Check if compiled code has a too large function +methodsToByteCodeSize.foreach { case (name, byteCodeSize) => + if (byteCodeSize > SQLConf.get.hugeMethodLimit) { +val clazzName = evaluator.getClazz.getSimpleName +val methodName = name.replace("$", "") +val msg = s"failed to compile: the size of $clazzName.$methodName was $byteCodeSize and " + + "this value went over the limit `spark.sql.codegen.hugeMethodLimit`" + + s"(${SQLConf.get.hugeMethodLimit}). To avoid this error, you can make this limit higher." +logError(msg) +val maxLines = SQLConf.get.loggingMaxLinesForCodegen +logInfo(s"\n${CodeFormatter.format(code, maxLines)}") +throw new CompileException(msg, null) --- End diff -- We should turn back to non-codegen execution for too long function. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19083#discussion_r142058236 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -1092,13 +1082,30 @@ object CodeGenerator extends Logging { logInfo(s"\n${CodeFormatter.format(code, maxLines)}") throw new CompileException(msg, e.getLocation) } + +// Check if compiled code has a too large function +methodsToByteCodeSize.foreach { case (name, byteCodeSize) => + if (byteCodeSize > SQLConf.get.hugeMethodLimit) { +val clazzName = evaluator.getClazz.getSimpleName +val methodName = name.replace("$", "") +val msg = s"failed to compile: the size of $clazzName.$methodName was $byteCodeSize and " + + "this value went over the limit `spark.sql.codegen.hugeMethodLimit`" + + s"(${SQLConf.get.hugeMethodLimit}). To avoid this error, you can make this limit higher." +logError(msg) +val maxLines = SQLConf.get.loggingMaxLinesForCodegen +logInfo(s"\n${CodeFormatter.format(code, maxLines)}") --- End diff -- Is it useful to log the gen'd code for this case? The code gets compiled so it might not helpful to print out the code. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142057939 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala --- @@ -44,14 +66,24 @@ case class ArrowEvalPythonExec(udfs: Seq[PythonUDF], output: Seq[Attribute], chi val schemaOut = StructType.fromAttributes(output.drop(child.output.length).zipWithIndex .map { case (attr, i) => attr.withName(s"_$i") }) +val batchSize = conf.arrowMaxRecordsPerBatch + +val batchIter = if (batchSize > 0) { + new BatchIterator(iter, batchSize) +} else if (batchSize == 0) { + Iterator(iter) +} else { + throw new IllegalArgumentException(s"MaxRecordsPerBatch must be >= 0, but is $batchSize") --- End diff -- We should handle negative value of `conf.arrowMaxRecordsPerBatch` as the same as `0` according to the doc of the conf: > When using Apache Arrow, limit the maximum number of records that can be written to a single ArrowRecordBatch in memory. If set to zero or negative there is no limit. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17702 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17702 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82378/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17702 **[Test build #82378 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82378/testReport)** for PR 17702 at commit [`d019656`](https://github.com/apache/spark/commit/d019656971069ebace05757dee56c214d143245c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142055482 --- Diff: python/pyspark/sql/functions.py --- @@ -2181,31 +2186,69 @@ def udf(f=None, returnType=StringType()): @since(2.3) def pandas_udf(f=None, returnType=StringType()): """ -Creates a :class:`Column` expression representing a user defined function (UDF) that accepts -`Pandas.Series` as input arguments and outputs a `Pandas.Series` of the same length. +Creates a :class:`Column` expression representing a vectorized user defined function (UDF). + +The user-defined function can define one of the following transformations: +1. One or more `pandas.Series` -> A `pandas.Series` + + This udf is used with `DataFrame.withColumn` and `DataFrame.select`. + The returnType should be a primitive data type, e.g., DoubleType() + + Example: + + >>> from pyspark.sql.types import IntegerType, StringType + >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) + >>> @pandas_udf(returnType=StringType()) + ... def to_upper(s): + ... return s.str.upper() + ... + >>> @pandas_udf(returnType="integer") + ... def add_one(x): + ... return x + 1 + ... + >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age")) + >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\ + ... .show() # doctest: +SKIP + +--+--++ + |slen(name)|to_upper(name)|add_one(age)| + +--+--++ + | 8| JOHN DOE| 22| + +--+--++ + +2. A `pandas.DataFrame` -> A `pandas.DataFrame` + + This udf is used with `GroupedData.apply` + The returnType should be a StructType describing the schema of the returned + `pandas.DataFrame`. + + Example: + + >>> df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 4.0)], ("id", "v")) + >>> @pandas_udf(returnType=df.schema) + ... def normalize(df): + ... v = df.v + ... ret = df.assign(v=(v - v.mean()) / v.std()) --- End diff -- We need `return ret`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142055226 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInPandasExec.scala --- @@ -0,0 +1,95 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.python + +import scala.collection.JavaConverters._ + +import org.apache.spark.TaskContext +import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, AttributeSet, Expression, NamedExpression, SortOrder, UnsafeProjection} +import org.apache.spark.sql.catalyst.plans.physical.{ClusteredDistribution, Distribution, Partitioning} +import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode} + +case class FlatMapGroupsInPandasExec( +grouping: Seq[Expression], +func: Expression, +override val output: Seq[Attribute], +override val child: SparkPlan +) extends UnaryExecNode { + + val groupingAttributes: Seq[Attribute] = grouping.map { +case ne: NamedExpression => ne.toAttribute + } + + private val pandasFunction = func.asInstanceOf[PythonUDF].func + + override def outputPartitioning: Partitioning = child.outputPartitioning + + override def producedAttributes: AttributeSet = AttributeSet(output) + + override def requiredChildDistribution: Seq[Distribution] = +ClusteredDistribution(groupingAttributes) :: Nil + + override def requiredChildOrdering: Seq[Seq[SortOrder]] = +Seq(groupingAttributes.map(SortOrder(_, Ascending))) + + override protected def doExecute(): RDD[InternalRow] = { +val inputRDD = child.execute() + +val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536) +val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true) +val chainedFunc = Seq(ChainedPythonFunctions(Seq(pandasFunction))) +val argOffsets = Array((0 until child.schema.length).toArray) + +inputRDD.mapPartitionsInternal { iter => + val grouped = GroupedIterator(iter, groupingAttributes, child.output) --- End diff -- I was thinking that the implementation on that time doesn't support grouping like: ```python df.groupby(col('id') % 2 == 0).apply(...) ``` but the change I supposed doesn't work either. The current implementation seems to not support the grouping above, though. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19392: [SPARK-22169][SQL] table name with numbers and ch...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19392#discussion_r142054949 --- Diff: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 --- @@ -510,11 +510,15 @@ rowFormat ; tableIdentifier -: (db=identifier '.')? table=identifier +: (db=identifierPart '.')? table=identifierPart ; functionIdentifier -: (db=identifier '.')? function=identifier +: (db=identifierPart '.')? function=identifierPart +; + +identifierPart +: identifier | BIGINT_LITERAL | SMALLINT_LITERAL | TINYINT_LITERAL | BYTELENGTH_LITERAL --- End diff -- `INTEGER_VALUE`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18931 ping @cloud-fan @gatorsmile Please take a look for review. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/19083#discussion_r142053683 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -1092,13 +1082,30 @@ object CodeGenerator extends Logging { logInfo(s"\n${CodeFormatter.format(code, maxLines)}") throw new CompileException(msg, e.getLocation) } + +// Check if compiled code has a too large function +methodsToByteCodeSize.foreach { case (name, byteCodeSize) => + if (byteCodeSize > SQLConf.get.hugeMethodLimit) { +val clazzName = evaluator.getClazz.getSimpleName +val methodName = name.replace("$", "") +val msg = s"failed to compile: the size of $clazzName.$methodName was $byteCodeSize and " + + "this value went over the limit `spark.sql.codegen.hugeMethodLimit`" + + s"(${SQLConf.get.hugeMethodLimit}). To avoid this error, you can make this limit higher." +logError(msg) +val maxLines = SQLConf.get.loggingMaxLinesForCodegen +logInfo(s"\n${CodeFormatter.format(code, maxLines)}") +throw new CompileException(msg, null) --- End diff -- ok, I try to refactor this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19083#discussion_r142053181 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -1092,13 +1082,30 @@ object CodeGenerator extends Logging { logInfo(s"\n${CodeFormatter.format(code, maxLines)}") throw new CompileException(msg, e.getLocation) } + +// Check if compiled code has a too large function +methodsToByteCodeSize.foreach { case (name, byteCodeSize) => + if (byteCodeSize > SQLConf.get.hugeMethodLimit) { +val clazzName = evaluator.getClazz.getSimpleName +val methodName = name.replace("$", "") +val msg = s"failed to compile: the size of $clazzName.$methodName was $byteCodeSize and " + + "this value went over the limit `spark.sql.codegen.hugeMethodLimit`" + + s"(${SQLConf.get.hugeMethodLimit}). To avoid this error, you can make this limit higher." +logError(msg) +val maxLines = SQLConf.get.loggingMaxLinesForCodegen +logInfo(s"\n${CodeFormatter.format(code, maxLines)}") +throw new CompileException(msg, null) --- End diff -- Previously, it works although it does not perform well. Now, after this change, the query does not work. This could causes regressions. I think we should make a change to automatically fall back to the slow mode. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/18732 @HyukjinKwon Thanks for the feedback. I will address those and update tomorrow. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/18732 @rxin, `transform` takes a function: pd.Series -> pd.Series and apply the function on all columns: ``` df.show() id v1 v2 v3 a 1.0 4.0 0.0 a 2.0 5.0 1.0 a 3.0 6.0 1.0 df.groupby('id').transform(pandas_udf(lambda v: v - v.mean(), DoubleType())).show() id v1 v2v3 a -1.0 -1.0-0.67 a 0.0 0.0 0.33 a 1.0 1.0 0.33 ``` This is mimicking `pd.DataFrame.groupby().transform` `apply` takes a function: pd.DataFrame -> pd.DataFrame and is similar to `flatMapGroups` The name `apply` is originated from the R paper "The Split-Apply-Combine Strategy for Data Analysis" and is used in both pandas and R to describe this function, so the name `apply` should be pretty straight forward to pandas/python user. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19406 @srowen For example, given input data 1 to 10, if a user queries 10% (or even less) percentile, it should return 1, because the first value 1 already reaches 10% percentage. Without this fix, it returned 2. I've updated description by adding this example in JIRA and PR, thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19083: [SPARK-21871][SQL] Check actual bytecode size when compi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19083 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82375/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19083: [SPARK-21871][SQL] Check actual bytecode size when compi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19083 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19083: [SPARK-21871][SQL] Check actual bytecode size when compi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19083 **[Test build #82375 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82375/testReport)** for PR 19083 at commit [`1fa7c1c`](https://github.com/apache/spark/commit/1fa7c1cac9e4711818df6957260acb751b6dd8b2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17862 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17862 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82377/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17862 **[Test build #82377 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82377/testReport)** for PR 17862 at commit [`1f8e984`](https://github.com/apache/spark/commit/1f8e98411401add53dddbb13e94f450c27b7c0fd). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17702 **[Test build #82378 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82378/testReport)** for PR 17702 at commit [`d019656`](https://github.com/apache/spark/commit/d019656971069ebace05757dee56c214d143245c). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/17702 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19408: [SPARK-22180][CORE] Allow IPv6 address in org.apa...
Github user obermeier commented on a diff in the pull request: https://github.com/apache/spark/pull/19408#discussion_r142047496 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -981,7 +981,13 @@ private[spark] object Utils extends Logging { return cached } -val indx: Int = hostPort.lastIndexOf(':') +val indx: Int = + // Interpret hostPort as literal IPv6 address if it contains two ore more colons + // scalastyle:off SingleSpaceBetweenRParenAndLCurlyBrace + if (hostPort.matches("(([0-9a-f]*):([0-9a-f]*)){2,}")) -1 --- End diff -- Yes a real parser would be much better!! I hope the methods will check the input. Like the name resolver... At this point I thought more about the separation of the port. I think it is important to check if two colons exists, otherwise this expression accepts hostnames like abc:123 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19222 @hvanhovell could you please review this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18704: [SPARK-20783][SQL] Create ColumnVector to abstract exist...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/18704 @cloud-fan could you please review this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17862 **[Test build #82377 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82377/testReport)** for PR 17862 at commit [`1f8e984`](https://github.com/apache/spark/commit/1f8e98411401add53dddbb13e94f450c27b7c0fd). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17862 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17862 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82376/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17862 **[Test build #82376 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82376/testReport)** for PR 17862 at commit [`bf4d955`](https://github.com/apache/spark/commit/bf4d955082ce94028965eb034efab74f6ff8d0b8). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17862 **[Test build #82376 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82376/testReport)** for PR 17862 at commit [`bf4d955`](https://github.com/apache/spark/commit/bf4d955082ce94028965eb034efab74f6ff8d0b8). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19083: [SPARK-21871][SQL] Check actual bytecode size when compi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19083 **[Test build #82375 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82375/testReport)** for PR 19083 at commit [`1fa7c1c`](https://github.com/apache/spark/commit/1fa7c1cac9e4711818df6957260acb751b6dd8b2). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17359: [SPARK-20028][SQL] Add aggreagate expression nGrams
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17359 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19403: [R][BUILD][WIP] test
Github user felixcheung closed the pull request at: https://github.com/apache/spark/pull/19403 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19408: [SPARK-22180][CORE] Allow IPv6 address in org.apa...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19408#discussion_r142038575 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -981,7 +981,13 @@ private[spark] object Utils extends Logging { return cached } -val indx: Int = hostPort.lastIndexOf(':') +val indx: Int = + // Interpret hostPort as literal IPv6 address if it contains two ore more colons + // scalastyle:off SingleSpaceBetweenRParenAndLCurlyBrace + if (hostPort.matches("(([0-9a-f]*):([0-9a-f]*)){2,}")) -1 + // Else last colon defines start of port definition + else hostPort.lastIndexOf(':') --- End diff -- Final nit, use braces for both parts of the if-else --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19408: [SPARK-22180][CORE] Allow IPv6 address in org.apa...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19408#discussion_r142038567 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -981,7 +981,13 @@ private[spark] object Utils extends Logging { return cached } -val indx: Int = hostPort.lastIndexOf(':') +val indx: Int = + // Interpret hostPort as literal IPv6 address if it contains two ore more colons + // scalastyle:off SingleSpaceBetweenRParenAndLCurlyBrace + if (hostPort.matches("(([0-9a-f]*):([0-9a-f]*)){2,}")) -1 --- End diff -- Hex digits can be uppercase right? Should the pattern it not be more like `[0-9a-f]*(:[0-9a-f]*)+` match a number, then colon-number colon-number pairs, not number-colon-number number-colon-number sequences? It might end up being equivalent because the match is for 0 or more digits. This allows some strings that it shouldn't like "", but, the purpose isn't to catch every possible case I guess. It would fail name resolution. I thought Inet6Address would just provide parsing for this but I guess not. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19408: [SPARK-22180][CORE] Allow IPv6 address in org.apa...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19408#discussion_r142038556 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -981,7 +981,13 @@ private[spark] object Utils extends Logging { return cached } -val indx: Int = hostPort.lastIndexOf(':') +val indx: Int = + // Interpret hostPort as literal IPv6 address if it contains two ore more colons + // scalastyle:off SingleSpaceBetweenRParenAndLCurlyBrace --- End diff -- Turn the rule back on after the block? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19408: [SPARK-22180][CORE] Allow IPv6 address in org.apa...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19408#discussion_r142038519 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -981,7 +981,13 @@ private[spark] object Utils extends Logging { return cached } -val indx: Int = hostPort.lastIndexOf(':') +val indx: Int = + // Interpret hostPort as literal IPv6 address if it contains two ore more colons --- End diff -- Nit: ore You might note here that you're checking that you _don't_ have a `[::1]:123` IPv6 address here -- the braces are key. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19083#discussion_r142037322 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala --- @@ -301,10 +301,10 @@ class AggregateBenchmark extends BenchmarkBase { */ } - ignore("max function length of wholestagecodegen") { + test("max function length of codegen") { --- End diff -- This should be "ignore" --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19083: [SPARK-21871][SQL] Check actual bytecode size when compi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19083 Build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19083: [SPARK-21871][SQL] Check actual bytecode size when compi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19083 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82373/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19083: [SPARK-21871][SQL] Check actual bytecode size when compi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19083 **[Test build #82373 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82373/testReport)** for PR 19083 at commit [`8744566`](https://github.com/apache/spark/commit/874456648732309b35cefe36d45783eed77ee7b0). * This patch **fails Spark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19229: [SPARK-22001][ML][SQL] ImputerModel can do withCo...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19229 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19405: [SPARK-22178] [SQL] Refresh Persistent Views by REFRESH ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19405 cc @cloud-fan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19229 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19401: [SPARK-22176][SQL] Fix overflow issue in Dataset....
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19401#discussion_r142035876 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -237,7 +237,7 @@ class Dataset[T] private[sql]( */ private[sql] def showString( _numRows: Int, truncate: Int = 20, vertical: Boolean = false): String = { -val numRows = _numRows.max(0) +val numRows = _numRows.max(0).min(Int.MaxValue - 1) --- End diff -- Spark SQL does not work when the number of rows is close to `Int.MaxValue`. The driver will be OOM before finishing the command. Thus, I do not think we can hit this extreme case. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19083: [SPARK-21871][SQL] Check actual bytecode size when compi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19083 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19083: [SPARK-21871][SQL] Check actual bytecode size when compi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19083 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82374/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19083: [SPARK-21871][SQL] Check actual bytecode size when compi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19083 **[Test build #82374 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82374/testReport)** for PR 19083 at commit [`08d31f3`](https://github.com/apache/spark/commit/08d31f3920edf32353298b5067e083c2643917db). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19083: [SPARK-21871][SQL] Check actual bytecode size when compi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19083 **[Test build #82374 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82374/testReport)** for PR 19083 at commit [`08d31f3`](https://github.com/apache/spark/commit/08d31f3920edf32353298b5067e083c2643917db). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19083: [SPARK-21871][SQL] Check actual bytecode size when compi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19083 **[Test build #82373 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82373/testReport)** for PR 19083 at commit [`8744566`](https://github.com/apache/spark/commit/874456648732309b35cefe36d45783eed77ee7b0). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19181: [SPARK-21907][CORE] oom during spill
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19181 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82372/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19181: [SPARK-21907][CORE] oom during spill
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19181 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19181: [SPARK-21907][CORE] oom during spill
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19181 **[Test build #82372 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82372/testReport)** for PR 19181 at commit [`e3da029`](https://github.com/apache/spark/commit/e3da02911ec67ba586b9d7db5a87251f39f06cb5). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19408: [SPARK-22180][CORE] Allow IPv6
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19408 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19408: [SPARK-22180][CORE] Allow IPv6
GitHub user obermeier opened a pull request: https://github.com/apache/spark/pull/19408 [SPARK-22180][CORE] Allow IPv6 External applications like Apache Cassandra are able to deal with IPv6 addresses. Libraries like spark-cassandra-connector combine Apache Cassandra with Apache Spark. This combination is very useful IMHO. One problem is that `org.apache.spark.util.Utils.parseHostPort(hostPort:` `String)` takes the last colon to sepperate the port from host path. This conflicts with literal IPv6 addresses. I think we can take `hostPort` as literal IPv6 address if it contains tow ore more colons. If IPv6 addresses are enclosed in square brackets port definition is still possible. You can merge this pull request into a Git repository by running: $ git pull https://github.com/obermeier/spark issue/SPARK-22180 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19408.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19408 commit 3783f85b18540ed8746c078ad9c2f12d7167be9d Author: Stefan ObermeierDate: 2017-10-01T14:28:58Z [SPARK-22180][CORE] Allow IPv6 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19399: [SPARK-22175][WEB-UI] Add status column to history page
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/19399 @jerryshao @squito Could you help review this?Thanks --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19401: [SPARK-22176][SQL] Fix overflow issue in Dataset....
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/19401#discussion_r142031417 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -237,7 +237,7 @@ class Dataset[T] private[sql]( */ private[sql] def showString( _numRows: Int, truncate: Int = 20, vertical: Boolean = false): String = { -val numRows = _numRows.max(0) +val numRows = _numRows.max(0).min(Int.MaxValue - 1) --- End diff -- hmm, I see. Both is okay to me and WDYT? cc: @gatorsmile IMHO it might be still okay to set `[0, Int.MaxValue)` as valid range for `show` cuz this is a super extreme corner case. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142029595 --- Diff: python/pyspark/sql/functions.py --- @@ -2181,31 +2186,69 @@ def udf(f=None, returnType=StringType()): @since(2.3) def pandas_udf(f=None, returnType=StringType()): """ -Creates a :class:`Column` expression representing a user defined function (UDF) that accepts -`Pandas.Series` as input arguments and outputs a `Pandas.Series` of the same length. +Creates a :class:`Column` expression representing a vectorized user defined function (UDF). + +The user-defined function can define one of the following transformations: +1. One or more `pandas.Series` -> A `pandas.Series` --- End diff -- Looks we should leave a newline for python doc. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142029720 --- Diff: python/pyspark/sql/functions.py --- @@ -2181,31 +2186,69 @@ def udf(f=None, returnType=StringType()): @since(2.3) def pandas_udf(f=None, returnType=StringType()): """ -Creates a :class:`Column` expression representing a user defined function (UDF) that accepts -`Pandas.Series` as input arguments and outputs a `Pandas.Series` of the same length. +Creates a :class:`Column` expression representing a vectorized user defined function (UDF). + +The user-defined function can define one of the following transformations: +1. One or more `pandas.Series` -> A `pandas.Series` + + This udf is used with `DataFrame.withColumn` and `DataFrame.select`. + The returnType should be a primitive data type, e.g., DoubleType() --- End diff -- little nit: `` `DoubleType()` `` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142029736 --- Diff: python/pyspark/sql/group.py --- @@ -194,6 +194,37 @@ def pivot(self, pivot_col, values=None): jgd = self._jgd.pivot(pivot_col, values) return GroupedData(jgd, self.sql_ctx) +def apply(self, udf_obj): +""" +Maps each group of the current [[DataFrame]] using a pandas udf and returns the result --- End diff -- little nit: `[[DataFrame]]` to ``:class:`DataFrame` `` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142029696 --- Diff: python/pyspark/sql/functions.py --- @@ -2181,31 +2186,69 @@ def udf(f=None, returnType=StringType()): @since(2.3) def pandas_udf(f=None, returnType=StringType()): """ -Creates a :class:`Column` expression representing a user defined function (UDF) that accepts -`Pandas.Series` as input arguments and outputs a `Pandas.Series` of the same length. +Creates a :class:`Column` expression representing a vectorized user defined function (UDF). + +The user-defined function can define one of the following transformations: +1. One or more `pandas.Series` -> A `pandas.Series` + + This udf is used with `DataFrame.withColumn` and `DataFrame.select`. + The returnType should be a primitive data type, e.g., DoubleType() + + Example: + + >>> from pyspark.sql.types import IntegerType, StringType + >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) + >>> @pandas_udf(returnType=StringType()) + ... def to_upper(s): + ... return s.str.upper() + ... + >>> @pandas_udf(returnType="integer") + ... def add_one(x): + ... return x + 1 + ... + >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age")) + >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\ + ... .show() # doctest: +SKIP + +--+--++ + |slen(name)|to_upper(name)|add_one(age)| + +--+--++ + | 8| JOHN DOE| 22| + +--+--++ + +2. A `pandas.DataFrame` -> A `pandas.DataFrame` --- End diff -- This looks producing an warning here: ``` spark/python/pyspark/sql/functions.py:docstring of pyspark.sql.functions.pandas_udf:46: WARNING: Enumerated list ends without a blank line; unexpected unindent. ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142029802 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInPandasExec.scala --- @@ -0,0 +1,91 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.python + +import scala.collection.JavaConverters._ + +import org.apache.spark.TaskContext +import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, AttributeSet, Expression, NamedExpression, SortOrder, UnsafeProjection} +import org.apache.spark.sql.catalyst.plans.physical.{ClusteredDistribution, Distribution, Partitioning} +import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode} + +case class FlatMapGroupsInPandasExec( +groupingAttributes: Seq[Attribute], +func: Expression, +output: Seq[Attribute], +child: SparkPlan) + extends UnaryExecNode { + + private val pandasFunction = func.asInstanceOf[PythonUDF].func + + override def outputPartitioning: Partitioning = child.outputPartitioning + + override def producedAttributes: AttributeSet = AttributeSet(output) + + override def requiredChildDistribution: Seq[Distribution] = +ClusteredDistribution(groupingAttributes) :: Nil + + override def requiredChildOrdering: Seq[Seq[SortOrder]] = +Seq(groupingAttributes.map(SortOrder(_, Ascending))) + + override protected def doExecute(): RDD[InternalRow] = { +val inputRDD = child.execute() + +val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536) +val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true) +val chainedFunc = Seq(ChainedPythonFunctions(Seq(pandasFunction))) +val argOffsets = Array((0 until child.schema.length).toArray) + +inputRDD.mapPartitionsInternal { iter => + val grouped = GroupedIterator(iter, groupingAttributes, child.output) + val context = TaskContext.get() + + val columnarBatchIter = new ArrowPythonRunner( +chainedFunc, bufferSize, reuseWorker, +PythonEvalType.SQL_PANDAS_UDF, argOffsets, child.schema) +.compute(grouped.map(_._2), context.partitionId(), context) + + val vectorRowIter = new Iterator[InternalRow] { +private var currentIter = if (columnarBatchIter.hasNext) { + val batch = columnarBatchIter.next() + // assert(schemaOut.equals(batch.schema), + // s"Invalid schema from pandas_udf: expected $schemaOut, got ${batch.schema}") --- End diff -- Let's avoid commented codes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142029714 --- Diff: python/pyspark/sql/functions.py --- @@ -2181,31 +2186,69 @@ def udf(f=None, returnType=StringType()): @since(2.3) def pandas_udf(f=None, returnType=StringType()): """ -Creates a :class:`Column` expression representing a user defined function (UDF) that accepts -`Pandas.Series` as input arguments and outputs a `Pandas.Series` of the same length. +Creates a :class:`Column` expression representing a vectorized user defined function (UDF). + +The user-defined function can define one of the following transformations: +1. One or more `pandas.Series` -> A `pandas.Series` + + This udf is used with `DataFrame.withColumn` and `DataFrame.select`. + The returnType should be a primitive data type, e.g., DoubleType() + + Example: + + >>> from pyspark.sql.types import IntegerType, StringType + >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) + >>> @pandas_udf(returnType=StringType()) + ... def to_upper(s): + ... return s.str.upper() + ... + >>> @pandas_udf(returnType="integer") + ... def add_one(x): + ... return x + 1 + ... + >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age")) + >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\ + ... .show() # doctest: +SKIP + +--+--++ + |slen(name)|to_upper(name)|add_one(age)| + +--+--++ + | 8| JOHN DOE| 22| + +--+--++ + +2. A `pandas.DataFrame` -> A `pandas.DataFrame` + + This udf is used with `GroupedData.apply` + The returnType should be a StructType describing the schema of the returned + `pandas.DataFrame`. + + Example: + + >>> df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 4.0)], ("id", "v")) + >>> @pandas_udf(returnType=df.schema) + ... def normalize(df): + ... v = df.v + ... ret = df.assign(v=(v - v.mean()) / v.std()) + >>> df.groupby('id').apply(normalize).show() # doctest: + SKIP + +---+---+ + | id| v| + +---+---+ + | 1|-0.7071067811865475| + | 1| 0.7071067811865475| + | 2|-0.7071067811865475| + | 2| 0.7071067811865475| + +---+---+ --- End diff -- This produces Python doc output as below: https://user-images.githubusercontent.com/6477701/31054996-b6f5737a-a6f8-11e7-9559-239ab74b1cb4.png;> Sounds related with: ``` spark/python/pyspark/sql/functions.py:docstring of pyspark.sql.functions.pandas_udf:46: WARNING: Enumerated list ends without a blank line; unexpected unindent. ``` warning above. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142029786 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInPandasExec.scala --- @@ -0,0 +1,91 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.python + +import scala.collection.JavaConverters._ + +import org.apache.spark.TaskContext +import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, AttributeSet, Expression, NamedExpression, SortOrder, UnsafeProjection} --- End diff -- FYI, we could use just wild cardimport if it exceed 6 instances - https://github.com/databricks/scala-style-guide#imports --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142029655 --- Diff: python/pyspark/sql/functions.py --- @@ -2181,31 +2186,69 @@ def udf(f=None, returnType=StringType()): @since(2.3) def pandas_udf(f=None, returnType=StringType()): """ -Creates a :class:`Column` expression representing a user defined function (UDF) that accepts -`Pandas.Series` as input arguments and outputs a `Pandas.Series` of the same length. +Creates a :class:`Column` expression representing a vectorized user defined function (UDF). + +The user-defined function can define one of the following transformations: +1. One or more `pandas.Series` -> A `pandas.Series` + + This udf is used with `DataFrame.withColumn` and `DataFrame.select`. + The returnType should be a primitive data type, e.g., DoubleType() + + Example: + + >>> from pyspark.sql.types import IntegerType, StringType + >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) + >>> @pandas_udf(returnType=StringType()) + ... def to_upper(s): + ... return s.str.upper() + ... + >>> @pandas_udf(returnType="integer") + ... def add_one(x): + ... return x + 1 + ... + >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age")) + >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\ + ... .show() # doctest: +SKIP + +--+--++ + |slen(name)|to_upper(name)|add_one(age)| + +--+--++ + | 8| JOHN DOE| 22| + +--+--++ + +2. A `pandas.DataFrame` -> A `pandas.DataFrame` + + This udf is used with `GroupedData.apply` + The returnType should be a StructType describing the schema of the returned + `pandas.DataFrame`. + + Example: + + >>> df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 4.0)], ("id", "v")) + >>> @pandas_udf(returnType=df.schema) + ... def normalize(df): + ... v = df.v + ... ret = df.assign(v=(v - v.mean()) / v.std()) --- End diff -- little nit: ```python _ = df.assign(v=(v - v.mean()) / v.std()) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19181: [SPARK-21907][CORE] oom during spill
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19181 **[Test build #82372 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82372/testReport)** for PR 19181 at commit [`e3da029`](https://github.com/apache/spark/commit/e3da02911ec67ba586b9d7db5a87251f39f06cb5). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19407: [SPARK-21667][Streaming] ConsoleSink should not fail str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19407 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82371/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19407: [SPARK-21667][Streaming] ConsoleSink should not fail str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19407 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19407: [SPARK-21667][Streaming] ConsoleSink should not fail str...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19407 **[Test build #82371 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82371/testReport)** for PR 19407 at commit [`44b6ce2`](https://github.com/apache/spark/commit/44b6ce23dfc32c19928533ab7d3f6916a4268562). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19404 **[Test build #3940 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3940/testReport)** for PR 19404 at commit [`89cdb3b`](https://github.com/apache/spark/commit/89cdb3bdf70ec39a09b4e598935bb20a8f64f0cb). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19290: [SPARK-22063][R] Fixes lint check failures in R by lates...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19290 So, @felixcheung, @shaneknapp and @shivaram, looks we have comments, https://github.com/apache/spark/pull/19290#issuecomment-21991 and https://github.com/apache/spark/pull/19290#issuecomment-22653, to discuss further. The problem looks upgrading `lintr` to `jimhester/lintr@5431140` could break other builds in other branches. Should I rather try to install this for each build, for example, via checking the environment variables as a workaround? I checked those before - https://github.com/apache/spark/pull/17669#issue-222376466 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19290: [SPARK-22063][R] Fixes lint check failures in R b...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19290 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19290: [SPARK-22063][R] Fixes lint check failures in R by lates...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19290 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19290: [SPARK-22063][R] Fixes lint check failures in R by lates...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19290 Let me merge this one first. This shouldn't cause any problem to built system for now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19402: [SPARK-22167][R][BUILD] sparkr packaging issue al...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19402#discussion_r142024980 --- Diff: core/pom.xml --- @@ -499,7 +499,7 @@ - ..${file.separator}R${file.separator}install-dev${script.extension} + ${project.basedir}${file.separator}..${file.separator}R${file.separator}install-dev${script.extension} --- End diff -- Looks this tab is inserted mistakenly BTW. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19229 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82369/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19229 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19406 The JIRA doesn't explain what this is meant to fix. What case does this help get more correct? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19229 **[Test build #82369 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82369/testReport)** for PR 19229 at commit [`1292ce0`](https://github.com/apache/spark/commit/1292ce01ccf24eb40638e748a9cb0b0dbabbb72c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19404 **[Test build #3940 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3940/testReport)** for PR 19404 at commit [`89cdb3b`](https://github.com/apache/spark/commit/89cdb3bdf70ec39a09b4e598935bb20a8f64f0cb). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19401: [SPARK-22176][SQL] Fix overflow issue in Dataset....
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19401#discussion_r142023186 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -237,7 +237,7 @@ class Dataset[T] private[sql]( */ private[sql] def showString( _numRows: Int, truncate: Int = 20, vertical: Boolean = false): String = { -val numRows = _numRows.max(0) +val numRows = _numRows.max(0).min(Int.MaxValue - 1) --- End diff -- OK, but now you return one fewer row than expected when it's possible to return Int.MaxValue. Granted this is an extreme corner case, but that seems less compelling than just skipping the display of "more elements" in this case. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19407: [SPARK-21667][Streaming] ConsoleSink should not f...
Github user rekhajoshm commented on a diff in the pull request: https://github.com/apache/spark/pull/19407#discussion_r142023140 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala --- @@ -269,7 +269,7 @@ final class DataStreamWriter[T] private[sql](ds: Dataset[T]) { } else { val (useTempCheckpointLocation, recoverFromCheckpointLocation) = if (source == "console") { - (true, false) + (true, true) --- End diff -- Good point @jaceklaskowski updated.thanks --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19407: [SPARK-21667][Streaming] ConsoleSink should not fail str...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19407 **[Test build #82371 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82371/testReport)** for PR 19407 at commit [`44b6ce2`](https://github.com/apache/spark/commit/44b6ce23dfc32c19928533ab7d3f6916a4268562). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19407: [SPARK-21667][Streaming] ConsoleSink should not fail str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19407 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82370/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19407: [SPARK-21667][Streaming] ConsoleSink should not fail str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19407 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19407: [SPARK-21667][Streaming] ConsoleSink should not fail str...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19407 **[Test build #82370 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82370/testReport)** for PR 19407 at commit [`57e0e26`](https://github.com/apache/spark/commit/57e0e26474b66afd3bd54be061a5982836e28792). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19407: [SPARK-21667][Streaming] ConsoleSink should not f...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/19407#discussion_r142022819 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala --- @@ -269,7 +269,7 @@ final class DataStreamWriter[T] private[sql](ds: Dataset[T]) { } else { val (useTempCheckpointLocation, recoverFromCheckpointLocation) = if (source == "console") { - (true, false) + (true, true) --- End diff -- Is there any source that uses `recoverFromCheckpointLocation` disabled? What's the use case if any? Remove `recoverFromCheckpointLocation` here as it's always `true` and make it explicit. The JIRA issue is to fix the exception followed by cleaning the code that was needed in the past. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19407: [SPARK-21667][Streaming] ConsoleSink should not fail str...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19407 **[Test build #82370 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82370/testReport)** for PR 19407 at commit [`57e0e26`](https://github.com/apache/spark/commit/57e0e26474b66afd3bd54be061a5982836e28792). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19407: [SPARK-21667][Streaming] ConsoleSink should not f...
GitHub user rekhajoshm opened a pull request: https://github.com/apache/spark/pull/19407 [SPARK-21667][Streaming] ConsoleSink should not fail streaming query with checkpointLocation option ## What changes were proposed in this pull request? Fix to allow recovery on console , avoid checkpoint exception ## How was this patch tested? existing tests manual tests [ Replicating error and seeing no checkpoint error after fix] You can merge this pull request into a Git repository by running: $ git pull https://github.com/rekhajoshm/spark SPARK-21667 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19407.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19407 commit e3677c9fa9697e0d34f9df52442085a6a481c9e9 Author: Rekha JoshiDate: 2015-05-05T23:10:08Z Merge pull request #1 from apache/master Pulling functionality from apache spark commit 106fd8eee8f6a6f7c67cfc64f57c1161f76d8f75 Author: Rekha Joshi Date: 2015-05-08T21:49:09Z Merge pull request #2 from apache/master pull latest from apache spark commit 0be142d6becba7c09c6eba0b8ea1efe83d649e8c Author: Rekha Joshi Date: 2015-06-22T00:08:08Z Merge pull request #3 from apache/master Pulling functionality from apache spark commit 6c6ee12fd733e3f9902e10faf92ccb78211245e3 Author: Rekha Joshi Date: 2015-09-17T01:03:09Z Merge pull request #4 from apache/master Pulling functionality from apache spark commit b123c601e459d1ad17511fd91dd304032154882a Author: Rekha Joshi Date: 2015-11-25T18:50:32Z Merge pull request #5 from apache/master pull request from apache/master commit c73c32aadd6066e631956923725a48d98a18777e Author: Rekha Joshi Date: 2016-03-18T19:13:51Z Merge pull request #6 from apache/master pull latest from apache spark commit 7dbf7320057978526635bed09dabc8cf8657a28a Author: Rekha Joshi Date: 2016-04-05T20:26:40Z Merge pull request #8 from apache/master pull latest from apache spark commit 5e9d71827f8e2e4d07027281b80e4e073e7fecd1 Author: Rekha Joshi Date: 2017-05-01T23:00:30Z Merge pull request #9 from apache/master Pull apache spark commit 63d99b3ce5f222d7126133170a373591f0ac67dd Author: Rekha Joshi Date: 2017-09-30T22:26:44Z Merge pull request #10 from apache/master pull latest apache spark commit 57e0e26474b66afd3bd54be061a5982836e28792 Author: rjoshi2 Date: 2017-10-01T06:57:12Z [SPARK-21667][Streaming] ConsoleSink should not fail streaming query with checkpointLocation option --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org