[GitHub] spark issue #14959: [SPARK-17387][PYSPARK] Creating SparkContext() from pyth...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14959 **[Test build #66758 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66758/consoleFull)** for PR 14959 at commit [`d692e71`](https://github.com/apache/spark/commit/d692e715e9c4da3a00f18c1693b60772492a80e4).
[GitHub] spark issue #14959: [SPARK-17387][PYSPARK] Creating SparkContext() from pyth...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/14959 retest this please
[GitHub] spark pull request #13194: [SPARK-15402] [ML] [PySpark] PySpark ml.evaluatio...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/13194#discussion_r82877590

Diff: python/pyspark/ml/evaluation.py

```diff
@@ -311,19 +330,25 @@ def setParams(self, predictionCol="prediction", labelCol="label",
 if __name__ == "__main__":
     import doctest
+    import tempfile
+    import pyspark.ml.evaluation
     from pyspark.sql import SparkSession
-    globs = globals().copy()
+    globs = pyspark.ml.evaluation.__dict__.copy()
     # The small batch size here ensures that we see multiple batches,
     # even in these small test examples:
     spark = SparkSession.builder\
         .master("local[2]")\
         .appName("ml.evaluation tests")\
         .getOrCreate()
-    sc = spark.sparkContext
-    globs['sc'] = sc
     globs['spark'] = spark
-    (failure_count, test_count) = doctest.testmod(
-        globs=globs, optionflags=doctest.ELLIPSIS)
-    spark.stop()
-    if failure_count:
-        exit(-1)
```

should there still be an `exit(-1)` when there are failures? If not, then you can remove the returned variables from `doctest.testmod`
[GitHub] spark pull request #15338: [SPARK-11653][Deploy] Allow spark-daemon.sh to ru...
Github user jodersky commented on a diff in the pull request: https://github.com/apache/spark/pull/15338#discussion_r82866433

Diff: sbin/spark-daemon.sh

```diff
@@ -122,6 +123,35 @@ if [ "$SPARK_NICENESS" = "" ]; then
   export SPARK_NICENESS=0
 fi
 
+execute_command() {
+  command="$@"
+  if [ "$SPARK_NO_DAEMONIZE" != "" ]; then
```

this only checks whether the variable is non-empty, so setting it to the empty string will not work; e.g. `export SPARK_NO_DAEMONIZE=''` will not satisfy the condition. Something like `if [ -z ${SPARK_NO_DAEMONIZE+set} ];` will also handle the null case (see [this SO thread](http://stackoverflow.com/questions/3601515/how-to-check-if-a-variable-is-set-in-bash/3601734#3601734)). Another, more elegant solution would be `[[ -v SPARK_NO_DAEMONIZE ]]`; however, that requires at least bash 4.2 and is probably not available on all systems that Spark runs on.
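To illustrate the distinction, a minimal bash sketch (not part of the PR; the `echo`s are placeholders):

```bash
#!/usr/bin/env bash
# "${VAR+set}" expands to "set" whenever VAR is set, even to '', and to "" when unset.

unset SPARK_NO_DAEMONIZE
[ -z "${SPARK_NO_DAEMONIZE+set}" ] && echo "unset"    # prints "unset"

export SPARK_NO_DAEMONIZE=''
[ -z "${SPARK_NO_DAEMONIZE+set}" ] && echo "unset"    # prints nothing: the variable is set
[ "$SPARK_NO_DAEMONIZE" != "" ] && echo "non-empty"   # prints nothing either, so the
                                                      # non-empty test mistakes "set to ''"
                                                      # for "unset"
```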
[GitHub] spark pull request #15338: [SPARK-11653][Deploy] Allow spark-daemon.sh to ru...
Github user jodersky commented on a diff in the pull request: https://github.com/apache/spark/pull/15338#discussion_r82874206

Diff: sbin/spark-daemon.sh

```diff
@@ -146,13 +176,11 @@ run_command() {
 
   case "$mode" in
 
     (class)
-      nohup nice -n "$SPARK_NICENESS" "${SPARK_HOME}"/bin/spark-class $command "$@" >> "$log" 2>&1 < /dev/null &
-      newpid="$!"
+      execute_command "nice -n \"$SPARK_NICENESS\" \"${SPARK_HOME}/bin/spark-class\" $command $@"
       ;;
 
     (submit)
-      nohup nice -n "$SPARK_NICENESS" "${SPARK_HOME}"/bin/spark-submit --class $command "$@" >> "$log" 2>&1 < /dev/null &
-      newpid="$!"
+      execute_command "nice -n \"$SPARK_NICENESS\" \"${SPARK_HOME}/bin/spark-submit\" --class $command $@"
```

same as above
[GitHub] spark pull request #15338: [SPARK-11653][Deploy] Allow spark-daemon.sh to ru...
Github user jodersky commented on a diff in the pull request: https://github.com/apache/spark/pull/15338#discussion_r82874084

Diff: sbin/spark-daemon.sh

```diff
@@ -146,13 +176,11 @@ run_command() {
 
   case "$mode" in
 
     (class)
-      nohup nice -n "$SPARK_NICENESS" "${SPARK_HOME}"/bin/spark-class $command "$@" >> "$log" 2>&1 < /dev/null &
-      newpid="$!"
+      execute_command "nice -n \"$SPARK_NICENESS\" \"${SPARK_HOME}/bin/spark-class\" $command $@"
```

the outer-most quotes aren't required without the eval. `execute_command nice -n "$SPARK_NICENESS" "${SPARK_HOME}/bin/spark-class" $command $@` should do the trick
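A sketch of why no quoting gymnastics are needed once `eval` is gone, assuming a simplified `execute_command` that just runs its argument list (`$command` and `$@` stand in for `run_command`'s variables):

```bash
# Hypothetical, simplified execute_command: run the arguments directly.
# "$@" preserves each argument's word boundaries, so values like
# "$SPARK_NICENESS" arrive intact without building and eval-ing a string.
execute_command() {
  "$@"
}

execute_command nice -n "$SPARK_NICENESS" "${SPARK_HOME}/bin/spark-class" $command "$@"
```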
[GitHub] spark pull request #15338: [SPARK-11653][Deploy] Allow spark-daemon.sh to ru...
Github user jodersky commented on a diff in the pull request: https://github.com/apache/spark/pull/15338#discussion_r82875044

Diff: sbin/spark-daemon.sh

```diff
@@ -122,6 +123,35 @@ if [ "$SPARK_NICENESS" = "" ]; then
   export SPARK_NICENESS=0
 fi
 
+execute_command() {
+  command="$@"
```

Spark scripts require bash as the interpreter, so we can make this a local variable
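For illustration, the bash-only form the comment suggests (a sketch, not the PR's code):

```bash
execute_command() {
  # `local` is a bash builtin, so this relies on the scripts using bash.
  # Assigning "$@" to a scalar joins the arguments with spaces, exactly as
  # before; only the variable's visibility changes.
  local command="$@"
  # ... rest of the function unchanged
}
```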
[GitHub] spark pull request #15338: [SPARK-11653][Deploy] Allow spark-daemon.sh to ru...
Github user jodersky commented on a diff in the pull request: https://github.com/apache/spark/pull/15338#discussion_r82871978

Diff: sbin/spark-daemon.sh

```diff
@@ -122,6 +123,35 @@ if [ "$SPARK_NICENESS" = "" ]; then
   export SPARK_NICENESS=0
 fi
 
+execute_command() {
+  command="$@"
+  if [ "$SPARK_NO_DAEMONIZE" != "" ]; then
+    eval $command
```

I don't think `eval` is required here
[GitHub] spark issue #15307: [SPARK-17731][SQL][STREAMING] Metrics for structured str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15307 Build finished. Test FAILed.
[GitHub] spark pull request #15338: [SPARK-11653][Deploy] Allow spark-daemon.sh to ru...
Github user jodersky commented on a diff in the pull request: https://github.com/apache/spark/pull/15338#discussion_r82872476

Diff: sbin/spark-daemon.sh

```diff
@@ -122,6 +123,35 @@ if [ "$SPARK_NICENESS" = "" ]; then
   export SPARK_NICENESS=0
 fi
 
+execute_command() {
+  command="$@"
+  if [ "$SPARK_NO_DAEMONIZE" != "" ]; then
+    eval $command
+  else
+    eval "nohup $command >> \"$log\" 2>&1 < /dev/null &"
```

same here
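Combining the review points, a hedged sketch of `execute_command` without `eval`, assuming the command arrives as an argument list rather than a single string (`$log` and `newpid` come from the surrounding script):

```bash
execute_command() {
  if [ -z "${SPARK_NO_DAEMONIZE+set}" ]; then
    # Variable unset: keep the existing behavior and daemonize the command.
    nohup "$@" >> "$log" 2>&1 < /dev/null &
    newpid="$!"
  else
    # Variable set (even to ''): run the command in the foreground.
    "$@"
  fi
}
```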
[GitHub] spark issue #15307: [SPARK-17731][SQL][STREAMING] Metrics for structured str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15307 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66757/ Test FAILed.
[GitHub] spark issue #15307: [SPARK-17731][SQL][STREAMING] Metrics for structured str...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15307 **[Test build #66757 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66757/consoleFull)** for PR 15307 at commit [`dca9939`](https://github.com/apache/spark/commit/dca9939487e77bd739c08149df495b89e5898656).
* This patch **fails Python style tests**.
* This patch **does not merge cleanly**.
* This patch adds no public classes.
[GitHub] spark issue #15410: [SPARK-17843][Web UI] Indicate event logs pending for pr...
Github user ajbozarth commented on the issue: https://github.com/apache/spark/pull/15410 Just to raise an idea that might mean less code change: would it work to simply have a flag that causes a "currently processing applications" type message to display, without an actual count? Overall I think this is a good addition though.
[GitHub] spark issue #15307: [SPARK-17731][SQL][STREAMING] Metrics for structured str...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15307 **[Test build #66757 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66757/consoleFull)** for PR 15307 at commit [`dca9939`](https://github.com/apache/spark/commit/dca9939487e77bd739c08149df495b89e5898656).
[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/9766#discussion_r82876529

Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala

```diff
@@ -412,6 +419,63 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends
   //
 
   /**
+   * Register a Java UDF class
+   * @param name
+   * @param className
+   * @param returnType
+   */
+  def registerJava(name: String, className: String, returnType: DataType): Unit = {
```

+1
[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/9766#discussion_r82876442

Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala

```diff
@@ -412,6 +419,63 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends
   //
 
   /**
+   * Register a Java UDF class
+   * @param name
+   * @param className
+   * @param returnType
+   */
+  def registerJava(name: String, className: String, returnType: DataType): Unit = {
+
+    try {
+      // scalastyle:off classforname
```

This style rule is here to prevent misuse. Is there a reason we aren't using our utility functions?
[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/9766#discussion_r82876419

Diff: python/pyspark/sql/context.py

```diff
@@ -202,6 +202,26 @@ def registerFunction(self, name, f, returnType=StringType()):
         """
         self.sparkSession.catalog.registerFunction(name, f, returnType)
 
+    @ignore_unicode_prefix
+    @since(2.1)
+    def registerJavaFunction(self, name, javaClassName, returnType=StringType()):
+        """Register a java UDF so it can be used in SQL statements.
+
+        In addition to a name and the function itself, the return type can be optionally specified.
+        When the return type is not given it default to a string and conversion will automatically
```

Where does this conversion happen? Are we sure that it works (given there are no tests that I can see)?
[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/9766#discussion_r82876509

Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala

```diff
@@ -17,9 +17,15 @@
 
 package org.apache.spark.sql
 
+import java.io.IOException
+import java.util.{List => JList, Map => JMap}
+
 import scala.reflect.runtime.universe.TypeTag
 import scala.util.Try
 
+import sun.reflect.generics.reflectiveObjects.ParameterizedTypeImpl
```

Is this JVM specific? What is this being used for? Is there another way?
[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/14690#discussion_r82875191

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala

```diff
@@ -225,13 +225,16 @@ case class FileSourceScanExec(
   }
 
   // These metadata values make scan plans uniquely identifiable for equality checking.
-  override val metadata: Map[String, String] = Map(
-    "Format" -> relation.fileFormat.toString,
-    "ReadSchema" -> outputSchema.catalogString,
-    "Batched" -> supportsBatch.toString,
-    "PartitionFilters" -> partitionFilters.mkString("[", ", ", "]"),
-    "PushedFilters" -> dataFilters.mkString("[", ", ", "]"),
-    "InputPaths" -> relation.location.paths.mkString(", "))
+  override val metadata: Map[String, String] = {
+    def seqToString(seq: Seq[Any]) = seq.mkString("[", ", ", "]")
+    val location = relation.location
+    Map("Format" -> relation.fileFormat.toString,
+      "ReadSchema" -> outputSchema.catalogString,
+      "Batched" -> supportsBatch.toString,
+      "PartitionFilters" -> seqToString(partitionFilters),
+      "PushedFilters" -> seqToString(dataFilters),
+      "Location" -> (location.getClass.getSimpleName + seqToString(location.rootPaths)))
```

Is it different from `location.getRootPaths.length`? I guess that would only be meaningful for Hive-backed tables, but that seems ok. Btw, you might want to `take(10)` on the root paths to avoid materializing a large string here before it gets truncated.
[GitHub] spark issue #15375: [SPARK-17790][SPARKR] Support for parallelizing R data.f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15375 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66750/ Test FAILed.
[GitHub] spark issue #15425: [SPARK-17816] [Core] [Branch-2.0] Fix ConcurrentModifica...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/15425 LGTM. Merging to 2.0. Thanks!
[GitHub] spark issue #15375: [SPARK-17790][SPARKR] Support for parallelizing R data.f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15375 Merged build finished. Test FAILed.
[GitHub] spark issue #15375: [SPARK-17790][SPARKR] Support for parallelizing R data.f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15375 **[Test build #66750 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66750/consoleFull)** for PR 15375 at commit [`836e874`](https://github.com/apache/spark/commit/836e8745c346c59f78958e10aec1c6f9537242b9).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #15431: [SPARK-15153] [ML] [SparkR] Fix SparkR spark.naiv...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15431
[GitHub] spark issue #15431: [SPARK-15153] [ML] [SparkR] Fix SparkR spark.naiveBayes ...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15431 I'll go ahead and merge with master. Thanks!
[GitHub] spark pull request #15307: [SPARK-17731][SQL][STREAMING] Metrics for structu...
Github user brkyvz commented on a diff in the pull request: https://github.com/apache/spark/pull/15307#discussion_r82872469

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

```diff
@@ -176,7 +184,9 @@ class StreamExecution(
     // Mark ACTIVE and then post the event. QueryStarted event is synchronously sent to listeners,
     // so must mark this as ACTIVE first.
     state = ACTIVE
-    postEvent(new QueryStarted(this.toInfo)) // Assumption: Does not throw exception.
+    sparkSession.sparkContext.env.metricsSystem.registerSource(streamMetrics)
+    updateStatus()
+    postEvent(new QueryStarted(currentStatus)) // Assumption: Does not throw exception.
```

it's good to be consistent; that way, if you want to change the behavior of `status`, for example, this line will not be left orphaned.
[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14690 **[Test build #66751 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66751/consoleFull)** for PR 14690 at commit [`2762efd`](https://github.com/apache/spark/commit/2762efd925325a9ab732b34e6672b704dc514062).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14690 Merged build finished. Test FAILed.
[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14690 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66751/ Test FAILed.
[GitHub] spark issue #15408: [SPARK-17839][CORE] Use Nio's directbuffer instead of Bu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15408 Merged build finished. Test FAILed.
[GitHub] spark issue #15408: [SPARK-17839][CORE] Use Nio's directbuffer instead of Bu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15408 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66749/ Test FAILed.
[GitHub] spark issue #15408: [SPARK-17839][CORE] Use Nio's directbuffer instead of Bu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15408 **[Test build #66749 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66749/consoleFull)** for PR 15408 at commit [`30173fa`](https://github.com/apache/spark/commit/30173facf79e03469291199807f84368a320e262).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82871238

Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala

```diff
@@ -0,0 +1,146 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.Random
+
+import breeze.linalg.normalize
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.linalg.{BLAS, Vector, Vectors, VectorUDT}
+import org.apache.spark.ml.param.{DoubleParam, Params, ParamValidators}
+import org.apache.spark.ml.param.shared.HasSeed
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.sql.types.StructType
+
+/**
+ * :: Experimental ::
+ * Params for [[RandomProjection]].
+ */
+@Since("2.1.0")
+private[ml] trait RandomProjectionParams extends Params {
+
+  /**
+   * The length of each hash bucket, a larger bucket lowers the false negative rate.
+   *
+   * If input vectors are normalized, 1-10 times of pow(numRecords, -1/inputDim) would be a
+   * reasonable value
+   * @group param
+   */
+  @Since("2.1.0")
+  val bucketLength: DoubleParam = new DoubleParam(this, "bucketLength",
+    "the length of each hash bucket, a larger bucket lowers the false negative rate.",
+    ParamValidators.gt(0))
+
+  /** @group getParam */
+  @Since("2.1.0")
+  final def getBucketLength: Double = $(bucketLength)
+}
+
+/**
+ * :: Experimental ::
+ * Model produced by [[RandomProjection]]
+ * @param randUnitVectors An array of random unit vectors. Each vector represents a hash function.
+ */
+@Experimental
+@Since("2.1.0")
+class RandomProjectionModel private[ml] (
+    override val uid: String,
+    val randUnitVectors: Array[Vector])
```

I wanted to use `Matrix`, but it turns out that `Array[Vector]` makes normalizing each vector and taking the floor of the values much cleaner. Is there any reason we should use `Matrix` here?
[GitHub] spark pull request #15335: [SPARK-17769][Core][Scheduler]Some FetchFailure r...
Github user markhamstra commented on a diff in the pull request: https://github.com/apache/spark/pull/15335#discussion_r82869819

Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala

```diff
@@ -1255,27 +1255,46 @@ class DAGScheduler(
             s"longer running")
         }
 
-        if (disallowStageRetryForTest) {
-          abortStage(failedStage, "Fetch failure will not retry stage due to testing config",
-            None)
-        } else if (failedStage.failedOnFetchAndShouldAbort(task.stageAttemptId)) {
-          abortStage(failedStage, s"$failedStage (${failedStage.name}) " +
-            s"has failed the maximum allowable number of " +
-            s"times: ${Stage.MAX_CONSECUTIVE_FETCH_FAILURES}. " +
-            s"Most recent failure reason: ${failureMessage}", None)
-        } else {
-          if (failedStages.isEmpty) {
-            // Don't schedule an event to resubmit failed stages if failed isn't empty, because
-            // in that case the event will already have been scheduled.
-            // TODO: Cancel running tasks in the stage
-            logInfo(s"Resubmitting $mapStage (${mapStage.name}) and " +
-              s"$failedStage (${failedStage.name}) due to fetch failure")
-            messageScheduler.schedule(new Runnable {
-              override def run(): Unit = eventProcessLoop.post(ResubmitFailedStages)
-            }, DAGScheduler.RESUBMIT_TIMEOUT, TimeUnit.MILLISECONDS)
+        val shouldAbortStage =
+          failedStage.failedOnFetchAndShouldAbort(task.stageAttemptId) ||
+          disallowStageRetryForTest
+
+        if (shouldAbortStage) {
+          val abortMessage = if (disallowStageRetryForTest) {
+            "Fetch failure will not retry stage due to testing config"
+          } else {
+            s"""$failedStage (${failedStage.name})
+               |has failed the maximum allowable number of
+               |times: ${Stage.MAX_CONSECUTIVE_FETCH_FAILURES}.
+               |Most recent failure reason: $failureMessage""".stripMargin.replaceAll("\n", " ")
           }
+          abortStage(failedStage, abortMessage, None)
+        } else { // update failedStages and make sure a ResubmitFailedStages event is enqueued
+          // TODO: Cancel running tasks in the failed stage -- cf. SPARK-17064
+          val noResubmitEnqueued = !failedStages.contains(failedStage)
           failedStages += failedStage
           failedStages += mapStage
+          if (noResubmitEnqueued) {
+            // We expect one executor failure to trigger many FetchFailures in rapid succession,
+            // but all of those task failures can typically be handled by a single resubmission of
+            // the failed stage. We avoid flooding the scheduler's event queue with resubmit
+            // messages by checking whether a resubmit is already in the event queue for the
+            // failed stage. If there is already a resubmit enqueued for a different failed
+            // stage, that event would also be sufficient to handle the current failed stage, but
+            // producing a resubmit for each failed stage makes debugging and logging a little
+            // simpler while not producing an overwhelming number of scheduler events.
+            logInfo(
+              s"Resubmitting $mapStage (${mapStage.name}) and " +
+              s"$failedStage (${failedStage.name}) due to fetch failure"
+            )
+            messageScheduler.schedule(
```

1. I don't like "Periodically" in your suggested comment, since this is a one-shot action after a delay of RESUBMIT_TIMEOUT milliseconds.
2. I agree that this delay-before-resubmit logic is suspect. If we are both thinking correctly that a 200 ms delay on top of the time to re-run the `mapStage` is all but inconsequential, then removing it in this PR would be fine. If there are unanticipated consequences, though, I'd prefer to have that change in a separate PR.
[GitHub] spark pull request #15382: [SPARK-17810] [SQL] Default spark.sql.warehouse.d...
Github user avulanov commented on a diff in the pull request: https://github.com/apache/spark/pull/15382#discussion_r82869165

Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

```diff
@@ -757,7 +758,10 @@ private[sql] class SQLConf extends Serializable with CatalystConf with Logging {
 
   def variableSubstituteDepth: Int = getConf(VARIABLE_SUBSTITUTE_DEPTH)
 
-  def warehousePath: String = new Path(getConf(WAREHOUSE_PATH)).toString
+  def warehousePath: String = {
+    val path = new Path(getConf(WAREHOUSE_PATH))
+    FileSystem.get(path.toUri, new Configuration()).makeQualified(path).toString
```

You [mentioned](https://github.com/apache/spark/pull/13868#discussion_r82154809) that the original issue is as follows: _"...the usages of the new makeQualifiedPath are a bit wrong in that they explicitly resolve the path against the Hadoop file system, which can be HDFS."_ Should we rather look into the code that does `makeQualifiedPath` on `warehousePath` given the Hadoop FS configuration? The fix would be to have a special case for paths that do not have a scheme. Actually, could you give a link to this code? I could not find it right away.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 It might not be an R-specific issue. I am trying to create a test case on the Scala side in SQLUtilsSuite.scala.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 I think we should find out the root cause of the negative length of the "NA" field. Yesterday I debugged the R side, and I have not found the reason yet.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 New test on Mac:

```
> df <- data.frame(Date = as.Date(c(rep("2016-01-10", 10), "NA", "NA")), id = 1:12)
>
> dim(createDataFrame(df))
16/10/11 12:10:30 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.IllegalArgumentException: Invalid type N
	at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:86)
	at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:61)
	at org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:161)
	at org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:160)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.immutable.Range.foreach(Range.scala:160)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at org.apache.spark.sql.api.r.SQLUtils$.bytesToRow(SQLUtils.scala:160)
	at org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
	at org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
```
[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13440 **[Test build #66756 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66756/consoleFull)** for PR 13440 at commit [`83f5e83`](https://github.com/apache/spark/commit/83f5e83fb87407bdd7dc8d740fba6fb30d1da3aa).
[GitHub] spark issue #15410: [SPARK-17843][Web UI] Indicate event logs pending for pr...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/15410 ah, yeah, startup would definitely be a good case for this, and like I mentioned it's better than nothing, so I'm OK with the concept. For the other use case, where it hasn't looked in ~10 seconds, I wonder if it would be clearer to users to put a little string at the top like "last updated at XX:XX:XX" or "app list current as of XX:XX:XX".
[GitHub] spark pull request #15421: [SPARK-17811] SparkR cannot parallelize data.fram...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/15421#discussion_r82865118

Diff: core/src/main/scala/org/apache/spark/api/r/SerDe.scala

```diff
@@ -125,15 +125,24 @@ private[spark] object SerDe {
   }
 
   def readDate(in: DataInputStream): Date = {
-    Date.valueOf(readString(in))
+    val inStr = readString(in)
```

Also seen on Mac. I am building with your new patch to test again on Mac.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 MacBook Pro (Retina, 15-inch, Mid 2015). This is the machine that I tested the patch on.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15421 **[Test build #66755 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66755/consoleFull)** for PR 15421 at commit [`59827a1`](https://github.com/apache/spark/commit/59827a19db93604120dc7229f6ed82777b4cd354).
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 @falaki I saw the exception on Mac too, but I haven't found the root cause of the negative length in the input stream. Catching the exception will solve the problem. Do you want to explore the reason for the negative index?
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user falaki commented on the issue: https://github.com/apache/spark/pull/15421 @wangmiao1981 thanks for testing on Windows. I added a check for this. Would you please try again and let me know? Unfortunately, I don't have access to a Windows box for testing.
[GitHub] spark pull request #15421: [SPARK-17811] SparkR cannot parallelize data.fram...
Github user falaki commented on a diff in the pull request: https://github.com/apache/spark/pull/15421#discussion_r82864117

Diff: core/src/main/scala/org/apache/spark/api/r/SerDe.scala

```diff
@@ -125,15 +125,24 @@ private[spark] object SerDe {
   }
 
   def readDate(in: DataInputStream): Date = {
-    Date.valueOf(readString(in))
+    val inStr = readString(in)
```

That is only on Windows, right? I added a catch for it.
[GitHub] spark pull request #15335: [SPARK-17769][Core][Scheduler]Some FetchFailure r...
Github user markhamstra commented on a diff in the pull request: https://github.com/apache/spark/pull/15335#discussion_r82864060

Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala

```diff
@@ -1255,27 +1255,46 @@ class DAGScheduler(
             s"longer running")
         }
 
-        if (disallowStageRetryForTest) {
-          abortStage(failedStage, "Fetch failure will not retry stage due to testing config",
-            None)
-        } else if (failedStage.failedOnFetchAndShouldAbort(task.stageAttemptId)) {
-          abortStage(failedStage, s"$failedStage (${failedStage.name}) " +
-            s"has failed the maximum allowable number of " +
-            s"times: ${Stage.MAX_CONSECUTIVE_FETCH_FAILURES}. " +
-            s"Most recent failure reason: ${failureMessage}", None)
-        } else {
-          if (failedStages.isEmpty) {
-            // Don't schedule an event to resubmit failed stages if failed isn't empty, because
-            // in that case the event will already have been scheduled.
-            // TODO: Cancel running tasks in the stage
-            logInfo(s"Resubmitting $mapStage (${mapStage.name}) and " +
-              s"$failedStage (${failedStage.name}) due to fetch failure")
-            messageScheduler.schedule(new Runnable {
-              override def run(): Unit = eventProcessLoop.post(ResubmitFailedStages)
-            }, DAGScheduler.RESUBMIT_TIMEOUT, TimeUnit.MILLISECONDS)
+        val shouldAbortStage =
+          failedStage.failedOnFetchAndShouldAbort(task.stageAttemptId) ||
+          disallowStageRetryForTest
+
+        if (shouldAbortStage) {
+          val abortMessage = if (disallowStageRetryForTest) {
+            "Fetch failure will not retry stage due to testing config"
+          } else {
+            s"""$failedStage (${failedStage.name})
+               |has failed the maximum allowable number of
+               |times: ${Stage.MAX_CONSECUTIVE_FETCH_FAILURES}.
+               |Most recent failure reason: $failureMessage""".stripMargin.replaceAll("\n", " ")
           }
+          abortStage(failedStage, abortMessage, None)
+        } else { // update failedStages and make sure a ResubmitFailedStages event is enqueued
+          // TODO: Cancel running tasks in the failed stage -- cf. SPARK-17064
+          val noResubmitEnqueued = !failedStages.contains(failedStage)
```

Right here is the only place we put anything into `failedStages`, so `failedStage` and `mapStage` always go in as pairs. The only places where we remove things from `failedStages` are `resubmitFailedStages` and `DAGScheduler#cleanupStateForJobAndIndependentStages#removeStage`. We clear `failedStages` in `resubmitFailedStages`, so the only place where `failedStage` and `mapStage` could get unpaired in `failedStages` is in `cleanupStateForJobAndIndependentStages#removeStage`. That would happen if the number of Jobs that use `failedStage` and `mapStage` is unequal. If I'm thinking correctly, that could only happen if the `mapStage` is used by more Jobs than the `failedStage` is. In that case, cleaning up the last Job that uses `failedStage` would remove `failedStage` from `failedStages` while `mapStage` would remain. To fall into your proposed `|| !failedStages.contains(mapStage)` branch, another `failedStage` needing `mapStage`, this time coming from one of the remaining Jobs using `mapStage`, would need to fail. If that is the case, then we still want to log the failure of the new `failedStage`, so I don't think we want `|| !failedStages.contains(mapStage)` -- without it, we'll get a duplicate of `mapStage` added to `failedStages`, but that's no big deal.
[GitHub] spark pull request #15421: [SPARK-17811] SparkR cannot parallelize data.fram...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/15421#discussion_r82863921

Diff: core/src/main/scala/org/apache/spark/api/r/SerDe.scala

```diff
@@ -125,15 +125,24 @@ private[spark] object SerDe {
   }
 
   def readDate(in: DataInputStream): Date = {
-    Date.valueOf(readString(in))
+    val inStr = readString(in)
```

readString eventually gets a negative len, which causes the failure.
[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14690

> > Finally, this would require us to read the schema files. That's something I'm trying to avoid in this patch.
>
> Not sure what you mean here, but the parquet change should be execution time only. I'll submit a pr here for that.

Okay. I look forward to seeing that.
[GitHub] spark issue #15431: [SPARK-15153] [ML] [SparkR] Fix SparkR spark.naiveBayes ...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15431 LGTM2 Thanks! Is it fine with you if this just gets fixed in master, not branch-2.0 (since the other PR is not in branch-2.0, as it adds a new public API)? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 @falaki I patched your fix to a clean build. I still see the following error:
```
> df <- data.frame(Date = as.Date(c(rep("2016-01-10", 10), "NA", "NA")), id = 1:12)
> dim(createDataFrame(df))
16/10/11 11:51:00 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NegativeArraySizeException
  at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:110)
  at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:119)
  at org.apache.spark.api.r.SerDe$.readDate(SerDe.scala:128)
  at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:77)
  at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:61)
  at org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:161)
  at org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:160)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.Range.foreach(Range.scala:160)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at org.apache.spark.sql.api.r.SQLUtils$.bytesToRow(SQLUtils.scala:160)
  at org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
  at org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
```
I am still debugging. It seems that the source on the R side has some issue. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] collect() head() and show() for Co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/11336 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] collect() head() and show() for Co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/11336 **[Test build #66753 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66753/consoleFull)** for PR 11336 at commit [`8f906a2`](https://github.com/apache/spark/commit/8f906a2e8050c391355c3ddab811d990020728cb). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] collect() head() and show() for Co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/11336 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66753/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/15421 hmm, still the same error in the new test case in appveyor ``` Failed - 1. Error: SPARK-17811: can create DataFrame containing NA as date and time (@test_sparkSQL.R#388) org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 105.0 failed 1 times, most recent failure: Lost task 0.0 in stage 105.0 (TID 109, localhost): java.lang.NegativeArraySizeException at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:110) at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:119) at org.apache.spark.api.r.SerDe$.readDate(SerDe.scala:128) ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...
Github user ericl commented on the issue: https://github.com/apache/spark/pull/14690 > For one thing, a ListingFileCatalog performs a file tree traversal right off the bat. However, the external catalog returns the locations of partitions as part of the listPartitionsByFilter call. I believe that should suffice for the purpose of building a query plan for metastore-backed tables and executing it. You'd have to re-implement a large portion of the parallel traversal logic here, right? I think we should keep this PR minimal and leave that for future work. I am also thinking of adding a per-directory file listing cache as a followup to avoid performance regressions, which would likely involve refactoring this path anyway. > I would be wary of amending our data sources to support case-insensitive field resolution. For one thing, strictly speaking it can lead to ambiguity in schema resolution. In the—potential but unlikely—event that a (case-sensitive) data source schema has two distinct fields x1 and x2 such that x1.toLowerCase == x2.toLowerCase we're going to get undefined behavior. > For another, for case-sensitive data sources this adds code complexity in their implementation. I do agree this might be an issue with other datasources. For parquet though, I talked with @liancheng and we don't think there are any issues with supporting case-insensitive field resolution. Given that, I think we can also leave this for future work when we add datasource table support. It might also be that we need to add back something like https://github.com/apache/spark/pull/14750 > Finally, this would require us to read the schema files. That's something I'm trying to avoid in this patch. Not sure what you mean here, but the parquet change should be execution time only. I'll submit a PR here for that. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
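As a side note on the ambiguity concern, a small sketch (generic Scala, not tied to the Parquet reader) of what a safe case-insensitive resolver would have to do, failing loudly on the `x1.toLowerCase == x2.toLowerCase` collision described above:

```scala
// Resolve a requested field name case-insensitively against a
// case-sensitive schema, refusing to guess when two fields collide.
def resolveFieldCaseInsensitive(schema: Seq[String], name: String): Option[String] =
  schema.filter(_.equalsIgnoreCase(name)) match {
    case Seq(unique) => Some(unique)
    case Seq()       => None
    case ambiguous   => throw new RuntimeException(
      s"""Ambiguous case-insensitive match for "$name": ${ambiguous.mkString(", ")}""")
  }

// resolveFieldCaseInsensitive(Seq("colA", "colB"), "cola") => Some("colA")
// resolveFieldCaseInsensitive(Seq("col", "COL"), "Col")    => throws
```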
[GitHub] spark issue #15431: [SPARK-15153] [ML] [SparkR] Fix SparkR spark.naiveBayes ...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/15431 LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15389: [SPARK-17817][PySpark] PySpark RDD Repartitioning...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15389 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15421 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15421 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66746/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15295: [SPARK-17720][SQL] introduce static SQL conf
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15295 LGTM except that one comment on naming. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15295: [SPARK-17720][SQL] introduce static SQL conf
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/15295#discussion_r82860005 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RuntimeConfig.scala --- @@ -132,4 +136,9 @@ class RuntimeConfig private[sql](sqlConf: SQLConf = new SQLConf) { sqlConf.contains(key) } + private def assertNotGlobalSQLConf(key: String): Unit = { --- End diff -- rename to assertNonStaticConf --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15295: [SPARK-17720][SQL] introduce static SQL conf
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/15295#discussion_r82859976 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RuntimeConfig.scala --- @@ -36,6 +37,7 @@ class RuntimeConfig private[sql](sqlConf: SQLConf = new SQLConf) { * @since 2.0.0 */ def set(key: String, value: String): Unit = { +assertNotGlobalSQLConf(key) --- End diff -- should be assertNonStaticConf here --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15421 **[Test build #66746 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66746/consoleFull)** for PR 15421 at commit [`82ec5c8`](https://github.com/apache/spark/commit/82ec5c81e0c0e48bfb6008dcdeef544457cfc014). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14719: [SPARK-17154][SQL] Wrong result can be returned or Analy...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14719 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66747/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14719: [SPARK-17154][SQL] Wrong result can be returned or Analy...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14719 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14719: [SPARK-17154][SQL] Wrong result can be returned or Analy...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14719 **[Test build #66747 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66747/consoleFull)** for PR 14719 at commit [`9ad2c85`](https://github.com/apache/spark/commit/9ad2c85241460aa878867a775cd13aca14e72324). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15432: [SPARK-17854][SQL] rand/randn allows null as input seed
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15432 hm - maybe we should just cast any NullType input into some concrete type defined by an ExpectsInputTypes expression? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
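A toy model of that suggestion (deliberately standalone; the real change would live in Catalyst's type-coercion rules, and these case classes are stand-ins, not Catalyst's):

```scala
sealed trait DataType
case object NullType extends DataType
case object LongType extends DataType

final case class Literal(value: Any, dataType: DataType)

// An expression such as rand(seed) that expects a LongType seed would
// rewrite an untyped null argument into a typed null before evaluation,
// i.e. rand(null) behaves like rand(CAST(null AS BIGINT)).
def coerceToExpected(arg: Literal, expected: DataType): Literal = arg match {
  case Literal(null, NullType) => Literal(null, expected)
  case other                   => other
}

// coerceToExpected(Literal(null, NullType), LongType) == Literal(null, LongType)
```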
[GitHub] spark issue #15437: [SPARK-17876] Write StructuredStreaming WAL to a stream ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15437 **[Test build #66754 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66754/consoleFull)** for PR 15437 at commit [`4d50be5`](https://github.com/apache/spark/commit/4d50be565841bdb6d647c75e36168151e1e8a621). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15307: [SPARK-17731][SQL][STREAMING] Metrics for structu...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/15307#discussion_r82858522 --- Diff: python/pyspark/sql/streaming.py --- @@ -189,6 +189,282 @@ def resetTerminated(self): self._jsqm.resetTerminated() +class StreamingQueryStatus(object): +"""A class used to report information about the progress of a StreamingQuery. + +.. note:: Experimental + +.. versionadded:: 2.1 +""" + +def __init__(self, jsqs): +self._jsqs = jsqs + +def __str__(self): +""" +Pretty string of this query status. + +>>> print(sqs) +StreamingQueryStatus: +Query name: query +Query id: 1 +Status timestamp: 123 +Input rate: 1.0 rows/sec +Processing rate 2.0 rows/sec +Latency: 345.0 ms +Trigger status: +key: value +Source statuses [1 source]: +Source 1:MySource1 +Available offset: #0 +Input rate: 4.0 rows/sec +Processing rate: 5.0 rows/sec +Trigger status: +key: value +Sink status: MySink +Committed offsets: [#1, -] +""" +return self._jsqs.toString() + +@property +@ignore_unicode_prefix +@since(2.1) +def name(self): +""" +Name of the query. This name is unique across all active queries. + +>>> sqs.name +u'query' +""" +return self._jsqs.name() + +@property +@since(2.1) +def id(self): +""" +Id of the query. This id is unique across all queries that have been started in +the current process. + +>>> int(sqs.id) +1 +""" +return self._jsqs.id() + +@property +@since(2.1) +def timestamp(self): +""" +Timestamp (ms) of when this query was generated. + +>>> int(sqs.timestamp) +123 +""" +return self._jsqs.timestamp() + +@property +@since(2.1) +def inputRate(self): +""" +Current rate (rows/sec) at which data is being generated by all the sources. + +>>> sqs.inputRate +1.0 +""" +return self._jsqs.inputRate() + +@property +@since(2.1) +def processingRate(self): +""" +Current rate (rows/sec) at which the query is processing data from all the sources. + +>>> sqs.processingRate +2.0 +""" +return self._jsqs.processingRate() + +@property +@since(2.1) +def latency(self): +""" +Current average latency between the data being available in source and the sink +writing the corresponding output. + +>>> sqs.latency +345.0 +""" +if (self._jsqs.latency().nonEmpty()): +return self._jsqs.latency().get() +else: +return None + +@property +@since(2.1) +def sourceStatuses(self): +""" +Current statuses of the sources. + +>>> len(sqs.sourceStatuses) +1 +>>> sqs.sourceStatuses[0].description +u'MySource1' +""" +return [SourceStatus(ss) for ss in self._jsqs.sourceStatuses()] + +@property +@since(2.1) +def sinkStatus(self): +""" +Current status of the sink. + +>>> sqs.sinkStatus.description +u'MySink' +""" +return SinkStatus(self._jsqs.sinkStatus()) + +@property +@since(2.1) +def triggerStatus(self): +""" +Low-level detailed status of the last completed/currently active trigger. + +>>> sqs.triggerStatus +{u'key': u'value'} --- End diff -- Ah right, this also shown as examples in the doc. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15437: [SPARK-17876] Write StructuredStreaming WAL to a ...
GitHub user brkyvz opened a pull request: https://github.com/apache/spark/pull/15437 [SPARK-17876] Write StructuredStreaming WAL to a stream instead of materializing all at once ## What changes were proposed in this pull request? The CompactibleFileStreamLog materializes the whole metadata log in memory as a String. This can cause issues when there are lots of files that are being committed, especially during a compaction batch. You may come across stacktraces that look like: ``` java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.lang.StringCoding.encode(StringCoding.java:350) at java.lang.String.getBytes(String.java:941) at org.apache.spark.sql.execution.streaming.FileStreamSinkLog.serialize(FileStreamSinkLog.scala:127) ``` The safer way is to write to an output stream so that we don't have to materialize a huge string. ## How was this patch tested? Existing unit tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/brkyvz/spark ser-to-stream Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15437.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15437 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
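The general shape of the change (illustrative names, not the actual CompactibleFileStreamLog API): serialize one entry at a time to the stream rather than concatenating everything into one String first, so peak memory is one entry, not the whole log.

```scala
import java.io.{BufferedOutputStream, OutputStream}
import java.nio.charset.StandardCharsets.UTF_8

// Write metadata entries line-by-line; a large compaction batch can no
// longer blow the heap on a single String.getBytes of the whole log.
def serialize(entries: Iterator[String], out: OutputStream): Unit = {
  val buffered = new BufferedOutputStream(out)
  entries.foreach { entry =>
    buffered.write(entry.getBytes(UTF_8))
    buffered.write('\n')
  }
  buffered.flush()
}
```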
[GitHub] spark issue #15375: [SPARK-17790][SPARKR] Support for parallelizing R data.f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15375 **[Test build #3324 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3324/consoleFull)** for PR 15375 at commit [`62ab47b`](https://github.com/apache/spark/commit/62ab47b016aeb42c0721b52c4c37d502db18c535). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15307: [SPARK-17731][SQL][STREAMING] Metrics for structu...
Github user koeninger commented on a diff in the pull request: https://github.com/apache/spark/pull/15307#discussion_r82857280 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetrics.scala --- @@ -0,0 +1,244 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.streaming + +import java.{util => ju} + +import scala.collection.mutable + +import com.codahale.metrics.{Gauge, MetricRegistry} + +import org.apache.spark.internal.Logging +import org.apache.spark.metrics.source.{Source => CodahaleSource} +import org.apache.spark.util.Clock + +/** + * Class that manages all the metrics related to a StreamingQuery. It does the following. + * - Calculates metrics (rates, latencies, etc.) based on information reported by StreamExecution. + * - Allows the current metric values to be queried + * - Serves some of the metrics through Codahale/DropWizard metrics + * + * @param sources Unique set of sources in a query + * @param triggerClock Clock used for triggering in StreamExecution + * @param codahaleSourceName Root name for all the Codahale metrics + */ +class StreamMetrics(sources: Set[Source], triggerClock: Clock, codahaleSourceName: String) + extends CodahaleSource with Logging { + + import StreamMetrics._ + + // Trigger infos + private val triggerStatus = new mutable.HashMap[String, String] + private val sourceTriggerStatus = new mutable.HashMap[Source, mutable.HashMap[String, String]] + + // Rate estimators for sources and sinks + private val inputRates = new mutable.HashMap[Source, RateCalculator] + private val processingRates = new mutable.HashMap[Source, RateCalculator] + + // Number of input rows in the current trigger + private val numInputRows = new mutable.HashMap[Source, Long] + private var numOutputRows: Option[Long] = None + private var currentTriggerStartTimestamp: Long = -1 + private var previousTriggerStartTimestamp: Long = -1 + private var latency: Option[Double] = None + + override val sourceName: String = codahaleSourceName + override val metricRegistry: MetricRegistry = new MetricRegistry + + // === Initialization === + + // Metric names should not have . 
in them, so that all the metrics of a query are identified + // together in Ganglia as a single metric group + registerGauge("inputRate-total", currentInputRate) + registerGauge("processingRate-total", () => currentProcessingRate) + registerGauge("latency", () => currentLatency().getOrElse(-1.0)) + + sources.foreach { s => +inputRates.put(s, new RateCalculator) +processingRates.put(s, new RateCalculator) +sourceTriggerStatus.put(s, new mutable.HashMap[String, String]) + +registerGauge(s"inputRate-${s.toString}", () => currentSourceInputRate(s)) +registerGauge(s"processingRate-${s.toString}", () => currentSourceProcessingRate(s)) + } + + // === Setter methods === + + def reportTriggerStarted(triggerId: Long): Unit = synchronized { +numInputRows.clear() +numOutputRows = None +triggerStatus.clear() +sourceTriggerStatus.values.foreach(_.clear()) + +reportTriggerStatus(TRIGGER_ID, triggerId) +sources.foreach(s => reportSourceTriggerStatus(s, TRIGGER_ID, triggerId)) +reportTriggerStatus(ACTIVE, true) +currentTriggerStartTimestamp = triggerClock.getTimeMillis() +reportTriggerStatus(START_TIMESTAMP, currentTriggerStartTimestamp) + } + + def reportTriggerStatus[T](key: String, value: T): Unit = synchronized { +triggerStatus.put(key, value.toString) + } + + def reportSourceTriggerStatus[T](source: Source, key: String, value: T): Unit = synchronized { +sourceTriggerStatus(source).put(key, value.toString) + } + + def reportNumInputRows(inputRows: Map[Source, Long]): Unit =
[GitHub] spark pull request #15307: [SPARK-17731][SQL][STREAMING] Metrics for structu...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/15307#discussion_r82856942 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala --- @@ -530,7 +692,7 @@ class StreamExecution( case object TERMINATED extends State } -object StreamExecution { +object StreamExecution extends Logging { --- End diff -- removed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15307: [SPARK-17731][SQL][STREAMING] Metrics for structu...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/15307#discussion_r82857011 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala --- @@ -516,12 +563,127 @@ class StreamExecution( """.stripMargin } - private def toInfo: StreamingQueryInfo = { -new StreamingQueryInfo( - this.name, - this.id, - this.sourceStatuses, - this.sinkStatus) + /** + * Report row metrics of the executed trigger + * @param triggerExecutionPlan Execution plan of the trigger + * @param triggerLogicalPlan Logical plan of the trigger, generated from the query logical plan + * @param sourceToDF Source to DataFrame returned by the source.getBatch in this trigger + */ + private def reportNumRows( + triggerExecutionPlan: SparkPlan, + triggerLogicalPlan: LogicalPlan, + sourceToDF: Map[Source, DataFrame]): Unit = { +// We want to associate execution plan leaves to sources that generate them, so that we match +// the their metrics (e.g. numOutputRows) to the sources. To do this we do the following. +// Consider the translation from the streaming logical plan to the final executed plan. +// +// streaming logical plan (with sources) <==> trigger's logical plan <==> executed plan +// +// 1. We keep track of streaming sources associated with each leaf in the trigger's logical plan +//- Each logical plan leaf will be associated with a single streaming source. +//- There can be multiple logical plan leaves associated a streaming source. --- End diff -- fixed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15307: [SPARK-17731][SQL][STREAMING] Metrics for structu...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/15307#discussion_r82856924 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala --- @@ -221,8 +247,15 @@ class StreamExecution( } } finally { state = TERMINATED + + // Update metrics and status + streamMetrics.stop() + sparkSession.sparkContext.env.metricsSystem.removeSource(streamMetrics) + updateStatus() + + // Notify others sparkSession.streams.notifyQueryTermination(StreamExecution.this) - postEvent(new QueryTerminated(this.toInfo, exception.map(_.cause).map(Utils.exceptionString))) + postEvent(new QueryTerminated(status, exception.map(_.cause).map(Utils.exceptionString))) --- End diff -- done. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15307: [SPARK-17731][SQL][STREAMING] Metrics for structu...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/15307#discussion_r82856802 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala --- @@ -105,11 +105,21 @@ class StreamExecution( var lastExecution: QueryExecution = null @volatile - var streamDeathCause: StreamingQueryException = null + private var streamDeathCause: StreamingQueryException = null /* Get the call site in the caller thread; will pass this into the micro batch thread */ private val callSite = Utils.getCallSite() + /** Metrics for this query */ + private val streamMetrics = +new StreamMetrics(uniqueSources.toSet, triggerClock, s"StructuredStreaming.$name") + + @volatile + private var currentStatus: StreamingQueryStatus = null + + @volatile + private var metricWarningLogged: Boolean = false --- End diff -- Added. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15307: [SPARK-17731][SQL][STREAMING] Metrics for structu...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/15307#discussion_r82856363 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetrics.scala --- @@ -0,0 +1,244 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.streaming + +import java.{util => ju} + +import scala.collection.mutable + +import com.codahale.metrics.{Gauge, MetricRegistry} + +import org.apache.spark.internal.Logging +import org.apache.spark.metrics.source.{Source => CodahaleSource} +import org.apache.spark.util.Clock + +/** + * Class that manages all the metrics related to a StreamingQuery. It does the following. + * - Calculates metrics (rates, latencies, etc.) based on information reported by StreamExecution. + * - Allows the current metric values to be queried + * - Serves some of the metrics through Codahale/DropWizard metrics + * + * @param sources Unique set of sources in a query + * @param triggerClock Clock used for triggering in StreamExecution + * @param codahaleSourceName Root name for all the Codahale metrics + */ +class StreamMetrics(sources: Set[Source], triggerClock: Clock, codahaleSourceName: String) + extends CodahaleSource with Logging { + + import StreamMetrics._ + + // Trigger infos + private val triggerStatus = new mutable.HashMap[String, String] + private val sourceTriggerStatus = new mutable.HashMap[Source, mutable.HashMap[String, String]] + + // Rate estimators for sources and sinks + private val inputRates = new mutable.HashMap[Source, RateCalculator] + private val processingRates = new mutable.HashMap[Source, RateCalculator] + + // Number of input rows in the current trigger + private val numInputRows = new mutable.HashMap[Source, Long] + private var numOutputRows: Option[Long] = None + private var currentTriggerStartTimestamp: Long = -1 + private var previousTriggerStartTimestamp: Long = -1 + private var latency: Option[Double] = None + + override val sourceName: String = codahaleSourceName + override val metricRegistry: MetricRegistry = new MetricRegistry + + // === Initialization === + + // Metric names should not have . 
in them, so that all the metrics of a query are identified + // together in Ganglia as a single metric group + registerGauge("inputRate-total", currentInputRate) + registerGauge("processingRate-total", () => currentProcessingRate) + registerGauge("latency", () => currentLatency().getOrElse(-1.0)) + + sources.foreach { s => +inputRates.put(s, new RateCalculator) +processingRates.put(s, new RateCalculator) +sourceTriggerStatus.put(s, new mutable.HashMap[String, String]) + +registerGauge(s"inputRate-${s.toString}", () => currentSourceInputRate(s)) +registerGauge(s"processingRate-${s.toString}", () => currentSourceProcessingRate(s)) + } + + // === Setter methods === + + def reportTriggerStarted(triggerId: Long): Unit = synchronized { +numInputRows.clear() +numOutputRows = None +triggerStatus.clear() +sourceTriggerStatus.values.foreach(_.clear()) + +reportTriggerStatus(TRIGGER_ID, triggerId) +sources.foreach(s => reportSourceTriggerStatus(s, TRIGGER_ID, triggerId)) +reportTriggerStatus(ACTIVE, true) +currentTriggerStartTimestamp = triggerClock.getTimeMillis() +reportTriggerStatus(START_TIMESTAMP, currentTriggerStartTimestamp) + } + + def reportTriggerStatus[T](key: String, value: T): Unit = synchronized { +triggerStatus.put(key, value.toString) + } + + def reportSourceTriggerStatus[T](source: Source, key: String, value: T): Unit = synchronized { +sourceTriggerStatus(source).put(key, value.toString) + } + + def reportNumInputRows(inputRows: Map[Source, Long]): Unit =
[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14690 I believe that using a method like `TableFileCatalog.filterPartitions` to build a new file catalog restricted to some pruned partitions is a sound approach; however, I'm starting to reconsider the implementation. Specifically, I'm thinking that having that method return a `ListingFileCatalog` is the wrong thing to do. It may be unnecessarily heavy-handed. For one thing, a `ListingFileCatalog` performs a file tree traversal right off the bat. However, the external catalog returns the locations of partitions as part of the `listPartitionsByFilter` call. I believe that should suffice for the purpose of building a query plan for metastore-backed tables and executing it. The current implementation works, but I'm going to explore the latter possibility as a more efficient implementation. If I can make it work without too much complexity, I'll probably just push it as a new commit to this PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
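A loose sketch of the alternative being described (names are placeholders, not Spark's catalog API): reuse the partition locations that `listPartitionsByFilter` already returned and list only those leaf directories, instead of traversing the whole table root up front.

```scala
import java.io.File

// Placeholder for the real catalog partition type.
case class CatalogPartition(location: String)

// List files only under the pruned partitions' directories.
def listPrunedFiles(pruned: Seq[CatalogPartition]): Seq[File] =
  pruned.flatMap { p =>
    Option(new File(p.location).listFiles()).toSeq.flatten.filter(_.isFile)
  }
```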
[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] collect() head() and show() for Co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/11336 **[Test build #66753 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66753/consoleFull)** for PR 11336 at commit [`8f906a2`](https://github.com/apache/spark/commit/8f906a2e8050c391355c3ddab811d990020728cb). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15307: [SPARK-17731][SQL][STREAMING] Metrics for structu...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/15307#discussion_r82855608 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala --- @@ -516,12 +563,127 @@ class StreamExecution( """.stripMargin } - private def toInfo: StreamingQueryInfo = { -new StreamingQueryInfo( - this.name, - this.id, - this.sourceStatuses, - this.sinkStatus) + /** + * Report row metrics of the executed trigger + * @param triggerExecutionPlan Execution plan of the trigger + * @param triggerLogicalPlan Logical plan of the trigger, generated from the query logical plan + * @param sourceToDF Source to DataFrame returned by the source.getBatch in this trigger + */ + private def reportNumRows( + triggerExecutionPlan: SparkPlan, + triggerLogicalPlan: LogicalPlan, + sourceToDF: Map[Source, DataFrame]): Unit = { +// We want to associate execution plan leaves to sources that generate them, so that we match +// the their metrics (e.g. numOutputRows) to the sources. To do this we do the following. +// Consider the translation from the streaming logical plan to the final executed plan. +// +// streaming logical plan (with sources) <==> trigger's logical plan <==> executed plan +// +// 1. We keep track of streaming sources associated with each leaf in the trigger's logical plan +//- Each logical plan leaf will be associated with a single streaming source. +//- There can be multiple logical plan leaves associated a streaming source. +//- There can be leaves not associated with any streaming source, because they were +// generated from a batch source (e.g. stream-batch joins) +// +// 2. Assuming that the executed plan has same number of leaves in the same order as that of +//the trigger logical plan, we associate executed plan leaves with corresponding +//streaming sources. +// +// 3. For each source, we sum the metrics of the associated execution plan leaves. +// +val logicalPlanLeafToSource = sourceToDF.flatMap { case (source, df) => + df.logicalPlan.collectLeaves().map { leaf => leaf -> source } +} +val allLogicalPlanLeaves = triggerLogicalPlan.collectLeaves() // includes non-streaming sources +val allExecPlanLeaves = triggerExecutionPlan.collectLeaves() +val sourceToNumInputRows: Map[Source, Long] = + if (allLogicalPlanLeaves.size == allExecPlanLeaves.size) { +val execLeafToSource = allLogicalPlanLeaves.zip(allExecPlanLeaves).flatMap { --- End diff -- See the condition in the previous line :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15307: [SPARK-17731][SQL][STREAMING] Metrics for structu...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/15307#discussion_r82855520 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala --- @@ -176,7 +184,9 @@ class StreamExecution( // Mark ACTIVE and then post the event. QueryStarted event is synchronously sent to listeners, // so must mark this as ACTIVE first. state = ACTIVE - postEvent(new QueryStarted(this.toInfo)) // Assumption: Does not throw exception. + sparkSession.sparkContext.env.metricsSystem.registerSource(streamMetrics) --- End diff -- This is not expensive. A lot of these sources are registered by default in the SparkContext, and the additional overhead is minimal. Also, the overhead depends solely on how frequently the Dropwizard metrics system polls for the values of rates, latencies, etc. Typically that is every tens of seconds. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
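For reference, the Dropwizard pattern in question looks roughly like this (the `currentInputRate` callback is a stand-in for the query's live counter): a `Gauge` is evaluated only when a reporter polls the registry, so registering one costs almost nothing between polls.

```scala
import com.codahale.metrics.{Gauge, MetricRegistry}

val registry = new MetricRegistry

def currentInputRate(): Double = 0.0 // stand-in for the real counter

// The closure runs only on poll (typically every tens of seconds),
// not on every streaming trigger.
registry.register("inputRate-total", new Gauge[Double] {
  override def getValue: Double = currentInputRate()
})
```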
[GitHub] spark pull request #15436: [SPARK-17875] [BUILD] Remove unneeded direct depe...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15436#discussion_r82854783 --- Diff: NOTICE --- @@ -162,7 +162,7 @@ Please visit the Netty web site for more information: * http://netty.io/ -Copyright 2011 The Netty Project +Copyright 2014 The Netty Project --- End diff -- The license changes are an overdue update of the Netty 4.x license from 3.x's version. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15436: [SPARK-17875] [BUILD] Remove unneeded direct depe...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15436#discussion_r82854900 --- Diff: dev/deps/spark-deps-hadoop-2.3 --- @@ -130,7 +130,6 @@ metrics-json-3.1.2.jar metrics-jvm-3.1.2.jar minlog-1.3.0.jar mx4j-3.0.2.jar -netty-3.8.0.Final.jar --- End diff -- I can't quite figure this out: Hadoop 2.2 and 2.6-2.7 do transitively depend on Netty 3.6.x. 2.3 and 2.4 do not. _shrug_ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15436: [SPARK-17875] [BUILD] Remove unneeded direct dependence ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15436 **[Test build #66752 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66752/consoleFull)** for PR 15436 at commit [`a5c5c31`](https://github.com/apache/spark/commit/a5c5c3146e702a5c6ac8a86648f58f44d13a95f2). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15436: [SPARK-17875] [BUILD] Remove unneeded direct depe...
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/15436 [SPARK-17875] [BUILD] Remove unneeded direct dependence on Netty 3.x ## What changes were proposed in this pull request? Remove unneeded direct dependency on Netty 3.x. I left the `dependencyManagement` entry because some Hadoop libs still use an older version of Netty 3, and I thought it would be weird if the transitive version we reference went backwards. (Note too that Flume declares a direct separate dependency in test scope on Netty 3.4.x) ## How was this patch tested? Existing tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/srowen/spark SPARK-17875 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15436.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15436 commit a5c5c3146e702a5c6ac8a86648f58f44d13a95f2 Author: Sean Owen Date: 2016-10-11T18:10:08Z Remove unneeded direct dependency on Netty 3.x --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15190: [SPARK-17620][SQL] Determine Serde by hive.default.filef...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/15190 @yhuai We will use the Parquet format in your example. We look at the `spark.sql.sources.default` configuration to decide on the format to use. Here is the output for your perusal:
```SQL
spark-sql> set spark.sql.hive.convertCTAS=true;
spark.sql.hive.convertCTAS true
Time taken: 3.309 seconds, Fetched 1 row(s)
spark-sql> set hive.default.fileformat=orc;
hive.default.fileformat orc
Time taken: 0.053 seconds, Fetched 1 row(s)
spark-sql> CREATE TABLE IF NOT EXISTS test select 1 from foo;
spark-sql> describe formatted test;
...
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
```
Now change `spark.sql.sources.default=orc`:
```SQL
spark-sql> set spark.sql.sources.default=orc;
spark.sql.sources.default orc
spark-sql> CREATE TABLE IF NOT EXISTS test2 select 1 from foo;
Time taken: 0.451 seconds
spark-sql> describe formatted test2;
...
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
```
Please let me know if you have any further questions. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15410: [SPARK-17843][Web UI] Indicate event logs pending for pr...
Github user vijoshi commented on the issue: https://github.com/apache/spark/pull/15410 @tgravescs - you're right - for newer logs that are generated, there could be a window of time (10 secs or whatever the user configures) where the new logs are not picked up for replay and the UI doesn't say anything about them. However, the issue we see is more with old completed apps. A little after history server startup, a user browses to the app list and has no idea why the older completed apps are missing (perhaps those that were visible just before the history server was restarted). Since the polling for logs/replay is scheduled immediately as part of history server startup (zero delay for the first round), this fix could help these cases. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/15422 > For example, in MR you have the ability to even set the percentage of bad records you want to tolerate (we dont have that in spark). I may be wrong, but in MR, I think bad records just mean that `map` or `reduce` throws an exception. It's not related to any IOException (including EOFException). (https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/Task.java#L1490) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14690 **[Test build #66751 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66751/consoleFull)** for PR 14690 at commit [`2762efd`](https://github.com/apache/spark/commit/2762efd925325a9ab732b34e6672b704dc514062).
[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14690 Ah cripes. I committed something I didn't want to. I'm rebasing again in a few...
[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15435 I'll try to take a look before too long. For now, I see there are no tests; could you please add some, using the summary tests for binary classification as a guide? Thanks!
[GitHub] spark pull request #15384: [SPARK-17346][SQL][Tests]Fix the flaky topic dele...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15384
[GitHub] spark issue #15384: [SPARK-17346][SQL][Tests]Fix the flaky topic deletion in...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/15384 Merging it to master and branch-2.0.
[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14690#discussion_r82850487

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -225,13 +225,16 @@ case class FileSourceScanExec(
   }

   // These metadata values make scan plans uniquely identifiable for equality checking.
-  override val metadata: Map[String, String] = Map(
-    "Format" -> relation.fileFormat.toString,
-    "ReadSchema" -> outputSchema.catalogString,
-    "Batched" -> supportsBatch.toString,
-    "PartitionFilters" -> partitionFilters.mkString("[", ", ", "]"),
-    "PushedFilters" -> dataFilters.mkString("[", ", ", "]"),
-    "InputPaths" -> relation.location.paths.mkString(", "))
+  override val metadata: Map[String, String] = {
+    def seqToString(seq: Seq[Any]) = seq.mkString("[", ", ", "]")
+    val location = relation.location
+    Map("Format" -> relation.fileFormat.toString,
+      "ReadSchema" -> outputSchema.catalogString,
+      "Batched" -> supportsBatch.toString,
+      "PartitionFilters" -> seqToString(partitionFilters),
+      "PushedFilters" -> seqToString(dataFilters),
+      "Location" -> (location.getClass.getSimpleName + seqToString(location.rootPaths)))

I'm going to pull this value out to a local `val`. The reason I don't want to make this part of the `BasicFileCatalog` type is that this format seems specific to this class's `metadata` method. For example, we're using the method-local `seqToString` helper to format the `rootPaths` list.

Regarding the partition count, this is actually a little trickier than I expected. I'm going to expand on that in a standalone PR comment after the next push.
[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15422 @mridulm For the scenario you're imagining, maybe the data is OK, sure. That doesn't mean it's true in all cases. Yeah, this is really a workaround for bad input, which you can handle to some degree at the user level. Other parts of Spark don't work this way. I'm neutral on whether this is a good idea at all, but would prefer consistency more than anything.
[GitHub] spark issue #13194: [SPARK-15402] [ML] [PySpark] PySpark ml.evaluation shoul...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/13194 Just one question, otherwise LGTM
[GitHub] spark pull request #13194: [SPARK-15402] [ML] [PySpark] PySpark ml.evaluatio...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/13194#discussion_r82849461

--- Diff: python/pyspark/ml/evaluation.py ---
@@ -21,7 +21,8 @@
 from pyspark.ml.wrapper import JavaParams
 from pyspark.ml.param import Param, Params, TypeConverters
 from pyspark.ml.param.shared import HasLabelCol, HasPredictionCol, HasRawPredictionCol
-from pyspark.ml.common import inherit_doc
+from pyspark.ml.util import JavaMLReadable, JavaMLWritable
+from pyspark.mllib.common import inherit_doc

Just wondering why the change to import from `mllib.common`?
[GitHub] spark pull request #15398: [SPARK-17647][SQL] Fix backslash escaping in 'LIK...
Github user jodersky commented on a diff in the pull request: https://github.com/apache/spark/pull/15398#discussion_r82849408

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala ---
@@ -25,26 +25,25 @@ object StringUtils {
   // replace the _ with .{1} exactly match 1 time of any character
   // replace the % with .*, match 0 or more times with any character
-  def escapeLikeRegex(v: String): String = {
-    if (!v.isEmpty) {
-      "(?s)" + (' ' +: v.init).zip(v).flatMap {
-        case (prev, '\\') => ""
-        case ('\\', c) =>
-          c match {
-            case '_' => "_"
-            case '%' => "%"
-            case _ => Pattern.quote("\\" + c)
-          }
-        case (prev, c) =>
-          c match {
-            case '_' => "."
-            case '%' => ".*"
-            case _ => Pattern.quote(Character.toString(c))
-          }
-      }.mkString
-    } else {
-      v
+  def escapeLikeRegex(str: String): String = {
+    val builder = new StringBuilder()
+    var escaping = false
+    for (next <- str) {
+      if (escaping) {
+        builder ++= Pattern.quote(Character.toString(next))

Every character after a backslash is quoted, so `\\a` becomes `\Q\\E\Qa\E`. Is this not the intended behaviour?
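A quick REPL check of the quoting behaviour described above (a sketch; the input is the LIKE pattern consisting of two backslashes followed by `a`):

```scala
import java.util.regex.Pattern

// Pattern.quote wraps its argument in \Q...\E so it matches literally.
val quotedBackslash = Pattern.quote("\\") // "\Q" + single backslash + "\E"
val quotedA = Pattern.quote("a")          // "\Qa\E"

// Concatenated this is \Q\\E\Qa\E -- exactly what the escaping loop
// produces for the input \\a, a regex matching the literal text \a.
println(quotedBackslash + quotedA)
```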
[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/15422 @marmbrus +1 on logging; that is definitely something that was probably missed here.