[GitHub] spark issue #14959: [SPARK-17387][PYSPARK] Creating SparkContext() from pyth...

2016-10-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14959
  
**[Test build #66758 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66758/consoleFull)** for PR 14959 at commit [`d692e71`](https://github.com/apache/spark/commit/d692e715e9c4da3a00f18c1693b60772492a80e4).





[GitHub] spark issue #14959: [SPARK-17387][PYSPARK] Creating SparkContext() from pyth...

2016-10-11 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/14959
  
retest this please





[GitHub] spark pull request #13194: [SPARK-15402] [ML] [PySpark] PySpark ml.evaluatio...

2016-10-11 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/13194#discussion_r82877590
  
--- Diff: python/pyspark/ml/evaluation.py ---
@@ -311,19 +330,25 @@ def setParams(self, predictionCol="prediction", labelCol="label",
 
 if __name__ == "__main__":
     import doctest
+    import tempfile
+    import pyspark.ml.evaluation
     from pyspark.sql import SparkSession
-    globs = globals().copy()
+    globs = pyspark.ml.evaluation.__dict__.copy()
     # The small batch size here ensures that we see multiple batches,
     # even in these small test examples:
     spark = SparkSession.builder\
         .master("local[2]")\
         .appName("ml.evaluation tests")\
         .getOrCreate()
-    sc = spark.sparkContext
-    globs['sc'] = sc
     globs['spark'] = spark
-    (failure_count, test_count) = doctest.testmod(
-        globs=globs, optionflags=doctest.ELLIPSIS)
-    spark.stop()
-    if failure_count:
-        exit(-1)
--- End diff --

should there still be an `exit(-1)` when there are failures? If not, then you can remove the returned variables from `doctest.testmod`.
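
For illustration, a minimal standalone sketch (not the PR's code; the Spark session setup is omitted) of keeping the failure exit while consuming `doctest.testmod`'s return value:

```python
import doctest
import sys

if __name__ == "__main__":
    # Run this module's doctests; keep the returned counts so CI can be
    # signalled with a non-zero exit code when any doctest fails.
    (failure_count, test_count) = doctest.testmod(optionflags=doctest.ELLIPSIS)
    if failure_count:
        sys.exit(-1)
```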





[GitHub] spark pull request #15338: [SPARK-11653][Deploy] Allow spark-daemon.sh to ru...

2016-10-11 Thread jodersky
Github user jodersky commented on a diff in the pull request:

https://github.com/apache/spark/pull/15338#discussion_r82866433
  
--- Diff: sbin/spark-daemon.sh ---
@@ -122,6 +123,35 @@ if [ "$SPARK_NICENESS" = "" ]; then
 export SPARK_NICENESS=0
 fi
 
+execute_command() {
+  command="$@"
+  if [ "$SPARK_NO_DAEMONIZE" != "" ]; then
--- End diff --

this only checks whether the variable is non-empty; setting it to the empty string will not work. E.g. `export SPARK_NO_DAEMONIZE=''` will not satisfy the condition.

Something like `if [ -n "${SPARK_NO_DAEMONIZE+set}" ];` will also handle the null case (see [this SO thread](http://stackoverflow.com/questions/3601515/how-to-check-if-a-variable-is-set-in-bash/3601734#3601734)). Another, more elegant, solution would be to use `[[ -v SPARK_NO_DAEMONIZE ]]`; however, that requires at least bash 4.2 and is probably not available on all systems that Spark runs on.
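
A small sketch of the distinction, using the variable name from this review (the script itself is illustrative only):

```bash
#!/usr/bin/env bash

is_no_daemonize_set() {
  # ${VAR+set} expands to "set" whenever VAR is set, even when set to ''.
  if [ -n "${SPARK_NO_DAEMONIZE+set}" ]; then
    echo "set (possibly empty)"
  else
    echo "unset"
  fi
}

is_no_daemonize_set            # prints: unset
export SPARK_NO_DAEMONIZE=''   # set, but empty
is_no_daemonize_set            # prints: set (possibly empty); [ -n "$SPARK_NO_DAEMONIZE" ] would miss this
```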





[GitHub] spark pull request #15338: [SPARK-11653][Deploy] Allow spark-daemon.sh to ru...

2016-10-11 Thread jodersky
Github user jodersky commented on a diff in the pull request:

https://github.com/apache/spark/pull/15338#discussion_r82874206
  
--- Diff: sbin/spark-daemon.sh ---
@@ -146,13 +176,11 @@ run_command() {
 
   case "$mode" in
     (class)
-      nohup nice -n "$SPARK_NICENESS" "${SPARK_HOME}"/bin/spark-class $command "$@" >> "$log" 2>&1 < /dev/null &
-      newpid="$!"
+      execute_command "nice -n \"$SPARK_NICENESS\" \"${SPARK_HOME}/bin/spark-class\" $command $@"
       ;;
 
     (submit)
-      nohup nice -n "$SPARK_NICENESS" "${SPARK_HOME}"/bin/spark-submit --class $command "$@" >> "$log" 2>&1 < /dev/null &
-      newpid="$!"
+      execute_command "nice -n \"$SPARK_NICENESS\" \"${SPARK_HOME}/bin/spark-submit\" --class $command $@"
--- End diff --

same as above





[GitHub] spark pull request #15338: [SPARK-11653][Deploy] Allow spark-daemon.sh to ru...

2016-10-11 Thread jodersky
Github user jodersky commented on a diff in the pull request:

https://github.com/apache/spark/pull/15338#discussion_r82874084
  
--- Diff: sbin/spark-daemon.sh ---
@@ -146,13 +176,11 @@ run_command() {
 
   case "$mode" in
     (class)
-      nohup nice -n "$SPARK_NICENESS" "${SPARK_HOME}"/bin/spark-class $command "$@" >> "$log" 2>&1 < /dev/null &
-      newpid="$!"
+      execute_command "nice -n \"$SPARK_NICENESS\" \"${SPARK_HOME}/bin/spark-class\" $command $@"
--- End diff --

the outer-most quotes aren't required without the eval.

`execute_command nice -n "$SPARK_NICENESS" "${SPARK_HOME}/bin/spark-class" $command $@` should do the trick





[GitHub] spark pull request #15338: [SPARK-11653][Deploy] Allow spark-daemon.sh to ru...

2016-10-11 Thread jodersky
Github user jodersky commented on a diff in the pull request:

https://github.com/apache/spark/pull/15338#discussion_r82875044
  
--- Diff: sbin/spark-daemon.sh ---
@@ -122,6 +123,35 @@ if [ "$SPARK_NICENESS" = "" ]; then
 export SPARK_NICENESS=0
 fi
 
+execute_command() {
+  command="$@"
--- End diff --

Spark scripts require bash as the interpreter, so we can make this a local variable.





[GitHub] spark pull request #15338: [SPARK-11653][Deploy] Allow spark-daemon.sh to ru...

2016-10-11 Thread jodersky
Github user jodersky commented on a diff in the pull request:

https://github.com/apache/spark/pull/15338#discussion_r82871978
  
--- Diff: sbin/spark-daemon.sh ---
@@ -122,6 +123,35 @@ if [ "$SPARK_NICENESS" = "" ]; then
 export SPARK_NICENESS=0
 fi
 
+execute_command() {
+  command="$@"
+  if [ "$SPARK_NO_DAEMONIZE" != "" ]; then
+  eval $command
--- End diff --

I don't think eval is required here





[GitHub] spark issue #15307: [SPARK-17731][SQL][STREAMING] Metrics for structured str...

2016-10-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15307
  
Build finished. Test FAILed.





[GitHub] spark pull request #15338: [SPARK-11653][Deploy] Allow spark-daemon.sh to ru...

2016-10-11 Thread jodersky
Github user jodersky commented on a diff in the pull request:

https://github.com/apache/spark/pull/15338#discussion_r82872476
  
--- Diff: sbin/spark-daemon.sh ---
@@ -122,6 +123,35 @@ if [ "$SPARK_NICENESS" = "" ]; then
 export SPARK_NICENESS=0
 fi
 
+execute_command() {
+  command="$@"
+  if [ "$SPARK_NO_DAEMONIZE" != "" ]; then
+  eval $command
+  else
+  eval "nohup $command >> \"$log\" 2>&1 < /dev/null &"
--- End diff --

same here
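
Putting the suggestions in this thread together, a sketch of an eval-free `execute_command` (assuming bash, and the `$log` variable from spark-daemon.sh; this is not the PR's final code). The arguments are kept as an array and re-expanded with `"${command[@]}"`, which preserves quoting without `eval`:

```bash
execute_command() {
  local command=("$@")  # bash-only: keep the argument words intact in a local array
  if [ -n "${SPARK_NO_DAEMONIZE+set}" ]; then
    "${command[@]}"
  else
    nohup "${command[@]}" >> "$log" 2>&1 < /dev/null &
    newpid="$!"
  fi
}

# Call sites would then pass separate words instead of one quoted string, e.g.:
#   execute_command nice -n "$SPARK_NICENESS" "${SPARK_HOME}/bin/spark-class" $command "$@"
```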





[GitHub] spark issue #15307: [SPARK-17731][SQL][STREAMING] Metrics for structured str...

2016-10-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15307
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66757/
Test FAILed.





[GitHub] spark issue #15307: [SPARK-17731][SQL][STREAMING] Metrics for structured str...

2016-10-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15307
  
**[Test build #66757 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66757/consoleFull)** for PR 15307 at commit [`dca9939`](https://github.com/apache/spark/commit/dca9939487e77bd739c08149df495b89e5898656).
* This patch **fails Python style tests**.
* This patch **does not merge cleanly**.
* This patch adds no public classes.





[GitHub] spark issue #15410: [SPARK-17843][Web UI] Indicate event logs pending for pr...

2016-10-11 Thread ajbozarth
Github user ajbozarth commented on the issue:

https://github.com/apache/spark/pull/15410
  
Just to raise an idea that would possibly mean less code change: would simply having a flag that causes a "currently processing applications" type message to display, without an actual count, work?

Overall I think this is a good addition though.





[GitHub] spark issue #15307: [SPARK-17731][SQL][STREAMING] Metrics for structured str...

2016-10-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15307
  
**[Test build #66757 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66757/consoleFull)** for PR 15307 at commit [`dca9939`](https://github.com/apache/spark/commit/dca9939487e77bd739c08149df495b89e5898656).





[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...

2016-10-11 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/9766#discussion_r82876529
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala ---
@@ -412,6 +419,63 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends
   //
 
   /**
+   * Register a Java UDF class
+   * @param name
+   * @param className
+   * @param returnType
+   */
+  def registerJava(name: String, className: String, returnType: DataType): Unit = {
--- End diff --

+1





[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...

2016-10-11 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/9766#discussion_r82876442
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala ---
@@ -412,6 +419,63 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends
   //
 
   /**
+   * Register a Java UDF class
+   * @param name
+   * @param className
+   * @param returnType
+   */
+  def registerJava(name: String, className: String, returnType: DataType): Unit = {
+
+    try {
+      // scalastyle:off classforname
--- End diff --

This style rule is here to prevent misuse.  Is there a reason we aren't 
using our utility functions?
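
Presumably this refers to Spark's own reflection helper; a short sketch of the alternative (assuming code inside the Spark codebase, where `org.apache.spark.util.Utils` is accessible, and a hypothetical class name):

```scala
import org.apache.spark.util.Utils

// Utils.classForName resolves the class against the appropriate classloader,
// which is what the classforname scalastyle rule steers callers toward.
val className = "org.example.MyUDF"  // hypothetical UDF class name
val udfClass: Class[_] = Utils.classForName(className)
```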





[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...

2016-10-11 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/9766#discussion_r82876419
  
--- Diff: python/pyspark/sql/context.py ---
@@ -202,6 +202,26 @@ def registerFunction(self, name, f, returnType=StringType()):
         """
         self.sparkSession.catalog.registerFunction(name, f, returnType)
 
+    @ignore_unicode_prefix
+    @since(2.1)
+    def registerJavaFunction(self, name, javaClassName, returnType=StringType()):
+        """Register a java UDF so it can be used in SQL statements.
+
+        In addition to a name and the function itself, the return type can be optionally specified.
+        When the return type is not given it defaults to a string and conversion will automatically

Where does this conversion happen? Are we sure that it works (given that there are no tests that I can see)?
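
For context, a hypothetical usage sketch of the method under review (the class name `com.example.SquareUDF` and the session setup are illustrative, not from the PR):

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import DoubleType

sc = SparkContext("local[2]", "registerJavaFunction example")
sqlContext = SQLContext(sc)

# The placeholder class would implement Spark's Java UDF interface on the JVM
# side; passing DoubleType() explicitly sidesteps the string default whose
# conversion behavior is questioned above.
sqlContext.registerJavaFunction("square", "com.example.SquareUDF", DoubleType())
sqlContext.sql("SELECT square(2.0)").show()
```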





[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...

2016-10-11 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/9766#discussion_r82876509
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala ---
@@ -17,9 +17,15 @@
 
 package org.apache.spark.sql
 
+
+import java.io.IOException
+import java.util.{List => JList, Map => JMap}
+
 import scala.reflect.runtime.universe.TypeTag
 import scala.util.Try
 
+import sun.reflect.generics.reflectiveObjects.ParameterizedTypeImpl
--- End diff --

Is this JVM specific?  What is this being used for?  Is there another way?





[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...

2016-10-11 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/14690#discussion_r82875191
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -225,13 +225,16 @@ case class FileSourceScanExec(
   }
 
   // These metadata values make scan plans uniquely identifiable for equality checking.
-  override val metadata: Map[String, String] = Map(
-    "Format" -> relation.fileFormat.toString,
-    "ReadSchema" -> outputSchema.catalogString,
-    "Batched" -> supportsBatch.toString,
-    "PartitionFilters" -> partitionFilters.mkString("[", ", ", "]"),
-    "PushedFilters" -> dataFilters.mkString("[", ", ", "]"),
-    "InputPaths" -> relation.location.paths.mkString(", "))
+  override val metadata: Map[String, String] = {
+    def seqToString(seq: Seq[Any]) = seq.mkString("[", ", ", "]")
+    val location = relation.location
+    Map("Format" -> relation.fileFormat.toString,
+      "ReadSchema" -> outputSchema.catalogString,
+      "Batched" -> supportsBatch.toString,
+      "PartitionFilters" -> seqToString(partitionFilters),
+      "PushedFilters" -> seqToString(dataFilters),
+      "Location" -> (location.getClass.getSimpleName + seqToString(location.rootPaths)))
--- End diff --

Is it different from location.getRootPaths.length? I guess that would only 
be meaningful for Hive-backed tables, but that seems ok.

Btw, you might want to take(10) on the root paths to avoid materializing a 
large string here before it gets truncated.
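
A self-contained sketch of that `take(10)` suggestion (the paths are made up; `seqToString` mirrors the helper in the diff):

```scala
object LocationMetadataSketch {
  private def seqToString(seq: Seq[Any]): String = seq.mkString("[", ", ", "]")

  def main(args: Array[String]): Unit = {
    val rootPaths = (1 to 10000).map(i => s"hdfs://nn/warehouse/tbl/part=$i")
    // Render at most ten root paths so the metadata string stays small even
    // for tables with thousands of partitions.
    println("InMemoryFileIndex" + seqToString(rootPaths.take(10)))
  }
}
```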





[GitHub] spark issue #15375: [SPARK-17790][SPARKR] Support for parallelizing R data.f...

2016-10-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15375
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66750/
Test FAILed.





[GitHub] spark issue #15425: [SPARK-17816] [Core] [Branch-2.0] Fix ConcurrentModifica...

2016-10-11 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/15425
  
LGTM. Merging to 2.0. Thanks!






[GitHub] spark issue #15375: [SPARK-17790][SPARKR] Support for parallelizing R data.f...

2016-10-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15375
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15375: [SPARK-17790][SPARKR] Support for parallelizing R data.f...

2016-10-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15375
  
**[Test build #66750 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66750/consoleFull)** for PR 15375 at commit [`836e874`](https://github.com/apache/spark/commit/836e8745c346c59f78958e10aec1c6f9537242b9).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] spark pull request #15431: [SPARK-15153] [ML] [SparkR] Fix SparkR spark.naiv...

2016-10-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15431





[GitHub] spark issue #15431: [SPARK-15153] [ML] [SparkR] Fix SparkR spark.naiveBayes ...

2016-10-11 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/15431
  
I'll go ahead and merge with master.  Thanks!





[GitHub] spark pull request #15307: [SPARK-17731][SQL][STREAMING] Metrics for structu...

2016-10-11 Thread brkyvz
Github user brkyvz commented on a diff in the pull request:

https://github.com/apache/spark/pull/15307#discussion_r82872469
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala ---
@@ -176,7 +184,9 @@ class StreamExecution(
       // Mark ACTIVE and then post the event. QueryStarted event is synchronously sent to listeners,
       // so must mark this as ACTIVE first.
       state = ACTIVE
-      postEvent(new QueryStarted(this.toInfo)) // Assumption: Does not throw exception.
+      sparkSession.sparkContext.env.metricsSystem.registerSource(streamMetrics)
+      updateStatus()
+      postEvent(new QueryStarted(currentStatus)) // Assumption: Does not throw exception.
--- End diff --

it's good to be consistent. Otherwise, if you later change the behavior of `status`, for example, this line would be left orphaned.





[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14690
  
**[Test build #66751 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66751/consoleFull)** for PR 14690 at commit [`2762efd`](https://github.com/apache/spark/commit/2762efd925325a9ab732b34e6672b704dc514062).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14690
  
Merged build finished. Test FAILed.





[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14690
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66751/
Test FAILed.





[GitHub] spark issue #15408: [SPARK-17839][CORE] Use Nio's directbuffer instead of Bu...

2016-10-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15408
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15408: [SPARK-17839][CORE] Use Nio's directbuffer instead of Bu...

2016-10-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15408
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66749/
Test FAILed.





[GitHub] spark issue #15408: [SPARK-17839][CORE] Use Nio's directbuffer instead of Bu...

2016-10-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15408
  
**[Test build #66749 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66749/consoleFull)** for PR 15408 at commit [`30173fa`](https://github.com/apache/spark/commit/30173facf79e03469291199807f84368a320e262).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-11 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15148#discussion_r82871238
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala ---
@@ -0,0 +1,146 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.Random
+
+import breeze.linalg.normalize
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.linalg.{BLAS, Vector, Vectors, VectorUDT}
+import org.apache.spark.ml.param.{DoubleParam, Params, ParamValidators}
+import org.apache.spark.ml.param.shared.HasSeed
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.sql.types.StructType
+
+/**
+ * :: Experimental ::
+ * Params for [[RandomProjection]].
+ */
+@Since("2.1.0")
+private[ml] trait RandomProjectionParams extends Params {
+
+  /**
+   * The length of each hash bucket, a larger bucket lowers the false negative rate.
+   *
+   * If input vectors are normalized, 1-10 times of pow(numRecords, -1/inputDim) would be a
+   * reasonable value
+   * @group param
+   */
+  @Since("2.1.0")
+  val bucketLength: DoubleParam = new DoubleParam(this, "bucketLength",
+    "the length of each hash bucket, a larger bucket lowers the false negative rate.",
+    ParamValidators.gt(0))
+
+  /** @group getParam */
+  @Since("2.1.0")
+  final def getBucketLength: Double = $(bucketLength)
+}
+
+/**
+ * :: Experimental ::
+ * Model produced by [[RandomProjection]]
+ * @param randUnitVectors An array of random unit vectors. Each vector represents a hash function.
+ */
+@Experimental
+@Since("2.1.0")
+class RandomProjectionModel private[ml] (
+    override val uid: String,
+    val randUnitVectors: Array[Vector])
--- End diff --

I wanted to use `Matrix`, but it turns out that `Array[Vector]` makes normalizing each vector and taking the floor of the values much cleaner. Is there any reason we should use `Matrix` here?





[GitHub] spark pull request #15335: [SPARK-17769][Core][Scheduler]Some FetchFailure r...

2016-10-11 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/15335#discussion_r82869819
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -1255,27 +1255,46 @@ class DAGScheduler(
               s"longer running")
           }
 
-          if (disallowStageRetryForTest) {
-            abortStage(failedStage, "Fetch failure will not retry stage due to testing config",
-              None)
-          } else if (failedStage.failedOnFetchAndShouldAbort(task.stageAttemptId)) {
-            abortStage(failedStage, s"$failedStage (${failedStage.name}) " +
-              s"has failed the maximum allowable number of " +
-              s"times: ${Stage.MAX_CONSECUTIVE_FETCH_FAILURES}. " +
-              s"Most recent failure reason: ${failureMessage}", None)
-          } else {
-            if (failedStages.isEmpty) {
-              // Don't schedule an event to resubmit failed stages if failed isn't empty, because
-              // in that case the event will already have been scheduled.
-              // TODO: Cancel running tasks in the stage
-              logInfo(s"Resubmitting $mapStage (${mapStage.name}) and " +
-                s"$failedStage (${failedStage.name}) due to fetch failure")
-              messageScheduler.schedule(new Runnable {
-                override def run(): Unit = eventProcessLoop.post(ResubmitFailedStages)
-              }, DAGScheduler.RESUBMIT_TIMEOUT, TimeUnit.MILLISECONDS)
+          val shouldAbortStage =
+            failedStage.failedOnFetchAndShouldAbort(task.stageAttemptId) ||
+            disallowStageRetryForTest
+
+          if (shouldAbortStage) {
+            val abortMessage = if (disallowStageRetryForTest) {
+              "Fetch failure will not retry stage due to testing config"
+            } else {
+              s"""$failedStage (${failedStage.name})
+                 |has failed the maximum allowable number of
+                 |times: ${Stage.MAX_CONSECUTIVE_FETCH_FAILURES}.
+                 |Most recent failure reason: $failureMessage""".stripMargin.replaceAll("\n", " ")
             }
+            abortStage(failedStage, abortMessage, None)
+          } else { // update failedStages and make sure a ResubmitFailedStages event is enqueued
+            // TODO: Cancel running tasks in the failed stage -- cf. SPARK-17064
+            val noResubmitEnqueued = !failedStages.contains(failedStage)
             failedStages += failedStage
             failedStages += mapStage
+            if (noResubmitEnqueued) {
+              // We expect one executor failure to trigger many FetchFailures in rapid succession,
+              // but all of those task failures can typically be handled by a single resubmission of
+              // the failed stage.  We avoid flooding the scheduler's event queue with resubmit
+              // messages by checking whether a resubmit is already in the event queue for the
+              // failed stage.  If there is already a resubmit enqueued for a different failed
+              // stage, that event would also be sufficient to handle the current failed stage, but
+              // producing a resubmit for each failed stage makes debugging and logging a little
+              // simpler while not producing an overwhelming number of scheduler events.
+              logInfo(
+                s"Resubmitting $mapStage (${mapStage.name}) and " +
+                s"$failedStage (${failedStage.name}) due to fetch failure"
+              )
+              messageScheduler.schedule(
--- End diff --

1. I don't like "Periodically" in your suggested comment, since this is a 
one-shot action after a delay of RESUBMIT_TIMEOUT milliseconds.

2. I agree that this delay-before-resubmit logic is suspect.  If we are 
both thinking correctly that a 200 ms delay on top of the time to re-run the 
`mapStage` is all but inconsequential, then removing it in this PR would be 
fine.  If there are unanticipated consequences, though, I'd prefer to have that 
change in a separate PR.  





[GitHub] spark pull request #15382: [SPARK-17810] [SQL] Default spark.sql.warehouse.d...

2016-10-11 Thread avulanov
Github user avulanov commented on a diff in the pull request:

https://github.com/apache/spark/pull/15382#discussion_r82869165
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -757,7 +758,10 @@ private[sql] class SQLConf extends Serializable with CatalystConf with Logging {
 
   def variableSubstituteDepth: Int = getConf(VARIABLE_SUBSTITUTE_DEPTH)
 
-  def warehousePath: String = new Path(getConf(WAREHOUSE_PATH)).toString
+  def warehousePath: String = {
+    val path = new Path(getConf(WAREHOUSE_PATH))
+    FileSystem.get(path.toUri, new Configuration()).makeQualified(path).toString
--- End diff --

You [mentioned](https://github.com/apache/spark/pull/13868#discussion_r82154809) that the original issue is as follows: _"...the usages of the new makeQualifiedPath are a bit wrong in that they explicitly resolve the path against the Hadoop file system, which can be HDFS."_ Should we rather look into the code that does `makeQualifiedPath` on `warehousePath` given the Hadoop FS configuration? The fix would be to have a special case for paths that do not have a scheme. Actually, could you give a link to that code? I could not find it right away.





[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...

2016-10-11 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15421
  
It might not be an R-specific issue. I am trying to create a test case on the Scala side in SQLUtilsSuite.scala.





[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...

2016-10-11 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15421
  
I think we should find out the root cause of the negative length of the "NA" field. Yesterday I debugged the R side and have not found the reason yet.





[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...

2016-10-11 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15421
  
New test on Mac:

> df <- data.frame(Date = as.Date(c(rep("2016-01-10", 10), "NA", "NA")), id = 1:12)
>
> dim(createDataFrame(df))
16/10/11 12:10:30 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.IllegalArgumentException: Invalid type N
at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:86)
at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:61)
at org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:161)
at org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:160)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.api.r.SQLUtils$.bytesToRow(SQLUtils.scala:160)
at org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
at org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)





[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

2016-10-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13440
  
**[Test build #66756 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66756/consoleFull)** for PR 13440 at commit [`83f5e83`](https://github.com/apache/spark/commit/83f5e83fb87407bdd7dc8d740fba6fb30d1da3aa).





[GitHub] spark issue #15410: [SPARK-17843][Web UI] Indicate event logs pending for pr...

2016-10-11 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/15410
  
ah, yeah, startup would definitely be a good case for this, and like I mentioned it's better than nothing, so I'm ok with the concept.

I wonder, for the other use case where it hasn't looked in ~10 seconds, if it would be clearer to users if we put a little string at the top like "last updated time XX:XX:XX" or "app list current as of XX:XX:XX".





[GitHub] spark pull request #15421: [SPARK-17811] SparkR cannot parallelize data.fram...

2016-10-11 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15421#discussion_r82865118
  
--- Diff: core/src/main/scala/org/apache/spark/api/r/SerDe.scala ---
@@ -125,15 +125,24 @@ private[spark] object SerDe {
   }
 
   def readDate(in: DataInputStream): Date = {
-Date.valueOf(readString(in))
+val inStr = readString(in)
--- End diff --

Also seen on Mac. I am building with your new patch to test again on Mac.





[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...

2016-10-11 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15421
  
MacBook Pro (Retina, 15-inch, Mid 2015): this is the machine that I tested the patch on.





[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...

2016-10-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15421
  
**[Test build #66755 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66755/consoleFull)** for PR 15421 at commit [`59827a1`](https://github.com/apache/spark/commit/59827a19db93604120dc7229f6ed82777b4cd354).





[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...

2016-10-11 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15421
  
@falaki I saw the exception on Mac too, but I haven't found the root cause of the negative length in the input stream. Catching the exception will solve the problem. Do you want to explore the reason for the negative index?





[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...

2016-10-11 Thread falaki
Github user falaki commented on the issue:

https://github.com/apache/spark/pull/15421
  
@wangmiao1981 thanks for testing on Windows. I added a check for this. Would you please try again and let me know? Unfortunately, I don't have access to a Windows box for testing.





[GitHub] spark pull request #15421: [SPARK-17811] SparkR cannot parallelize data.fram...

2016-10-11 Thread falaki
Github user falaki commented on a diff in the pull request:

https://github.com/apache/spark/pull/15421#discussion_r82864117
  
--- Diff: core/src/main/scala/org/apache/spark/api/r/SerDe.scala ---
@@ -125,15 +125,24 @@ private[spark] object SerDe {
   }
 
   def readDate(in: DataInputStream): Date = {
-Date.valueOf(readString(in))
+val inStr = readString(in)
--- End diff --

That is only on Windows, right? I added a catch for it.
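
One plausible shape for that guard in `SerDe.readDate`, as a sketch only: the diff above is cut off before the check itself, and treating R's literal "NA" as a null date is an assumption here.

```scala
import java.io.DataInputStream
import java.sql.Date

// Sketch: read the serialized string first and guard against R sending the
// literal "NA" before handing it to Date.valueOf (readString is the existing
// SerDe helper).
def readDate(in: DataInputStream): Date = {
  val inStr = readString(in)
  if (inStr == "NA") null else Date.valueOf(inStr)
}
```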





[GitHub] spark pull request #15335: [SPARK-17769][Core][Scheduler]Some FetchFailure r...

2016-10-11 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/15335#discussion_r82864060
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -1255,27 +1255,46 @@ class DAGScheduler(
               s"longer running")
           }
 
-          if (disallowStageRetryForTest) {
-            abortStage(failedStage, "Fetch failure will not retry stage due to testing config",
-              None)
-          } else if (failedStage.failedOnFetchAndShouldAbort(task.stageAttemptId)) {
-            abortStage(failedStage, s"$failedStage (${failedStage.name}) " +
-              s"has failed the maximum allowable number of " +
-              s"times: ${Stage.MAX_CONSECUTIVE_FETCH_FAILURES}. " +
-              s"Most recent failure reason: ${failureMessage}", None)
-          } else {
-            if (failedStages.isEmpty) {
-              // Don't schedule an event to resubmit failed stages if failed isn't empty, because
-              // in that case the event will already have been scheduled.
-              // TODO: Cancel running tasks in the stage
-              logInfo(s"Resubmitting $mapStage (${mapStage.name}) and " +
-                s"$failedStage (${failedStage.name}) due to fetch failure")
-              messageScheduler.schedule(new Runnable {
-                override def run(): Unit = eventProcessLoop.post(ResubmitFailedStages)
-              }, DAGScheduler.RESUBMIT_TIMEOUT, TimeUnit.MILLISECONDS)
+          val shouldAbortStage =
+            failedStage.failedOnFetchAndShouldAbort(task.stageAttemptId) ||
+            disallowStageRetryForTest
+
+          if (shouldAbortStage) {
+            val abortMessage = if (disallowStageRetryForTest) {
+              "Fetch failure will not retry stage due to testing config"
+            } else {
+              s"""$failedStage (${failedStage.name})
+                 |has failed the maximum allowable number of
+                 |times: ${Stage.MAX_CONSECUTIVE_FETCH_FAILURES}.
+                 |Most recent failure reason: $failureMessage""".stripMargin.replaceAll("\n", " ")
             }
+            abortStage(failedStage, abortMessage, None)
+          } else { // update failedStages and make sure a ResubmitFailedStages event is enqueued
+            // TODO: Cancel running tasks in the failed stage -- cf. SPARK-17064
+            val noResubmitEnqueued = !failedStages.contains(failedStage)
--- End diff --

Right here is the only place we put anything into `failedStages`, so 
`failedStage` and `mapStage` always go in as pairs.  The only places where we 
remove things from `failedStages` are `resubmitFailedStages` and  
`DAGScheduler#cleanupStateForJobAndIndependentStages#removeStage`.  We clear 
`failedStages` in `resubmitFailedStages`, so the only place where `failedStage` 
and `mapStage` could get unpaired in `failedStages` is in 
`cleanupStateForJobAndIndependentStages#removeStage`.  That would happen if the 
number of Jobs that use `failedStage` and `mapStage` is unequal. If I'm thinking correctly, that could only happen if `mapStage` is used by more Jobs than `failedStage` is. In that case, cleaning up the last Job that
uses `failedStage` would remove `failedStage` from `failedStages` while 
`mapStage` would remain.

To fall into your proposed `|| !failedStages.contains(mapStage)` branch, 
another `failedStage` needing `mapStage`, this time coming from one of the 
remaining Jobs using `mapStage`, would need to fail.  If that is the case, then 
we still want to log the failure of the new `failedStage`, so I don't think we 
want `|| !failedStages.contains(mapStage)` -- without it, we'll get a duplicate 
of `mapStage` added to `failedStages`, but that's no big deal. 
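
To make the pairing invariant concrete, here is a minimal, self-contained sketch of the enqueue logic being discussed (a toy model with stage ids standing in for `Stage` objects, not the actual DAGScheduler code):

```scala
import scala.collection.mutable

object FailedStagesSketch {
  // failedStages is a set, so the redundant re-add of mapStage is a no-op.
  val failedStages = mutable.HashSet[Int]()

  def onFetchFailure(failedStage: Int, mapStage: Int): Unit = {
    // Enqueue a resubmit only if this failedStage is not already pending.
    val noResubmitEnqueued = !failedStages.contains(failedStage)
    failedStages += failedStage
    failedStages += mapStage // always inserted as a pair with failedStage
    if (noResubmitEnqueued) {
      println(s"enqueue ResubmitFailedStages for stages $failedStage and $mapStage")
    }
  }
}
```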





[GitHub] spark pull request #15421: [SPARK-17811] SparkR cannot parallelize data.fram...

2016-10-11 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15421#discussion_r82863921
  
--- Diff: core/src/main/scala/org/apache/spark/api/r/SerDe.scala ---
@@ -125,15 +125,24 @@ private[spark] object SerDe {
   }
 
   def readDate(in: DataInputStream): Date = {
-    Date.valueOf(readString(in))
+    val inStr = readString(in)
--- End diff --

readString eventually reads a negative length for the NA value, which causes the failure.
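
For illustration, a minimal sketch (a hypothetical helper mirroring the shape of `readStringBytes`, not the actual SerDe code) of why a negative length produces the `NegativeArraySizeException` seen in the reported stack traces:

```scala
import java.io.{ByteArrayInputStream, DataInputStream}

object NegativeLengthSketch {
  def readStringBytes(in: DataInputStream, len: Int): String = {
    // Allocating an array with a negative size throws NegativeArraySizeException.
    val bytes = new Array[Byte](len)
    in.readFully(bytes)
    new String(bytes, "UTF-8")
  }

  def main(args: Array[String]): Unit = {
    val in = new DataInputStream(new ByteArrayInputStream(new Array[Byte](0)))
    readStringBytes(in, -1) // a serialized NA date can yield a negative length
  }
}
```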





[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-11 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/14690
  
>> Finally, this would require us to read the schema files. That's 
something I'm trying to avoid in this patch.

> Not sure what you mean here, but the parquet change should be execution 
time only. I'll submit a pr here for that.

Okay. I look forward to seeing that.





[GitHub] spark issue #15431: [SPARK-15153] [ML] [SparkR] Fix SparkR spark.naiveBayes ...

2016-10-11 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/15431
  
LGTM2
Thanks!
Is it fine with you if this just gets fixed in master, not branch-2.0 (since the other PR is not in branch-2.0, as it adds a new public API)?





[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...

2016-10-11 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15421
  
@falaki I applied your fix on top of a clean build. I still see the following error:

> df <- data.frame(Date = as.Date(c(rep("2016-01-10", 10), "NA", "NA")), id = 1:12)
> 
> dim(createDataFrame(df))
16/10/11 11:51:00 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NegativeArraySizeException
at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:110)
at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:119)
at org.apache.spark.api.r.SerDe$.readDate(SerDe.scala:128)
at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:77)
at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:61)
at org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:161)
at org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:160)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.api.r.SQLUtils$.bytesToRow(SQLUtils.scala:160)
at org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
at org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I am still debugging. It seems the source on the R side has some issue.





[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] collect() head() and show() for Co...

2016-10-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/11336
  
Merged build finished. Test FAILed.





[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] collect() head() and show() for Co...

2016-10-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/11336
  
**[Test build #66753 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66753/consoleFull)** for PR 11336 at commit [`8f906a2`](https://github.com/apache/spark/commit/8f906a2e8050c391355c3ddab811d990020728cb).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] collect() head() and show() for Co...

2016-10-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/11336
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66753/
Test FAILed.





[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...

2016-10-11 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/15421
  
Hmm, still the same error in the new test case on AppVeyor:
```
Failed -------------------------------------------------------------
1. Error: SPARK-17811: can create DataFrame containing NA as date and time (@test_sparkSQL.R#388)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 105.0 failed 1 times, most recent failure: Lost task 0.0 in stage 105.0 (TID 109, localhost): java.lang.NegativeArraySizeException
at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:110)
at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:119)
at org.apache.spark.api.r.SerDe$.readDate(SerDe.scala:128)
```





[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-11 Thread ericl
Github user ericl commented on the issue:

https://github.com/apache/spark/pull/14690
  
> For one thing, a ListingFileCatalog performs a file tree traversal right 
off the bat. However, the external catalog returns the locations of partitions 
as part of the listPartitionsByFilter call. I believe that should suffice for 
the purpose of building a query plan for metastore-backed tables and executing 
it.

You'd have to re-implement a large portion of the parallel traversal logic here, right? I think we should keep this PR minimal and leave that for future work. I am also thinking of adding a per-directory file listing cache as a follow-up to avoid performance regressions, which would likely involve refactoring this path anyway.

>I would be wary of amending our data sources to support case-insensitive 
field resolution. For one thing, strictly speaking it can lead to ambiguity in 
schema resolution. In the—potential but unlikely—event that a 
(case-sensitive) data source schema has two distinct fields x1 and x2 such that 
x1.toLowerCase == x2.toLowerCase we're going to get undefined behavior.

> For another, for case-sensitive data sources this adds code complexity in 
their implementation.

I do agree this might be an issue with other data sources. For Parquet, though, I talked with @liancheng and we don't think there are any issues with supporting case-insensitive field resolution. Given that, I think we can also leave this for future work when we add datasource table support. It might also be that we need to add back something like 
https://github.com/apache/spark/pull/14750
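
To make the resolution-ambiguity concern quoted above concrete, a tiny standalone sketch (field names made up):

```scala
object CaseInsensitiveResolutionSketch extends App {
  // Two fields that are distinct under case-sensitive resolution...
  val sourceFields = Seq("eventTime", "EventTime")

  // ...but both match the same name once resolution ignores case.
  def resolve(name: String): Seq[String] =
    sourceFields.filter(_.equalsIgnoreCase(name))

  // Prints List(eventTime, EventTime): there is no principled way to pick
  // one, which is the "undefined behavior" referred to above.
  println(resolve("eventtime"))
}
```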

> Finally, this would require us to read the schema files. That's something 
I'm trying to avoid in this patch.

Not sure what you mean here, but the parquet change should be execution 
time only. I'll submit a pr here for that.





[GitHub] spark issue #15431: [SPARK-15153] [ML] [SparkR] Fix SparkR spark.naiveBayes ...

2016-10-11 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/15431
  
LGTM





[GitHub] spark pull request #15389: [SPARK-17817][PySpark] PySpark RDD Repartitioning...

2016-10-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15389





[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...

2016-10-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15421
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...

2016-10-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15421
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66746/
Test PASSed.





[GitHub] spark issue #15295: [SPARK-17720][SQL] introduce static SQL conf

2016-10-11 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15295
  
LGTM, except for the one comment on naming.






[GitHub] spark pull request #15295: [SPARK-17720][SQL] introduce static SQL conf

2016-10-11 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15295#discussion_r82860005
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RuntimeConfig.scala ---
@@ -132,4 +136,9 @@ class RuntimeConfig private[sql](sqlConf: SQLConf = new SQLConf) {
     sqlConf.contains(key)
   }
 
+  private def assertNotGlobalSQLConf(key: String): Unit = {
--- End diff --

rename to assertNonStaticConf





[GitHub] spark pull request #15295: [SPARK-17720][SQL] introduce static SQL conf

2016-10-11 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15295#discussion_r82859976
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RuntimeConfig.scala ---
@@ -36,6 +37,7 @@ class RuntimeConfig private[sql](sqlConf: SQLConf = new SQLConf) {
    * @since 2.0.0
    */
   def set(key: String, value: String): Unit = {
+    assertNotGlobalSQLConf(key)
--- End diff --

should be assertNonStaticConf here





[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...

2016-10-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15421
  
**[Test build #66746 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66746/consoleFull)** for PR 15421 at commit [`82ec5c8`](https://github.com/apache/spark/commit/82ec5c81e0c0e48bfb6008dcdeef544457cfc014).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14719: [SPARK-17154][SQL] Wrong result can be returned or Analy...

2016-10-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14719
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66747/
Test PASSed.





[GitHub] spark issue #14719: [SPARK-17154][SQL] Wrong result can be returned or Analy...

2016-10-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14719
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14719: [SPARK-17154][SQL] Wrong result can be returned or Analy...

2016-10-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14719
  
**[Test build #66747 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66747/consoleFull)** for PR 14719 at commit [`9ad2c85`](https://github.com/apache/spark/commit/9ad2c85241460aa878867a775cd13aca14e72324).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15432: [SPARK-17854][SQL] rand/randn allows null as input seed

2016-10-11 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15432
  
Hm, maybe we should just cast any NullType input into some concrete type defined by an ExpectsInputTypes expression?
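
A rough sketch of that coercion idea (simplified standalone types, not Catalyst's actual classes or rule):

```scala
object NullTypeCoercionSketch extends App {
  sealed trait DataType
  case object NullType extends DataType
  case object LongType extends DataType

  final case class Expr(name: String, dataType: DataType)

  // If an argument is NullType, wrap it in a cast to the expected input type.
  def coerce(arg: Expr, expected: DataType): Expr =
    if (arg.dataType == NullType) Expr(s"cast(${arg.name})", expected) else arg

  // rand(NULL)'s seed would be coerced to the LongType the expression expects.
  println(coerce(Expr("NULL", NullType), LongType))
}
```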






[GitHub] spark issue #15437: [SPARK-17876] Write StructuredStreaming WAL to a stream ...

2016-10-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15437
  
**[Test build #66754 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66754/consoleFull)** for PR 15437 at commit [`4d50be5`](https://github.com/apache/spark/commit/4d50be565841bdb6d647c75e36168151e1e8a621).





[GitHub] spark pull request #15307: [SPARK-17731][SQL][STREAMING] Metrics for structu...

2016-10-11 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/15307#discussion_r82858522
  
--- Diff: python/pyspark/sql/streaming.py ---
@@ -189,6 +189,282 @@ def resetTerminated(self):
         self._jsqm.resetTerminated()
 
 
+class StreamingQueryStatus(object):
+    """A class used to report information about the progress of a StreamingQuery.
+
+    .. note:: Experimental
+
+    .. versionadded:: 2.1
+    """
+
+    def __init__(self, jsqs):
+        self._jsqs = jsqs
+
+    def __str__(self):
+        """
+        Pretty string of this query status.
+
+        >>> print(sqs)
+        StreamingQueryStatus:
+            Query name: query
+            Query id: 1
+            Status timestamp: 123
+            Input rate: 1.0 rows/sec
+            Processing rate 2.0 rows/sec
+            Latency: 345.0 ms
+            Trigger status:
+                key: value
+            Source statuses [1 source]:
+                Source 1:MySource1
+                    Available offset: #0
+                    Input rate: 4.0 rows/sec
+                    Processing rate: 5.0 rows/sec
+                    Trigger status:
+                        key: value
+            Sink status: MySink
+                Committed offsets: [#1, -]
+        """
+        return self._jsqs.toString()
+
+    @property
+    @ignore_unicode_prefix
+    @since(2.1)
+    def name(self):
+        """
+        Name of the query. This name is unique across all active queries.
+
+        >>> sqs.name
+        u'query'
+        """
+        return self._jsqs.name()
+
+    @property
+    @since(2.1)
+    def id(self):
+        """
+        Id of the query. This id is unique across all queries that have been started in
+        the current process.
+
+        >>> int(sqs.id)
+        1
+        """
+        return self._jsqs.id()
+
+    @property
+    @since(2.1)
+    def timestamp(self):
+        """
+        Timestamp (ms) of when this query was generated.
+
+        >>> int(sqs.timestamp)
+        123
+        """
+        return self._jsqs.timestamp()
+
+    @property
+    @since(2.1)
+    def inputRate(self):
+        """
+        Current rate (rows/sec) at which data is being generated by all the sources.
+
+        >>> sqs.inputRate
+        1.0
+        """
+        return self._jsqs.inputRate()
+
+    @property
+    @since(2.1)
+    def processingRate(self):
+        """
+        Current rate (rows/sec) at which the query is processing data from all the sources.
+
+        >>> sqs.processingRate
+        2.0
+        """
+        return self._jsqs.processingRate()
+
+    @property
+    @since(2.1)
+    def latency(self):
+        """
+        Current average latency between the data being available in source and the sink
+        writing the corresponding output.
+
+        >>> sqs.latency
+        345.0
+        """
+        if (self._jsqs.latency().nonEmpty()):
+            return self._jsqs.latency().get()
+        else:
+            return None
+
+    @property
+    @since(2.1)
+    def sourceStatuses(self):
+        """
+        Current statuses of the sources.
+
+        >>> len(sqs.sourceStatuses)
+        1
+        >>> sqs.sourceStatuses[0].description
+        u'MySource1'
+        """
+        return [SourceStatus(ss) for ss in self._jsqs.sourceStatuses()]
+
+    @property
+    @since(2.1)
+    def sinkStatus(self):
+        """
+        Current status of the sink.
+
+        >>> sqs.sinkStatus.description
+        u'MySink'
+        """
+        return SinkStatus(self._jsqs.sinkStatus())
+
+    @property
+    @since(2.1)
+    def triggerStatus(self):
+        """
+        Low-level detailed status of the last completed/currently active trigger.
+
+        >>> sqs.triggerStatus
+        {u'key': u'value'}
--- End diff --

Ah right, this is also shown as an example in the docs.





[GitHub] spark pull request #15437: [SPARK-17876] Write StructuredStreaming WAL to a ...

2016-10-11 Thread brkyvz
GitHub user brkyvz opened a pull request:

https://github.com/apache/spark/pull/15437

[SPARK-17876] Write StructuredStreaming WAL to a stream instead of 
materializing all at once

## What changes were proposed in this pull request?

The CompactibleFileStreamLog materializes the whole metadata log in memory as a String. This can cause issues when lots of files are being committed, especially during a compaction batch.
You may come across stack traces that look like:
```
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.lang.StringCoding.encode(StringCoding.java:350)
at java.lang.String.getBytes(String.java:941)
at org.apache.spark.sql.execution.streaming.FileStreamSinkLog.serialize(FileStreamSinkLog.scala:127)
```
The safer way is to write to an output stream so that we don't have to 
materialize a huge string.
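
As a minimal sketch of that idea (hypothetical signature, not the actual `CompactibleFileStreamLog` API), each entry is encoded and written independently, so only one entry is in memory at a time:

```scala
import java.io.OutputStream
import java.nio.charset.StandardCharsets.UTF_8

object StreamingSerializationSketch {
  def serializeTo(entries: Iterator[String], out: OutputStream): Unit = {
    entries.foreach { entry =>
      out.write(entry.getBytes(UTF_8)) // no log-sized String is ever built
      out.write('\n')
    }
  }
}
```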

## How was this patch tested?

Existing unit tests



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/brkyvz/spark ser-to-stream

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15437.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15437









[GitHub] spark issue #15375: [SPARK-17790][SPARKR] Support for parallelizing R data.f...

2016-10-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15375
  
**[Test build #3324 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3324/consoleFull)** for PR 15375 at commit [`62ab47b`](https://github.com/apache/spark/commit/62ab47b016aeb42c0721b52c4c37d502db18c535).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #15307: [SPARK-17731][SQL][STREAMING] Metrics for structu...

2016-10-11 Thread koeninger
Github user koeninger commented on a diff in the pull request:

https://github.com/apache/spark/pull/15307#discussion_r82857280
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetrics.scala ---
@@ -0,0 +1,244 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.streaming
+
+import java.{util => ju}
+
+import scala.collection.mutable
+
+import com.codahale.metrics.{Gauge, MetricRegistry}
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.metrics.source.{Source => CodahaleSource}
+import org.apache.spark.util.Clock
+
+/**
+ * Class that manages all the metrics related to a StreamingQuery. It does the following.
+ * - Calculates metrics (rates, latencies, etc.) based on information reported by StreamExecution.
+ * - Allows the current metric values to be queried
+ * - Serves some of the metrics through Codahale/DropWizard metrics
+ *
+ * @param sources Unique set of sources in a query
+ * @param triggerClock Clock used for triggering in StreamExecution
+ * @param codahaleSourceName Root name for all the Codahale metrics
+ */
+class StreamMetrics(sources: Set[Source], triggerClock: Clock, codahaleSourceName: String)
+  extends CodahaleSource with Logging {
+
+  import StreamMetrics._
+
+  // Trigger infos
+  private val triggerStatus = new mutable.HashMap[String, String]
+  private val sourceTriggerStatus = new mutable.HashMap[Source, mutable.HashMap[String, String]]
+
+  // Rate estimators for sources and sinks
+  private val inputRates = new mutable.HashMap[Source, RateCalculator]
+  private val processingRates = new mutable.HashMap[Source, RateCalculator]
+
+  // Number of input rows in the current trigger
+  private val numInputRows = new mutable.HashMap[Source, Long]
+  private var numOutputRows: Option[Long] = None
+  private var currentTriggerStartTimestamp: Long = -1
+  private var previousTriggerStartTimestamp: Long = -1
+  private var latency: Option[Double] = None
+
+  override val sourceName: String = codahaleSourceName
+  override val metricRegistry: MetricRegistry = new MetricRegistry
+
+  // === Initialization ===
+
+  // Metric names should not have . in them, so that all the metrics of a query are identified
+  // together in Ganglia as a single metric group
+  registerGauge("inputRate-total", currentInputRate)
+  registerGauge("processingRate-total", () => currentProcessingRate)
+  registerGauge("latency", () => currentLatency().getOrElse(-1.0))
+
+  sources.foreach { s =>
+    inputRates.put(s, new RateCalculator)
+    processingRates.put(s, new RateCalculator)
+    sourceTriggerStatus.put(s, new mutable.HashMap[String, String])
+
+    registerGauge(s"inputRate-${s.toString}", () => currentSourceInputRate(s))
+    registerGauge(s"processingRate-${s.toString}", () => currentSourceProcessingRate(s))
+  }
+
+  // === Setter methods ===
+
+  def reportTriggerStarted(triggerId: Long): Unit = synchronized {
+    numInputRows.clear()
+    numOutputRows = None
+    triggerStatus.clear()
+    sourceTriggerStatus.values.foreach(_.clear())
+
+    reportTriggerStatus(TRIGGER_ID, triggerId)
+    sources.foreach(s => reportSourceTriggerStatus(s, TRIGGER_ID, triggerId))
+    reportTriggerStatus(ACTIVE, true)
+    currentTriggerStartTimestamp = triggerClock.getTimeMillis()
+    reportTriggerStatus(START_TIMESTAMP, currentTriggerStartTimestamp)
+  }
+
+  def reportTriggerStatus[T](key: String, value: T): Unit = synchronized {
+    triggerStatus.put(key, value.toString)
+  }
+
+  def reportSourceTriggerStatus[T](source: Source, key: String, value: T): Unit = synchronized {
+    sourceTriggerStatus(source).put(key, value.toString)
+  }
+
+  def reportNumInputRows(inputRows: Map[Source, Long]): Unit =

[GitHub] spark pull request #15307: [SPARK-17731][SQL][STREAMING] Metrics for structu...

2016-10-11 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/15307#discussion_r82856942
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala ---
@@ -530,7 +692,7 @@ class StreamExecution(
   case object TERMINATED extends State
 }
 
-object StreamExecution {
+object StreamExecution extends Logging {
--- End diff --

removed.





[GitHub] spark pull request #15307: [SPARK-17731][SQL][STREAMING] Metrics for structu...

2016-10-11 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/15307#discussion_r82857011
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala ---
@@ -516,12 +563,127 @@ class StreamExecution(
       """.stripMargin
   }
 
-  private def toInfo: StreamingQueryInfo = {
-    new StreamingQueryInfo(
-      this.name,
-      this.id,
-      this.sourceStatuses,
-      this.sinkStatus)
+  /**
+   * Report row metrics of the executed trigger
+   * @param triggerExecutionPlan Execution plan of the trigger
+   * @param triggerLogicalPlan Logical plan of the trigger, generated from the query logical plan
+   * @param sourceToDF Source to DataFrame returned by the source.getBatch in this trigger
+   */
+  private def reportNumRows(
+      triggerExecutionPlan: SparkPlan,
+      triggerLogicalPlan: LogicalPlan,
+      sourceToDF: Map[Source, DataFrame]): Unit = {
+    // We want to associate execution plan leaves to sources that generate them, so that we match
+    // their metrics (e.g. numOutputRows) to the sources. To do this we do the following.
+    // Consider the translation from the streaming logical plan to the final executed plan.
+    //
+    //   streaming logical plan (with sources) <==> trigger's logical plan <==> executed plan
+    //
+    // 1. We keep track of streaming sources associated with each leaf in the trigger's logical plan
+    //    - Each logical plan leaf will be associated with a single streaming source.
+    //    - There can be multiple logical plan leaves associated with a streaming source.
--- End diff --

fixed.





[GitHub] spark pull request #15307: [SPARK-17731][SQL][STREAMING] Metrics for structu...

2016-10-11 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/15307#discussion_r82856924
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala ---
@@ -221,8 +247,15 @@ class StreamExecution(
        }
    } finally {
      state = TERMINATED
+
+      // Update metrics and status
+      streamMetrics.stop()
+      sparkSession.sparkContext.env.metricsSystem.removeSource(streamMetrics)
+      updateStatus()
+
+      // Notify others
      sparkSession.streams.notifyQueryTermination(StreamExecution.this)
-      postEvent(new QueryTerminated(this.toInfo, exception.map(_.cause).map(Utils.exceptionString)))
+      postEvent(new QueryTerminated(status, exception.map(_.cause).map(Utils.exceptionString)))
--- End diff --

done.





[GitHub] spark pull request #15307: [SPARK-17731][SQL][STREAMING] Metrics for structu...

2016-10-11 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/15307#discussion_r82856802
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala ---
@@ -105,11 +105,21 @@ class StreamExecution(
   var lastExecution: QueryExecution = null
 
   @volatile
-  var streamDeathCause: StreamingQueryException = null
+  private var streamDeathCause: StreamingQueryException = null
 
   /* Get the call site in the caller thread; will pass this into the micro batch thread */
   private val callSite = Utils.getCallSite()
 
+  /** Metrics for this query */
+  private val streamMetrics =
+    new StreamMetrics(uniqueSources.toSet, triggerClock, s"StructuredStreaming.$name")
+
+  @volatile
+  private var currentStatus: StreamingQueryStatus = null
+
+  @volatile
+  private var metricWarningLogged: Boolean = false
--- End diff --

Added.





[GitHub] spark pull request #15307: [SPARK-17731][SQL][STREAMING] Metrics for structu...

2016-10-11 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/15307#discussion_r82856363
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetrics.scala ---
@@ -0,0 +1,244 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.streaming
+
+import java.{util => ju}
+
+import scala.collection.mutable
+
+import com.codahale.metrics.{Gauge, MetricRegistry}
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.metrics.source.{Source => CodahaleSource}
+import org.apache.spark.util.Clock
+
+/**
+ * Class that manages all the metrics related to a StreamingQuery. It does the following.
+ * - Calculates metrics (rates, latencies, etc.) based on information reported by StreamExecution.
+ * - Allows the current metric values to be queried
+ * - Serves some of the metrics through Codahale/DropWizard metrics
+ *
+ * @param sources Unique set of sources in a query
+ * @param triggerClock Clock used for triggering in StreamExecution
+ * @param codahaleSourceName Root name for all the Codahale metrics
+ */
+class StreamMetrics(sources: Set[Source], triggerClock: Clock, codahaleSourceName: String)
+  extends CodahaleSource with Logging {
+
+  import StreamMetrics._
+
+  // Trigger infos
+  private val triggerStatus = new mutable.HashMap[String, String]
+  private val sourceTriggerStatus = new mutable.HashMap[Source, mutable.HashMap[String, String]]
+
+  // Rate estimators for sources and sinks
+  private val inputRates = new mutable.HashMap[Source, RateCalculator]
+  private val processingRates = new mutable.HashMap[Source, RateCalculator]
+
+  // Number of input rows in the current trigger
+  private val numInputRows = new mutable.HashMap[Source, Long]
+  private var numOutputRows: Option[Long] = None
+  private var currentTriggerStartTimestamp: Long = -1
+  private var previousTriggerStartTimestamp: Long = -1
+  private var latency: Option[Double] = None
+
+  override val sourceName: String = codahaleSourceName
+  override val metricRegistry: MetricRegistry = new MetricRegistry
+
+  // === Initialization ===
+
+  // Metric names should not have . in them, so that all the metrics of a query are identified
+  // together in Ganglia as a single metric group
+  registerGauge("inputRate-total", currentInputRate)
+  registerGauge("processingRate-total", () => currentProcessingRate)
+  registerGauge("latency", () => currentLatency().getOrElse(-1.0))
+
+  sources.foreach { s =>
+    inputRates.put(s, new RateCalculator)
+    processingRates.put(s, new RateCalculator)
+    sourceTriggerStatus.put(s, new mutable.HashMap[String, String])
+
+    registerGauge(s"inputRate-${s.toString}", () => currentSourceInputRate(s))
+    registerGauge(s"processingRate-${s.toString}", () => currentSourceProcessingRate(s))
+  }
+
+  // === Setter methods ===
+
+  def reportTriggerStarted(triggerId: Long): Unit = synchronized {
+    numInputRows.clear()
+    numOutputRows = None
+    triggerStatus.clear()
+    sourceTriggerStatus.values.foreach(_.clear())
+
+    reportTriggerStatus(TRIGGER_ID, triggerId)
+    sources.foreach(s => reportSourceTriggerStatus(s, TRIGGER_ID, triggerId))
+    reportTriggerStatus(ACTIVE, true)
+    currentTriggerStartTimestamp = triggerClock.getTimeMillis()
+    reportTriggerStatus(START_TIMESTAMP, currentTriggerStartTimestamp)
+  }
+
+  def reportTriggerStatus[T](key: String, value: T): Unit = synchronized {
+    triggerStatus.put(key, value.toString)
+  }
+
+  def reportSourceTriggerStatus[T](source: Source, key: String, value: T): Unit = synchronized {
+    sourceTriggerStatus(source).put(key, value.toString)
+  }
+
+  def reportNumInputRows(inputRows: Map[Source, Long]): Unit =

[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-11 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/14690
  
I believe that using a method like `TableFileCatalog.filterPartitions` to build a new file catalog restricted to some pruned partitions is a sound approach; however, I'm starting to reconsider the implementation. Specifically, I'm thinking that having that method return a `ListingFileCatalog` is the wrong thing to do. It may be unnecessarily heavy-handed.

For one thing, a `ListingFileCatalog` performs a file tree traversal right 
off the bat. However, the external catalog returns the locations of partitions 
as part of the `listPartitionsByFilter` call. I believe that should suffice for 
the purpose of building a query plan for metastore-backed tables and executing 
it.

The current implementation works, but I'm going to explore the latter 
possibility as a more efficient implementation. If I can make it work without 
too much complexity I'll probably just push it as a new commit to this PR.
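
A rough sketch of that lighter-weight alternative (illustrative names, not Spark's actual catalog interfaces): build the plan's paths from the locations the metastore already returned, instead of eagerly traversing the file tree:

```scala
object PartitionLocationSketch {
  // Stand-in for a metastore partition and its storage location.
  final case class CatalogPartition(values: Seq[String], location: String)

  // listPartitionsByFilter already yields each partition's path, so no
  // upfront file tree traversal is needed to know where the data lives.
  def pathsForQuery(
      listPartitionsByFilter: Seq[String] => Seq[CatalogPartition],
      predicates: Seq[String]): Seq[String] =
    listPartitionsByFilter(predicates).map(_.location)
}
```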





[GitHub] spark issue #11336: [SPARK-9325][SPARK-R] collect() head() and show() for Co...

2016-10-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/11336
  
**[Test build #66753 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66753/consoleFull)** for PR 11336 at commit [`8f906a2`](https://github.com/apache/spark/commit/8f906a2e8050c391355c3ddab811d990020728cb).





[GitHub] spark pull request #15307: [SPARK-17731][SQL][STREAMING] Metrics for structu...

2016-10-11 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/15307#discussion_r82855608
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala ---
@@ -516,12 +563,127 @@ class StreamExecution(
       """.stripMargin
   }
 
-  private def toInfo: StreamingQueryInfo = {
-    new StreamingQueryInfo(
-      this.name,
-      this.id,
-      this.sourceStatuses,
-      this.sinkStatus)
+  /**
+   * Report row metrics of the executed trigger
+   * @param triggerExecutionPlan Execution plan of the trigger
+   * @param triggerLogicalPlan Logical plan of the trigger, generated from the query logical plan
+   * @param sourceToDF Source to DataFrame returned by the source.getBatch in this trigger
+   */
+  private def reportNumRows(
+      triggerExecutionPlan: SparkPlan,
+      triggerLogicalPlan: LogicalPlan,
+      sourceToDF: Map[Source, DataFrame]): Unit = {
+    // We want to associate execution plan leaves to sources that generate them, so that we match
+    // their metrics (e.g. numOutputRows) to the sources. To do this we do the following.
+    // Consider the translation from the streaming logical plan to the final executed plan.
+    //
+    //   streaming logical plan (with sources) <==> trigger's logical plan <==> executed plan
+    //
+    // 1. We keep track of streaming sources associated with each leaf in the trigger's logical plan
+    //    - Each logical plan leaf will be associated with a single streaming source.
+    //    - There can be multiple logical plan leaves associated with a streaming source.
+    //    - There can be leaves not associated with any streaming source, because they were
+    //      generated from a batch source (e.g. stream-batch joins)
+    //
+    // 2. Assuming that the executed plan has same number of leaves in the same order as that of
+    //    the trigger logical plan, we associate executed plan leaves with corresponding
+    //    streaming sources.
+    //
+    // 3. For each source, we sum the metrics of the associated execution plan leaves.
+    //
+    val logicalPlanLeafToSource = sourceToDF.flatMap { case (source, df) =>
+      df.logicalPlan.collectLeaves().map { leaf => leaf -> source }
+    }
+    val allLogicalPlanLeaves = triggerLogicalPlan.collectLeaves() // includes non-streaming sources
+    val allExecPlanLeaves = triggerExecutionPlan.collectLeaves()
+    val sourceToNumInputRows: Map[Source, Long] =
+      if (allLogicalPlanLeaves.size == allExecPlanLeaves.size) {
+        val execLeafToSource = allLogicalPlanLeaves.zip(allExecPlanLeaves).flatMap {
--- End diff --

See the condition in the previous line :)
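
A toy sketch of the leaf-matching step the quoted comment describes (made-up leaf and source names): zipping the two leaf lists pairs each executed-plan leaf with the source of the corresponding logical leaf, and the `flatMap` drops leaves that came from batch sources:

```scala
object LeafMatchingSketch extends App {
  val logicalLeaves = Seq("streamRelation", "batchRelation")
  val execLeaves    = Seq("streamScanExec", "batchScanExec")
  val leafToSource  = Map("streamRelation" -> "MyStreamingSource")

  // Only valid because both plans have the same number of leaves in the
  // same order -- the condition checked on the previous line of the diff.
  val execLeafToSource = logicalLeaves.zip(execLeaves).flatMap {
    case (logicalLeaf, execLeaf) => leafToSource.get(logicalLeaf).map(src => execLeaf -> src)
  }

  println(execLeafToSource.toMap) // Map(streamScanExec -> MyStreamingSource)
}
```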





[GitHub] spark pull request #15307: [SPARK-17731][SQL][STREAMING] Metrics for structu...

2016-10-11 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/15307#discussion_r82855520
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala ---
@@ -176,7 +184,9 @@ class StreamExecution(
      // Mark ACTIVE and then post the event. QueryStarted event is synchronously sent to listeners,
      // so must mark this as ACTIVE first.
      state = ACTIVE
-      postEvent(new QueryStarted(this.toInfo)) // Assumption: Does not throw exception.
+      sparkSession.sparkContext.env.metricsSystem.registerSource(streamMetrics)
--- End diff --

This is not expensive. A lot of these sources are registered by default in the SparkContext, so the additional overhead is minimal. Also, the overhead depends solely on how frequently Dropwizard polls for the values of rates, latencies, etc.; typically that is every tens of seconds.





[GitHub] spark pull request #15436: [SPARK-17875] [BUILD] Remove unneeded direct depe...

2016-10-11 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15436#discussion_r82854783
  
--- Diff: NOTICE ---
@@ -162,7 +162,7 @@ Please visit the Netty web site for more information:
 
   * http://netty.io/
 
-Copyright 2011 The Netty Project
+Copyright 2014 The Netty Project
--- End diff --

The license changes are an overdue update to the Netty 4.x license text from the 3.x version.





[GitHub] spark pull request #15436: [SPARK-17875] [BUILD] Remove unneeded direct depe...

2016-10-11 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15436#discussion_r82854900
  
--- Diff: dev/deps/spark-deps-hadoop-2.3 ---
@@ -130,7 +130,6 @@ metrics-json-3.1.2.jar
 metrics-jvm-3.1.2.jar
 minlog-1.3.0.jar
 mx4j-3.0.2.jar
-netty-3.8.0.Final.jar
--- End diff --

I can't quite figure this out: Hadoop 2.2 and 2.6-2.7 do transitively 
depend on Netty 3.6.x. 2.3 and 2.4 do not. _shrug_





[GitHub] spark issue #15436: [SPARK-17875] [BUILD] Remove unneeded direct dependence ...

2016-10-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15436
  
**[Test build #66752 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66752/consoleFull)** for PR 15436 at commit [`a5c5c31`](https://github.com/apache/spark/commit/a5c5c3146e702a5c6ac8a86648f58f44d13a95f2).





[GitHub] spark pull request #15436: [SPARK-17875] [BUILD] Remove unneeded direct depe...

2016-10-11 Thread srowen
GitHub user srowen opened a pull request:

https://github.com/apache/spark/pull/15436

[SPARK-17875] [BUILD] Remove unneeded direct dependence on Netty 3.x

## What changes were proposed in this pull request?

Remove unneeded direct dependency on Netty 3.x. I left the 
`dependencyManagement` entry because some Hadoop libs still use an older 
version of Netty 3, and I thought it would be weird if the transitive version 
we reference went backwards. (Note too that Flume declares a direct separate 
dependency in test scope on Netty 3.4.x)


## How was this patch tested?

Existing tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/spark SPARK-17875

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15436.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15436


commit a5c5c3146e702a5c6ac8a86648f58f44d13a95f2
Author: Sean Owen 
Date:   2016-10-11T18:10:08Z

Remove unneeded direct dependency on Netty 3.x







[GitHub] spark issue #15190: [SPARK-17620][SQL] Determine Serde by hive.default.filef...

2016-10-11 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/15190
  
@yhuai We will use the Parquet format in your example, since we look at the
`spark.sql.sources.default` configuration to decide which format to use.

Here is the output for your perusal.

``` SQL
spark-sql> set spark.sql.hive.convertCTAS=true;
spark.sql.hive.convertCTAS  true
Time taken: 3.309 seconds, Fetched 1 row(s)
spark-sql> set hive.default.fileformat=orc;
hive.default.fileformat orc
Time taken: 0.053 seconds, Fetched 1 row(s)
spark-sql> CREATE TABLE IF NOT EXISTS test select 1 from foo;
spark-sql> describe formatted test;
...
# Storage Information
SerDe Library:  org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:    org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:   org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
```

Now change `spark.sql.sources.default=orc`:
```SQL
spark-sql> set spark.sql.sources.default=orc;
spark.sql.sources.default   orc
spark-sql> CREATE TABLE IF NOT EXISTS test2 select 1 from foo;
Time taken: 0.451 seconds
spark-sql> describe formatted test2;
...
# Storage Information   
SerDe Library:  org.apache.hadoop.hive.ql.io.orc.OrcSerde   
InputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat 
OutputFormat:   org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
``` 

Please let me know if you have any further questions.







[GitHub] spark issue #15410: [SPARK-17843][Web UI] Indicate event logs pending for pr...

2016-10-11 Thread vijoshi
Github user vijoshi commented on the issue:

https://github.com/apache/spark/pull/15410
  
@tgravescs - you're right - for newer logs there could be a window of time
(10 seconds, or whatever the user configures) where the new logs have not yet
been picked up for replay and the UI says nothing about them. However, the
issue we see is more with old completed apps: shortly after a history server
restart, a user browsing to the app list has no idea why the older completed
apps are missing (perhaps apps that were visible just before the restart).
Since the polling for logs/replay is scheduled immediately as part of history
server startup (zero delay for the first round), this fix could help those
cases.
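
A minimal sketch of that startup scheduling (illustrative names, not the
actual `FsHistoryProvider` code):

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Stand-in for the provider's log-scanning pass.
def checkForLogs(): Unit = println("scanning the event log directory")

val pool = Executors.newSingleThreadScheduledExecutor()
// Zero initial delay: the first scan starts immediately at startup,
// then repeats every 10 seconds (or whatever interval is configured).
pool.scheduleWithFixedDelay(new Runnable {
  override def run(): Unit = checkForLogs()
}, 0, 10, TimeUnit.SECONDS)
```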





[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-11 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/15422
  
> For example, in MR you have the ability to even set the percentage of bad
records you want to tolerate (we dont have that in spark).

I may be wrong, but in MR I think a bad record just means `map` or `reduce`
threw an exception. It's not related to any IOException (including
EOFException).
(https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/Task.java#L1490)
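
To make the distinction concrete, a minimal sketch (not Spark's actual
`HadoopRDD` code) of how swallowing EOFException while reading makes a
truncated file indistinguishable from a clean end of input:

```scala
import java.io.EOFException
import scala.collection.mutable.ArrayBuffer

// Swallowing EOFException in the read loop turns a corrupt/truncated input
// into an apparently clean end-of-stream; records after the cut are lost.
def readAll[T](next: () => Option[T]): Seq[T] = {
  val buf = ArrayBuffer.empty[T]
  var finished = false
  while (!finished) {
    try {
      next() match {
        case Some(record) => buf += record
        case None         => finished = true  // genuine end of input
      }
    } catch {
      case _: EOFException => finished = true  // truncation treated as EOF
    }
  }
  buf.toSeq
}
```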





[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14690
  
**[Test build #66751 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66751/consoleFull)** for PR 14690 at commit [`2762efd`](https://github.com/apache/spark/commit/2762efd925325a9ab732b34e6672b704dc514062).





[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-11 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/14690
  
Ah cripes. I committed something I didn't want to. I'm rebasing again in a 
few...





[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2016-10-11 Thread sethah
Github user sethah commented on the issue:

https://github.com/apache/spark/pull/15435
  
I'll try to take a look before too long. For now, I see there are no tests;
could you please add some, using the summary tests for binary classification
as a guide? Thanks!





[GitHub] spark pull request #15384: [SPARK-17346][SQL][Tests]Fix the flaky topic dele...

2016-10-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15384





[GitHub] spark issue #15384: [SPARK-17346][SQL][Tests]Fix the flaky topic deletion in...

2016-10-11 Thread tdas
Github user tdas commented on the issue:

https://github.com/apache/spark/pull/15384
  
Merging it to master and branch-2.0.





[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...

2016-10-11 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/14690#discussion_r82850487
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -225,13 +225,16 @@ case class FileSourceScanExec(
   }
 
   // These metadata values make scan plans uniquely identifiable for equality checking.
-  override val metadata: Map[String, String] = Map(
-    "Format" -> relation.fileFormat.toString,
-    "ReadSchema" -> outputSchema.catalogString,
-    "Batched" -> supportsBatch.toString,
-    "PartitionFilters" -> partitionFilters.mkString("[", ", ", "]"),
-    "PushedFilters" -> dataFilters.mkString("[", ", ", "]"),
-    "InputPaths" -> relation.location.paths.mkString(", "))
+  override val metadata: Map[String, String] = {
+    def seqToString(seq: Seq[Any]) = seq.mkString("[", ", ", "]")
+    val location = relation.location
+    Map("Format" -> relation.fileFormat.toString,
+      "ReadSchema" -> outputSchema.catalogString,
+      "Batched" -> supportsBatch.toString,
+      "PartitionFilters" -> seqToString(partitionFilters),
+      "PushedFilters" -> seqToString(dataFilters),
+      "Location" -> (location.getClass.getSimpleName + seqToString(location.rootPaths)))
--- End diff --

I'm going to pull this value out to a local `val`. The reason I don't want 
to make this part of the `BasicFileCatalog` type is that this format seems 
specific to this class's `metadata` method. For example, we're using the 
method-local `seqToString` method to format the `rootPaths` list.

Regarding the partition count, this is actually a little trickier than I 
expected. I'm going to expand on that in a standalone PR comment after the next 
push.
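
Roughly along these lines (a simplified, hypothetical sketch of the refactor,
not the actual patch):

```scala
// Simplified stand-in for the real catalog type, just to show the shape.
case class FileCatalog(rootPaths: Seq[String])

def seqToString(seq: Seq[Any]): String = seq.mkString("[", ", ", "]")

def metadata(format: String, location: FileCatalog): Map[String, String] = {
  // Pulled out to a local val rather than adding a description method to the
  // catalog type, since this formatting is specific to `metadata`.
  val locationDesc = location.getClass.getSimpleName + seqToString(location.rootPaths)
  Map("Format" -> format, "Location" -> locationDesc)
}
```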





[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-11 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/15422
  
@mridulm for the scenario you're imagining, maybe the data is OK, sure. That
doesn't mean it's true in all cases. Yeah, this is really to work around bad
input, which you can, to some degree, handle at the user level. Other parts
of Spark don't work this way. I'm neutral on whether this is a good idea at
all, but would prefer consistency more than anything.





[GitHub] spark issue #13194: [SPARK-15402] [ML] [PySpark] PySpark ml.evaluation shoul...

2016-10-11 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/13194
  
Just one question, otherwise LGTM





[GitHub] spark pull request #13194: [SPARK-15402] [ML] [PySpark] PySpark ml.evaluatio...

2016-10-11 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/13194#discussion_r82849461
  
--- Diff: python/pyspark/ml/evaluation.py ---
@@ -21,7 +21,8 @@
 from pyspark.ml.wrapper import JavaParams
 from pyspark.ml.param import Param, Params, TypeConverters
 from pyspark.ml.param.shared import HasLabelCol, HasPredictionCol, HasRawPredictionCol
-from pyspark.ml.common import inherit_doc
+from pyspark.ml.util import JavaMLReadable, JavaMLWritable
+from pyspark.mllib.common import inherit_doc
--- End diff --

Just wondering why change to import from `mllib.common`?





[GitHub] spark pull request #15398: [SPARK-17647][SQL] Fix backslash escaping in 'LIK...

2016-10-11 Thread jodersky
Github user jodersky commented on a diff in the pull request:

https://github.com/apache/spark/pull/15398#discussion_r82849408
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala ---
@@ -25,26 +25,25 @@ object StringUtils {
 
   // replace the _ with .{1} exactly match 1 time of any character
   // replace the % with .*, match 0 or more times with any character
-  def escapeLikeRegex(v: String): String = {
-    if (!v.isEmpty) {
-      "(?s)" + (' ' +: v.init).zip(v).flatMap {
-        case (prev, '\\') => ""
-        case ('\\', c) =>
-          c match {
-            case '_' => "_"
-            case '%' => "%"
-            case _ => Pattern.quote("\\" + c)
-          }
-        case (prev, c) =>
-          c match {
-            case '_' => "."
-            case '%' => ".*"
-            case _ => Pattern.quote(Character.toString(c))
-          }
-      }.mkString
-    } else {
-      v
+  def escapeLikeRegex(str: String): String = {
+    val builder = new StringBuilder()
+    var escaping = false
+    for (next <- str) {
+      if (escaping) {
+        builder ++= Pattern.quote(Character.toString(next))
--- End diff --

Every character after a backslash is quoted, so `\\a` becomes `\Q\\E\Qa\E`. 
Is this not the intended behaviour?
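
For reference, a self-contained version of the loop that reproduces that
behaviour (everything past the line quoted above is my reconstruction, not
necessarily the patch as submitted):

```scala
import java.util.regex.Pattern

def escapeLikeRegex(str: String): String = {
  val builder = new StringBuilder()
  var escaping = false
  for (next <- str) {
    if (escaping) {
      // Whatever follows a backslash is quoted literally,
      // so "\\a" becomes \Q\\E\Qa\E.
      builder ++= Pattern.quote(Character.toString(next))
      escaping = false
    } else {
      next match {
        case '\\' => escaping = true
        case '_'  => builder += '.'
        case '%'  => builder ++= ".*"
        case c    => builder ++= Pattern.quote(Character.toString(c))
      }
    }
  }
  builder.toString
}
```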





[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-11 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/15422
  
@marmbrus +1 on logging - that is something that was probably missed here.




