[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation

2015-04-13 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493350#comment-14493350
 ] 

Yi Zhou commented on SPARK-5791:


[~yhuai], yes, both used Parquet.

 [Spark SQL] show poor performance when multiple table do join operation
 ---

 Key: SPARK-5791
 URL: https://issues.apache.org/jira/browse/SPARK-5791
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yi Zhou
 Attachments: Physcial_Plan_Hive.txt, 
 Physcial_Plan_SparkSQL_Updated.txt, Physical_Plan.txt


 Spark SQL shows poor performance when multiple tables are joined.






[jira] [Assigned] (SPARK-4638) Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4638:
---

Assignee: Apache Spark

 Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to 
 find non linear boundaries
 ---

 Key: SPARK-4638
 URL: https://issues.apache.org/jira/browse/SPARK-4638
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: madankumar s
Assignee: Apache Spark
  Labels: Gaussian, Kernels, SVM
 Attachments: kernels-1.3.patch


 SPARK MLlib Classification Module:
 Add kernel functionality to the SVM classifier to find non-linear patterns






[jira] [Commented] (SPARK-4638) Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries

2015-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493347#comment-14493347
 ] 

Apache Spark commented on SPARK-4638:
-

User 'mandar2812' has created a pull request for this issue:
https://github.com/apache/spark/pull/5503

 Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to 
 find non linear boundaries
 ---

 Key: SPARK-4638
 URL: https://issues.apache.org/jira/browse/SPARK-4638
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: madankumar s
  Labels: Gaussian, Kernels, SVM
 Attachments: kernels-1.3.patch


 SPARK MLlib Classification Module:
 Add kernel functionality to the SVM classifier to find non-linear patterns






[jira] [Assigned] (SPARK-4638) Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4638:
---

Assignee: (was: Apache Spark)

 Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to 
 find non linear boundaries
 ---

 Key: SPARK-4638
 URL: https://issues.apache.org/jira/browse/SPARK-4638
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: madankumar s
  Labels: Gaussian, Kernels, SVM
 Attachments: kernels-1.3.patch


 SPARK MLlib Classification Module:
 Add kernel functionality to the SVM classifier to find non-linear patterns






[jira] [Updated] (SPARK-4766) ML Estimator Params should subclass Transformer Params

2015-04-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-4766:
-
Description: 
Currently, in spark.ml, both Transformers and Estimators extend the same Params 
classes.  There should be one Params class for the Transformer and one for the 
Estimator, where the Estimator params class extends the Transformer one.

E.g., it is weird to be able to do:
{code}
val model: LogisticRegressionModel = ...
model.getMaxIter()
{code}

It's also weird to be able to:
* Wrap LogisticRegressionModel (a Transformer) with CrossValidator
* Pass a set of ParamMaps to CrossValidator which includes parameter 
LogisticRegressionModel.maxIter
* (CrossValidator would try to set that parameter.)
* I'm not sure if this would cause a failure or just be a noop.

  was:
Currently, in spark.ml, both Transformers and Estimators extend the same Params 
classes.  There should be one Params class for the Transformer and one for the 
Estimator, where the Estimator params class extends the Transformer one.

E.g., it is weird to be able to do:
{code}
val model: LogisticRegressionModel = ...
model.getMaxIter()
{code}

(This is the only case where this happens currently, but it is worth setting a 
precedent.)


 ML Estimator Params should subclass Transformer Params
 --

 Key: SPARK-4766
 URL: https://issues.apache.org/jira/browse/SPARK-4766
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 Currently, in spark.ml, both Transformers and Estimators extend the same 
 Params classes.  There should be one Params class for the Transformer and one 
 for the Estimator, where the Estimator params class extends the Transformer 
 one.
 E.g., it is weird to be able to do:
 {code}
 val model: LogisticRegressionModel = ...
 model.getMaxIter()
 {code}
 It's also weird to be able to:
 * Wrap LogisticRegressionModel (a Transformer) with CrossValidator
 * Pass a set of ParamMaps to CrossValidator which includes parameter 
 LogisticRegressionModel.maxIter
 * (CrossValidator would try to set that parameter.)
 * I'm not sure if this would cause a failure or just be a noop.
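 To make the proposed split concrete, here is a minimal sketch of the kind of 
 hierarchy described above (trait and parameter names are hypothetical, not 
 the actual spark.ml API):
 {code}
 // Sketch: the Estimator's params extend the Transformer's (model's) params,
 // so the fitted model only exposes parameters that still make sense on it.
 trait MyModelParams {
   var threshold: Double = 0.5   // meaningful on the fitted model
 }

 trait MyEstimatorParams extends MyModelParams {
   var maxIter: Int = 100        // training-only parameter
   var regParam: Double = 0.0
 }

 class MyModel extends MyModelParams {
   // no getMaxIter here: maxIter is not a model parameter
 }

 class MyEstimator extends MyEstimatorParams {
   def fit(): MyModel = new MyModel
 }
 {code}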






[jira] [Created] (SPARK-6892) Recovery from checkpoint will also reuse the application id when write eventLog in yarn-cluster mode

2015-04-13 Thread yangping wu (JIRA)
yangping wu created SPARK-6892:
--

 Summary: Recovery from checkpoint will also reuse the application 
id when write eventLog in yarn-cluster mode
 Key: SPARK-6892
 URL: https://issues.apache.org/jira/browse/SPARK-6892
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: yangping wu
Priority: Critical


When I recover from a checkpoint in yarn-cluster mode using Spark Streaming, I 
found that it reuses the application id from before the failure (in my case 
application_1428664056212_0016) when writing the Spark eventLog. But my new 
application id is application_1428664056212_0017, so writing the eventLog 
fails with the stack trace below:
{code}
15/04/14 10:14:01 WARN util.ShutdownHookManager: ShutdownHook '$anon$3' failed, 
java.io.IOException: Target log file already exists 
(hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
java.io.IOException: Target log file already exists 
(hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
at 
org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:201)
at 
org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
at 
org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1388)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:107)
at 
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
{code}
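For reference, recovery from a checkpoint is normally wired up with 
{{StreamingContext.getOrCreate}}, which rebuilds the context from the 
checkpointed data on restart; this is the code path exercised above. A minimal 
sketch (the checkpoint directory and batch interval are placeholders):
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs://mycluster/spark-checkpoints/myApp"  // placeholder

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-streaming-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... define input DStreams and transformations here ...
  ssc
}

// Fresh start: createContext() runs. After a failure: the context is rebuilt
// from the checkpoint, which is when the old application id reported above
// ends up being reused for the event log.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
{code}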






[jira] [Created] (SPARK-6893) Better handling of pipeline parameters in PySpark

2015-04-13 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-6893:


 Summary: Better handling of pipeline parameters in PySpark
 Key: SPARK-6893
 URL: https://issues.apache.org/jira/browse/SPARK-6893
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


This is SPARK-5957 for Python.






[jira] [Commented] (SPARK-6511) Publish hadoop provided build with instructions for different distros

2015-04-13 Thread Kannan Rajah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493585#comment-14493585
 ] 

Kannan Rajah commented on SPARK-6511:
-

As requested by Patrick, here is an example of what we use in spark-env.sh for 
the MapR distribution.

MAPR_HADOOP_CLASSPATH=`hadoop classpath`
MAPR_SPARK_CLASSPATH=$MAPR_HADOOP_CLASSPATH:$MAPR_HADOOP_HBASE_VERSION

MAPR_HADOOP_JNI_PATH=`hadoop jnipath`

export SPARK_LIBRARY_PATH=$MAPR_HADOOP_JNI_PATH

SPARK_SUBMIT_CLASSPATH=$SPARK_SUBMIT_CLASSPATH:$MAPR_SPARK_CLASSPATH
SPARK_SUBMIT_LIBRARY_PATH=$SPARK_SUBMIT_LIBRARY_PATH:$MAPR_HADOOP_JNI_PATH

export SPARK_SUBMIT_CLASSPATH
export SPARK_SUBMIT_LIBRARY_PATH


 Publish hadoop provided build with instructions for different distros
 ---

 Key: SPARK-6511
 URL: https://issues.apache.org/jira/browse/SPARK-6511
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell

 Currently we publish a series of binaries with different Hadoop client jars. 
 This mostly works, but some users have reported compatibility issues with 
 different distributions.
 One improvement moving forward might be to publish a binary build that simply 
 asks you to set HADOOP_HOME to pick up the Hadoop client location. That way 
 it would work across multiple distributions, even if they have subtle 
 incompatibilities with upstream Hadoop.
 I think a first step for this would be to produce such a build for the 
 community and see how well it works. One potential issue is that our fancy 
 excludes and dependency re-writing won't work with the simpler approach of 
 appending Hadoop's classpath to Spark. Also, how we deal with the Hive 
 dependency is unclear, i.e. should we continue to bundle Spark's Hive (which 
 has some fixes for dependency conflicts), or do we allow linking against 
 vanilla Hive at runtime?






[jira] [Commented] (SPARK-6511) Publish hadoop provided build with instructions for different distros

2015-04-13 Thread Kannan Rajah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493592#comment-14493592
 ] 

Kannan Rajah commented on SPARK-6511:
-

[~pwendell] Just wanted to let you know that we also have a way to add Hive and 
HBase jars to the classpath. This is useful when a setup has multiple versions 
of Hive and HBase installed, but a given Spark version only works with a 
specific version. We have some utility scripts that generate the right 
classpath entries based on a supported version of Hive and HBase. If you think 
this would be useful in the Apache distribution, I can create a JIRA and share 
the code. At a high level, there are 3 files:

- compatibility.version: File that holds the supported versions for each 
ecosystem component, e.g.
hive_versions=0.13,0.12
hbase_versions=0.98

- compatible_version.sh: Returns the compatible version for a component by 
looking up the compatibility.version file. The first version that is available 
on the node is used.

- generate_classpath.sh: Uses the above 2 files to generate the classpath. This 
script is used in spark-env.sh to generate the classpath based on the Hive and 
HBase versions.

 Publish hadoop provided build with instructions for different distros
 ---

 Key: SPARK-6511
 URL: https://issues.apache.org/jira/browse/SPARK-6511
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell

 Currently we publish a series of binaries with different Hadoop client jars. 
 This mostly works, but some users have reported compatibility issues with 
 different distributions.
 One improvement moving forward might be to publish a binary build that simply 
 asks you to set HADOOP_HOME to pick up the Hadoop client location. That way 
 it would work across multiple distributions, even if they have subtle 
 incompatibilities with upstream Hadoop.
 I think a first step for this would be to produce such a build for the 
 community and see how well it works. One potential issue is that our fancy 
 excludes and dependency re-writing won't work with the simpler approach of 
 appending Hadoop's classpath to Spark. Also, how we deal with the Hive 
 dependency is unclear, i.e. should we continue to bundle Spark's Hive (which 
 has some fixes for dependency conflicts), or do we allow linking against 
 vanilla Hive at runtime?






[jira] [Assigned] (SPARK-5924) Add the ability to specify withMean or withStd parameters with StandarScaler

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5924:
---

Assignee: (was: Apache Spark)

 Add the ability to specify withMean or withStd parameters with StandarScaler
 

 Key: SPARK-5924
 URL: https://issues.apache.org/jira/browse/SPARK-5924
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Jao Rabary
Priority: Trivial

 The current implementation of StandardScaler calls the 
 mllib.feature.StandardScaler default constructor directly, without offering 
 the ability to specify the withMean or withStd parameters.






[jira] [Commented] (SPARK-6847) Stack overflow on updateStateByKey which followed by a dstream with checkpoint set

2015-04-13 Thread Jack Hu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493517#comment-14493517
 ] 

Jack Hu commented on SPARK-6847:


Here is the part of the stack (Full stack at: 
https://gist.github.com/jhu-chang/38a6c052aff1d666b785)
{quote}
15/04/14 11:28:20 [Executor task launch worker-1] ERROR 
org.apache.spark.executor.Executor: Exception in task 1.0 in stage 27554.0 (TID 
3801)
java.lang.StackOverflowError
at 
java.io.ObjectStreamClass.setPrimFieldValues(ObjectStreamClass.java:1243)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1984)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at scala.collection.immutable.$colon$colon.readObject(List.scala:366)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
{quote}

 Stack overflow on updateStateByKey which followed by a dstream with 
 checkpoint set
 --

 Key: SPARK-6847
 URL: https://issues.apache.org/jira/browse/SPARK-6847
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Jack Hu
  Labels: StackOverflowError, Streaming

 The issue happens with the following sample code: uses {{updateStateByKey}} 
 followed by a {{map}} with checkpoint 

[jira] [Updated] (SPARK-6892) Recovery from checkpoint will also reuse the application id when write eventLog in yarn-cluster mode

2015-04-13 Thread yangping wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yangping wu updated SPARK-6892:
---
Description: 
When I recover from a checkpoint in yarn-cluster mode using Spark Streaming, I 
found that it reuses the application id from before the failure (in my case 
application_1428664056212_0016) when writing the Spark eventLog. But my new 
application id is application_1428664056212_0017, so writing the eventLog 
fails with the stack trace below:
{code}
15/04/14 10:14:01 WARN util.ShutdownHookManager: ShutdownHook '$anon$3' failed, 
java.io.IOException: Target log file already exists 
(hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
java.io.IOException: Target log file already exists 
(hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
at 
org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:201)
at 
org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
at 
org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1388)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:107)
at 
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
{code}

This exception causes the job to fail.

  was:
When I recover from a checkpoint in yarn-cluster mode using Spark Streaming, I 
found that it reuses the application id from before the failure (in my case 
application_1428664056212_0016) when writing the Spark eventLog. But my new 
application id is application_1428664056212_0017, so writing the eventLog 
fails with the stack trace below:
{code}
15/04/14 10:14:01 WARN util.ShutdownHookManager: ShutdownHook '$anon$3' failed, 
java.io.IOException: Target log file already exists 
(hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
java.io.IOException: Target log file already exists 
(hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
at 
org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:201)
at 
org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
at 
org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1388)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:107)
at 
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
{code}


 Recovery from checkpoint will also reuse the application id when write 
 eventLog in yarn-cluster mode
 

 Key: SPARK-6892
 URL: https://issues.apache.org/jira/browse/SPARK-6892
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: yangping wu
Priority: Critical

 When I recover from a checkpoint in yarn-cluster mode using Spark Streaming, 
 I found that it reuses the application id from before the failure (in my case 
 application_1428664056212_0016) when writing the Spark eventLog. But my new 
 application id is application_1428664056212_0017, so writing the eventLog 
 fails with the stack trace below:
 {code}
 15/04/14 10:14:01 WARN util.ShutdownHookManager: ShutdownHook '$anon$3' 
 failed, java.io.IOException: Target log file already exists 
 (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
 java.io.IOException: Target log file already exists 
 (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
   at 
 org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:201)
   at 
 org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
   at 
 org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
   at scala.Option.foreach(Option.scala:236)
   at org.apache.spark.SparkContext.stop(SparkContext.scala:1388)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:107)
   at 
 org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
 {code}
 This exception causes the job to fail.






[jira] [Resolved] (SPARK-5957) Better handling of default parameter values.

2015-04-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5957.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5431
[https://github.com/apache/spark/pull/5431]

 Better handling of default parameter values.
 

 Key: SPARK-5957
 URL: https://issues.apache.org/jira/browse/SPARK-5957
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.4.0


 We store the default value of a parameter in the Param instance. In many 
 cases, the default value depends on the algorithm and other parameters 
 defined in the same algorithm. We need to think a better approach to handle 
 default parameter values.
 The design doc was posted in the parent JIRA: 
 https://issues.apache.org/jira/browse/SPARK-5874






[jira] [Commented] (SPARK-6884) random forest predict probabilities functionality (like in sklearn)

2015-04-13 Thread Max Kaznady (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492868#comment-14492868
 ] 

Max Kaznady commented on SPARK-6884:


Implemented a prototype, testing mapReduce code.

 random forest predict probabilities functionality (like in sklearn)
 ---

 Key: SPARK-6884
 URL: https://issues.apache.org/jira/browse/SPARK-6884
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.4.0
 Environment: cross-platform
Reporter: Max Kaznady
  Labels: prediction, probability, randomforest, tree
   Original Estimate: 72h
  Remaining Estimate: 72h

 Currently, there is no way to extract the class probabilities from the 
 RandomForest classifier. I implemented a probability predictor by counting 
 the votes from the individual trees for class 1 and dividing by the total 
 number of votes.
 I opened this ticket to keep track of changes. I will update it once I push 
 my code to master.
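 As a rough sketch of the vote-counting idea described above (not the actual 
 patch), the per-tree predictions of an existing {{RandomForestModel}} can be 
 aggregated like this for a binary classifier:
 {code}
 import org.apache.spark.mllib.linalg.Vector
 import org.apache.spark.mllib.tree.model.RandomForestModel

 // Fraction of trees voting for class 1.0; assumes a classification forest
 // trained on labels {0.0, 1.0}.
 def probabilityOfOne(model: RandomForestModel, features: Vector): Double = {
   val votesForOne = model.trees.count(_.predict(features) == 1.0)
   votesForOne.toDouble / model.trees.length
 }
 {code}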






[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality

2015-04-13 Thread Max Kaznady (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492871#comment-14492871
 ] 

Max Kaznady commented on SPARK-3727:


I thought it would be more fitting to separate this: 
https://issues.apache.org/jira/browse/SPARK-6884

 DecisionTree, RandomForest: More prediction functionality
 -

 Key: SPARK-3727
 URL: https://issues.apache.org/jira/browse/SPARK-3727
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley

 DecisionTree and RandomForest currently predict the most likely label for 
 classification and the mean for regression.  Other info about predictions 
 would be useful.
 For classification: estimated probability of each possible label
 For regression: variance of estimate
 RandomForest could also create aggregate predictions in multiple ways:
 * Predict mean or median value for regression.
 * Compute variance of estimates (across all trees) for both classification 
 and regression.






[jira] [Updated] (SPARK-6884) Random forest: predict class probabilities

2015-04-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6884:
-
Summary: Random forest: predict class probabilities  (was: random forest 
predict probabilities functionality (like in sklearn))

 Random forest: predict class probabilities
 --

 Key: SPARK-6884
 URL: https://issues.apache.org/jira/browse/SPARK-6884
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
 Environment: cross-platform
Reporter: Max Kaznady
  Labels: prediction, probability, randomforest, tree
   Original Estimate: 72h
  Remaining Estimate: 72h

 Currently, there is no way to extract the class probabilities from the 
 RandomForest classifier. I implemented a probability predictor by counting 
 the votes from the individual trees for class 1 and dividing by the total 
 number of votes.
 I opened this ticket to keep track of changes. I will update it once I push 
 my code to master.






[jira] [Updated] (SPARK-6884) random forest predict probabilities functionality (like in sklearn)

2015-04-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6884:
-
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-3727

 random forest predict probabilities functionality (like in sklearn)
 ---

 Key: SPARK-6884
 URL: https://issues.apache.org/jira/browse/SPARK-6884
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
 Environment: cross-platform
Reporter: Max Kaznady
  Labels: prediction, probability, randomforest, tree
   Original Estimate: 72h
  Remaining Estimate: 72h

 Currently, there is no way to extract the class probabilities from the 
 RandomForest classifier. I implemented a probability predictor by counting 
 the votes from the individual trees for class 1 and dividing by the total 
 number of votes.
 I opened this ticket to keep track of changes. I will update it once I push 
 my code to master.






[jira] [Created] (SPARK-6883) Fork pyspark's cloudpickle as a separate dependency

2015-04-13 Thread Kyle Kelley (JIRA)
Kyle Kelley created SPARK-6883:
--

 Summary: Fork pyspark's cloudpickle as a separate dependency
 Key: SPARK-6883
 URL: https://issues.apache.org/jira/browse/SPARK-6883
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Kyle Kelley


IPython, pyspark, and picloud/multyvac/cloudpipe all rely on cloudpickle from 
various sources (cloud, pyspark, and multyvac respectively). It would be 
great to have this as a separately maintained project that can:

* Work with Python3
* Add tests!
* Use higher order pickling (when on Python3)
* Be installed with pip

We're starting this off at the PyCon sprints under 
https://github.com/cloudpipe/cloudpickle. We'd like to coordinate with PySpark 
to make it work across all the above mentioned projects.






[jira] [Updated] (SPARK-6352) Supporting non-default OutputCommitter when using saveAsParquetFile

2015-04-13 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-6352:
--
Assignee: Pei-Lun Lee

 Supporting non-default OutputCommitter when using saveAsParquetFile
 ---

 Key: SPARK-6352
 URL: https://issues.apache.org/jira/browse/SPARK-6352
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.1, 1.2.1, 1.3.0
Reporter: Pei-Lun Lee
Assignee: Pei-Lun Lee
 Fix For: 1.4.0


 SPARK-3595 only handles custom OutputCommitters for saveAsHadoopFile; it 
 would be nice to have similar behavior in saveAsParquetFile. It may be 
 difficult to have a fully customizable OutputCommitter solution, but at least 
 adding something like a DirectParquetOutputCommitter and letting users choose 
 between it and the default should be enough.






[jira] [Commented] (SPARK-6352) Supporting non-default OutputCommitter when using saveAsParquetFile

2015-04-13 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492699#comment-14492699
 ] 

Josh Rosen commented on SPARK-6352:
---

[~lian cheng], we can only assign tickets to users who have the proper role in 
Spark's JIRA permissions.  I've added [~pllee] to the Contributors role and 
will assign this ticket to them. 

 Supporting non-default OutputCommitter when using saveAsParquetFile
 ---

 Key: SPARK-6352
 URL: https://issues.apache.org/jira/browse/SPARK-6352
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.1, 1.2.1, 1.3.0
Reporter: Pei-Lun Lee
 Fix For: 1.4.0


 SPARK-3595 only handles custom OutputCommitters for saveAsHadoopFile; it 
 would be nice to have similar behavior in saveAsParquetFile. It may be 
 difficult to have a fully customizable OutputCommitter solution, but at least 
 adding something like a DirectParquetOutputCommitter and letting users choose 
 between it and the default should be enough.






[jira] [Assigned] (SPARK-5888) Add OneHotEncoder as a Transformer

2015-04-13 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza reassigned SPARK-5888:
-

Assignee: Sandy Ryza

 Add OneHotEncoder as a Transformer
 --

 Key: SPARK-5888
 URL: https://issues.apache.org/jira/browse/SPARK-5888
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xiangrui Meng
Assignee: Sandy Ryza

 `OneHotEncoder` takes a categorical column and outputs a vector column, which 
 stores the category info as binary values.
 {code}
 val ohe = new OneHotEncoder()
   .setInputCol("countryIndex")
   .setOutputCol("countries")
 {code}
 It should read the category info from the metadata and assign feature names 
 properly in the output column. We need to discuss the default naming scheme 
 and whether we should let it process multiple categorical columns at the same 
 time.
 One category (the most frequent one) should be removed from the output to 
 make the output columns linearly independent. Or this could be an option 
 turned on by default.
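 To illustrate the intended encoding (values are hypothetical, not taken from 
 an implementation): with three countries indexed as US=0, UK=1, FR=2 and the 
 most frequent category (US) dropped as the reference, the output would have 
 two binary columns:
 {code}
 import org.apache.spark.mllib.linalg.Vectors

 // countryIndex -> countries vector (US dropped as the reference category)
 val us = Vectors.dense(0.0, 0.0)   // index 0
 val uk = Vectors.dense(1.0, 0.0)   // index 1
 val fr = Vectors.dense(0.0, 1.0)   // index 2
 {code}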






[jira] [Resolved] (SPARK-6849) The constructor of GradientDescent should be public

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6849.
--
Resolution: Duplicate

Yes, I think this is a subset of opening up optimization APIs

 The constructor of GradientDescent should be public
 ---

 Key: SPARK-6849
 URL: https://issues.apache.org/jira/browse/SPARK-6849
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.1
Reporter: Guoqiang Li
Priority: Trivial








[jira] [Updated] (SPARK-5632) not able to resolve dot('.') in field name

2015-04-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5632:
---
Description: 
My Cassandra table task_trace has a field sm.result whose name contains a dot, 
so Spark SQL tries to look up sm instead of the full name 'sm.result'.
Here is my code: 
{code}
scala> import org.apache.spark.sql.cassandra.CassandraSQLContext
scala> val cc = new CassandraSQLContext(sc)
scala> val task_trace = cc.jsonFile("/task_trace.json")
scala> task_trace.registerTempTable("task_trace")
scala> cc.setKeyspace("cerberus_data_v4")
scala> val res = cc.sql("SELECT received_datetime, task_body.cerberus_id, 
task_body.sm.result FROM task_trace WHERE task_id = 
'fff7304e-9984-4b45-b10c-0423a96745ce'")
res: org.apache.spark.sql.SchemaRDD = 
SchemaRDD[57] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
java.lang.RuntimeException: No such struct field sm in cerberus_batch_id, 
cerberus_id, couponId, coupon_code, created, description, domain, expires, 
message_id, neverShowAfter, neverShowBefore, offerTitle, screenshots, 
sm.result, sm.task, startDate, task_id, url, uuid, validationDateTime, validity
{code}

The full schema look like this:
{code}
scala task_trace.printSchema()
root
 \|-- received_datetime: long (nullable = true)
 \|-- task_body: struct (nullable = true)
 \|\|-- cerberus_batch_id: string (nullable = true)
 \|\|-- cerberus_id: string (nullable = true)
 \|\|-- couponId: integer (nullable = true)
 \|\|-- coupon_code: string (nullable = true)
 \|\|-- created: string (nullable = true)
 \|\|-- description: string (nullable = true)
 \|\|-- domain: string (nullable = true)
 \|\|-- expires: string (nullable = true)
 \|\|-- message_id: string (nullable = true)
 \|\|-- neverShowAfter: string (nullable = true)
 \|\|-- neverShowBefore: string (nullable = true)
 \|\|-- offerTitle: string (nullable = true)
 \|\|-- screenshots: array (nullable = true)
 \|\|\|-- element: string (containsNull = false)
 \|\|-- sm.result: struct (nullable = true)
 \|\|\|-- cerberus_batch_id: string (nullable = true)
 \|\|\|-- cerberus_id: string (nullable = true)
 \|\|\|-- code: string (nullable = true)
 \|\|\|-- couponId: integer (nullable = true)
 \|\|\|-- created: string (nullable = true)
 \|\|\|-- description: string (nullable = true)
 \|\|\|-- domain: string (nullable = true)
 \|\|\|-- expires: string (nullable = true)
 \|\|\|-- message_id: string (nullable = true)
 \|\|\|-- neverShowAfter: string (nullable = true)
 \|\|\|-- neverShowBefore: string (nullable = true)
 \|\|\|-- offerTitle: string (nullable = true)
 \|\|\|-- result: struct (nullable = true)
 \|\|\|\|-- post: struct (nullable = true)
 \|\|\|\|\|-- alchemy_out_of_stock: struct (nullable = true)
 \|\|\|\|\|\|-- ci: double (nullable = true)
 \|\|\|\|\|\|-- value: boolean (nullable = true)
 \|\|\|\|\|-- meta: struct (nullable = true)
 \|\|\|\|\|\|-- None_tx_value: array (nullable = true)
 \|\|\|\|\|\|\|-- element: string (containsNull = false)
 \|\|\|\|\|\|-- exceptions: array (nullable = true)
 \|\|\|\|\|\|\|-- element: string (containsNull = false)
 \|\|\|\|\|\|-- no_input_value: array (nullable = true)
 \|\|\|\|\|\|\|-- element: string (containsNull = false)
 \|\|\|\|\|\|-- not_mapped: array (nullable = true)
 \|\|\|\|\|\|\|-- element: string (containsNull = false)
 \|\|\|\|\|\|-- not_transformed: array (nullable = true)
 \|\|\|\|\|\|\|-- element: array (containsNull = false)
 \|\|\|\|\|\|\|\|-- element: string (containsNull = 
false)
 \|\|\|\|\|-- now_price_checkout: struct (nullable = true)
 \|\|\|\|\|\|-- ci: double (nullable = true)
 \|\|\|\|\|\|-- value: double (nullable = true)
 \|\|\|\|\|-- shipping_price: struct (nullable = true)
 \|\|\|\|\|\|-- ci: double (nullable = true)
 \|\|\|\|\|\|-- value: double (nullable = true)
 \|\|\|\|\|-- tax: struct (nullable = true)
 \|\|\|\|\|\|-- ci: double (nullable = true)
 \|\|\|\|\|\|-- value: double (nullable = true)
 \|\|\|\|\|-- total: struct (nullable = true)
 \|\|\|\|\|\|-- ci: double (nullable = true)
 \|\|\|\|\|\|-- value: double (nullable = true)
 \|\|\|\|-- pre: struct (nullable = true)
 \|\|\|\|\|-- alchemy_out_of_stock: struct (nullable = true)
 \|\|\|\|

[jira] [Commented] (SPARK-6847) Stack overflow on updateStateByKey which followed by a dstream with checkpoint set

2015-04-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492126#comment-14492126
 ] 

Sean Owen commented on SPARK-6847:
--

Can you provide (the top part of) the stack overflow trace, so we can see 
where it's occurring? I suspect it's something building a very long object 
graph, but that is the first step to confirm.

 Stack overflow on updateStateByKey which followed by a dstream with 
 checkpoint set
 --

 Key: SPARK-6847
 URL: https://issues.apache.org/jira/browse/SPARK-6847
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Jack Hu
  Labels: StackOverflowError, Streaming

 The issue happens with the following sample code, which uses 
 {{updateStateByKey}} followed by a {{map}} with a checkpoint interval of 10 
 seconds:
 {code}
 val sparkConf = new SparkConf().setAppName("test")
 val streamingContext = new StreamingContext(sparkConf, Seconds(10))
 streamingContext.checkpoint("checkpoint")
 val source = streamingContext.socketTextStream("localhost", )
 val updatedResult = source.map(
 (1,_)).updateStateByKey(
 (newlist : Seq[String], oldstate : Option[String]) => 
 newlist.headOption.orElse(oldstate))
 updatedResult.map(_._2)
 .checkpoint(Seconds(10))
 .foreachRDD((rdd, t) => {
   println("Deep: " + rdd.toDebugString.split("\n").length)
   println(t.toString() + ": " + rdd.collect.length)
 })
 streamingContext.start()
 streamingContext.awaitTermination()
 {code}
 From the output, we can see that the dependency chain keeps growing over 
 time, the {{updateStateByKey}} state never gets check-pointed, and finally 
 the stack overflow happens. 
 Note:
 * The rdd in {{updatedResult.map(_._2)}} gets check-pointed in this case, but 
 not the {{updateStateByKey}} 
 * If the {{checkpoint(Seconds(10))}} on the map result 
 ({{updatedResult.map(_._2)}}) is removed, the stack overflow does not happen






[jira] [Updated] (SPARK-6303) Remove unnecessary Average in GeneratedAggregate

2015-04-13 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-6303:
---
Summary: Remove unnecessary Average in GeneratedAggregate  (was: Average 
should be in canBeCodeGened list)

 Remove unnecessary Average in GeneratedAggregate
 

 Key: SPARK-6303
 URL: https://issues.apache.org/jira/browse/SPARK-6303
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor

 Currently canBeCodeGened only checks Sum, Count, Max, CombineSetsAndCount, 
 CollectHashSet. Average should be in the list too.






[jira] [Updated] (SPARK-6303) Average should be in canBeCodeGened list

2015-04-13 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-6303:
---
Issue Type: Improvement  (was: Bug)

 Average should be in canBeCodeGened list
 

 Key: SPARK-6303
 URL: https://issues.apache.org/jira/browse/SPARK-6303
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh

 Currently canBeCodeGened only checks Sum, Count, Max, CombineSetsAndCount, 
 CollectHashSet. Average should be in the list too.






[jira] [Updated] (SPARK-6303) Average should be in canBeCodeGened list

2015-04-13 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-6303:
---
Priority: Minor  (was: Major)

 Average should be in canBeCodeGened list
 

 Key: SPARK-6303
 URL: https://issues.apache.org/jira/browse/SPARK-6303
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor

 Currently canBeCodeGened only checks Sum, Count, Max, CombineSetsAndCount, 
 CollectHashSet. Average should be in the list too.






[jira] [Commented] (SPARK-6877) Add code generation support for Min

2015-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492041#comment-14492041
 ] 

Apache Spark commented on SPARK-6877:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/5487

 Add code generation support for Min
 ---

 Key: SPARK-6877
 URL: https://issues.apache.org/jira/browse/SPARK-6877
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Liang-Chi Hsieh








[jira] [Assigned] (SPARK-6877) Add code generation support for Min

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6877:
---

Assignee: (was: Apache Spark)

 Add code generation support for Min
 ---

 Key: SPARK-6877
 URL: https://issues.apache.org/jira/browse/SPARK-6877
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Liang-Chi Hsieh








[jira] [Created] (SPARK-6877) Add code generation support for Min

2015-04-13 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-6877:
--

 Summary: Add code generation support for Min
 Key: SPARK-6877
 URL: https://issues.apache.org/jira/browse/SPARK-6877
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Liang-Chi Hsieh









[jira] [Assigned] (SPARK-6877) Add code generation support for Min

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6877:
---

Assignee: Apache Spark

 Add code generation support for Min
 ---

 Key: SPARK-6877
 URL: https://issues.apache.org/jira/browse/SPARK-6877
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Liang-Chi Hsieh
Assignee: Apache Spark








[jira] [Updated] (SPARK-6303) Remove unnecessary Average in GeneratedAggregate

2015-04-13 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-6303:
---
Description: 
Because {{Average}} is a {{PartialAggregate}}, we never get an {{Average}} node 
when reaching {{HashAggregation}} to prepare {{GeneratedAggregate}}.

That is why SQLQuerySuite already has a test for {{avg}} with codegen, and it 
works.

However, {{GeneratedAggregate}} still contains a case that deals with 
{{Average}}. Based on the above, this case is never executed.

So we can remove this case from {{GeneratedAggregate}}.


  was:Currently canBeCodeGened only checks Sum, Count, Max, 
CombineSetsAndCount, CollectHashSet. Average should be in the list too.


 Remove unnecessary Average in GeneratedAggregate
 

 Key: SPARK-6303
 URL: https://issues.apache.org/jira/browse/SPARK-6303
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor

 Because {{Average}} is a {{PartialAggregate}}, we never get an {{Average}} 
 node when reaching {{HashAggregation}} to prepare {{GeneratedAggregate}}.
 That is why SQLQuerySuite already has a test for {{avg}} with codegen, and it 
 works.
 However, {{GeneratedAggregate}} still contains a case that deals with 
 {{Average}}. Based on the above, this case is never executed, so we can 
 remove it from {{GeneratedAggregate}}.






[jira] [Commented] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark

2015-04-13 Thread Alberto (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491993#comment-14491993
 ] 

Alberto commented on SPARK-4783:


Does this mean that you are going to create a PR with a fix/change proposal 
for this, or are you asking someone else to create that PR? If so, I am 
willing to create it.

 System.exit() calls in SparkContext disrupt applications embedding Spark
 

 Key: SPARK-4783
 URL: https://issues.apache.org/jira/browse/SPARK-4783
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: David Semeria

 A common architectural choice for integrating Spark within a larger 
 application is to employ a gateway to handle Spark jobs. The gateway is a 
 server which contains one or more long-running SparkContexts.
 A typical server is created with the following pseudo code:
 var continue = true
 while (continue) {
   try {
     server.run()
   } catch (e) {
     continue = log_and_examine_error(e)
   }
 }
 The problem is that SparkContext frequently calls System.exit when it 
 encounters a problem, which means the server can only be re-spawned at the 
 process level. That is much messier than the simple code above.
 Therefore, I believe it makes sense to replace all System.exit calls in 
 SparkContext with the throwing of a fatal error.
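 For concreteness, a runnable Scala rendering of the gateway loop above 
 ({{JobServer}}, {{run()}} and {{logAndExamineError}} are placeholders 
 mirroring the pseudo code, not an existing API):
 {code}
 trait JobServer { def run(): Unit }

 object Gateway {
   def serve(server: JobServer): Unit = {
     var continue = true
     while (continue) {
       try {
         server.run()       // blocks while the embedded SparkContext handles jobs
         continue = false   // normal shutdown
       } catch {
         case e: Throwable =>
           // With System.exit inside SparkContext this point is never reached;
           // the JVM simply dies and the server must be re-spawned externally.
           continue = logAndExamineError(e)
       }
     }
   }

   private def logAndExamineError(e: Throwable): Boolean = {
     System.err.println(s"Job server failed: ${e.getMessage}")
     !e.isInstanceOf[Error]  // assumption: retry on exceptions, stop on fatal errors
   }
 }
 {code}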






[jira] [Assigned] (SPARK-4961) Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted processing time

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4961:
---

Assignee: (was: Apache Spark)

 Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted 
 processing time
 ---

 Key: SPARK-4961
 URL: https://issues.apache.org/jira/browse/SPARK-4961
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Reporter: YanTang Zhai
Priority: Minor

 HadoopRDD.getPartitions is evaluated lazily, inside DAGScheduler.JobSubmitted 
 processing. If the input directory is large, getPartitions may take a long 
 time; for example, in our cluster it takes anywhere from 0.029s to 766.699s. 
 While one JobSubmitted event is being processed, the others have to wait. We 
 therefore want to move HadoopRDD.getPartitions forward to reduce the 
 DAGScheduler.JobSubmitted processing time, so that other JobSubmitted events 
 don't need to wait as long. The HadoopRDD object could compute its partitions 
 when it is instantiated.
 We could analyse and compare the execution time before and after optimization.
 TaskScheduler.start execution time: [time1__]
 DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or 
 TaskScheduler.start) execution time: [time2_]
 HadoopRDD.getPartitions execution time: [time3___]
 Stages execution time: [time4_]
 (1) The app has only one job
 (a)
 The execution time of the job before optimization is 
 [time1__][time2_][time3___][time4_].
 The execution time of the job after optimization 
 is[time1__][time3___][time2_][time4_].
 In summary, if the app has only one job, the total execution time is the 
 same before and after optimization.
 (2) The app has 4 jobs
 (a) Before optimization,
 job1 execution time is [time2_][time3___][time4_],
 job2 execution time is [time2__][time3___][time4_],
 job3 execution time 
 is[time2][time3___][time4_],
 job4 execution time 
 is[time2_][time3___][time4_].
 After optimization, 
 job1 execution time is [time3___][time2_][time4_],
 job2 execution time is [time3___][time2__][time4_],
 job3 execution time 
 is[time3___][time2_][time4_],
 job4 execution time 
 is[time3___][time2__][time4_].
 In summary, if the app has multiple jobs, average execution time after 
 optimization is less than before.
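 As a rough sketch of the idea (not the proposed patch), partitions can 
 already be materialized eagerly on the driver by touching {{rdd.partitions}} 
 right after the RDD is created, before any job is submitted:
 {code}
 import org.apache.spark.{SparkConf, SparkContext}

 val sc = new SparkContext(new SparkConf().setAppName("eager-partitions"))

 // sc.textFile is backed by a HadoopRDD; accessing .partitions here forces
 // getPartitions to run now, on the driver, instead of inside the
 // DAGScheduler's JobSubmitted handling for the first action.
 val input = sc.textFile("hdfs:///path/to/large/inputdir")  // placeholder path
 println("Resolved " + input.partitions.length + " partitions before any job")

 println(input.count())  // the job no longer pays the getPartitions cost
 {code}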






[jira] [Assigned] (SPARK-4961) Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted processing time

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4961:
---

Assignee: Apache Spark

 Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted 
 processing time
 ---

 Key: SPARK-4961
 URL: https://issues.apache.org/jira/browse/SPARK-4961
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Reporter: YanTang Zhai
Assignee: Apache Spark
Priority: Minor

 HadoopRDD.getPartitions is evaluated lazily, inside DAGScheduler.JobSubmitted 
 processing. If the input directory is large, getPartitions may take a long 
 time; for example, in our cluster it takes anywhere from 0.029s to 766.699s. 
 While one JobSubmitted event is being processed, the others have to wait. We 
 therefore want to move HadoopRDD.getPartitions forward to reduce the 
 DAGScheduler.JobSubmitted processing time, so that other JobSubmitted events 
 don't need to wait as long. The HadoopRDD object could compute its partitions 
 when it is instantiated.
 We could analyse and compare the execution time before and after optimization.
 TaskScheduler.start execution time: [time1__]
 DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or 
 TaskScheduler.start) execution time: [time2_]
 HadoopRDD.getPartitions execution time: [time3___]
 Stages execution time: [time4_]
 (1) The app has only one job
 (a)
 The execution time of the job before optimization is 
 [time1__][time2_][time3___][time4_].
 The execution time of the job after optimization 
 is[time1__][time3___][time2_][time4_].
 In summary, if the app has only one job, the total execution time is the 
 same before and after optimization.
 (2) The app has 4 jobs
 (a) Before optimization,
 job1 execution time is [time2_][time3___][time4_],
 job2 execution time is [time2__][time3___][time4_],
 job3 execution time 
 is[time2][time3___][time4_],
 job4 execution time 
 is[time2_][time3___][time4_].
 After optimization, 
 job1 execution time is [time3___][time2_][time4_],
 job2 execution time is [time3___][time2__][time4_],
 job3 execution time 
 is[time3___][time2_][time4_],
 job4 execution time 
 is[time3___][time2__][time4_].
 In summary, if the app has multiple jobs, average execution time after 
 optimization is less than before.






[jira] [Updated] (SPARK-6562) DataFrame.na.replace value support in Scala/Java

2015-04-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6562:
---
Summary: DataFrame.na.replace value support in Scala/Java  (was: 
DataFrame.na.replace value support)

 DataFrame.na.replace value support in Scala/Java
 

 Key: SPARK-6562
 URL: https://issues.apache.org/jira/browse/SPARK-6562
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.4.0


 Support replacing a set of values with another set of values (i.e. map join), 
 similar to Pandas' replace.
 http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html
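
For a sense of the Scala-side call this asks for, a small local-mode sketch (the exact 
replace signature is assumed from the Pandas analogy, not taken from the final change):
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Measurement(name: String, height: Double)

object NaReplaceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("na-replace-sketch").setMaster("local[1]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(Measurement("a", 999.0), Measurement("b", 1.80))).toDF()
    // Assumed API, mirroring pandas.DataFrame.replace: map old values to new
    // values in the named column.
    val cleaned = df.na.replace("height", Map(999.0 -> Double.NaN))
    cleaned.show()
    sc.stop()
  }
}
{code}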



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-13 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491965#comment-14491965
 ] 

Yu Ishikawa commented on SPARK-6682:


[~josephkb] Sorry, one more question: are we allowed to add test suites in 
spark.examples?
We don't have any test suites in spark.examples today. However, I think we should 
have them to make sure the examples keep working, and this issue seems like a good 
time to add them.

 Deprecate static train and use builder instead for Scala/Java
 -

 Key: SPARK-6682
 URL: https://issues.apache.org/jira/browse/SPARK-6682
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 In MLlib, we have for some time been unofficially moving away from the old 
 static train() methods and moving towards builder patterns.  This JIRA is to 
 discuss this move and (hopefully) make it official.
 Old static train() API:
 {code}
 val myModel = NaiveBayes.train(myData, ...)
 {code}
 New builder pattern API:
 {code}
 val nb = new NaiveBayes().setLambda(0.1)
 val myModel = nb.train(myData)
 {code}
 Pros of the builder pattern:
 * Much less code when algorithms have many parameters.  Since Java does not 
 support default arguments, we required *many* duplicated static train() 
 methods (for each prefix set of arguments).
 * Helps to enforce default parameters.  Users should ideally not have to even 
 think about setting parameters if they just want to try an algorithm quickly.
 * Matches spark.ml API
 Cons of the builder pattern:
 * In Python APIs, static train methods are more Pythonic.
 Proposal:
 * Scala/Java: We should start deprecating the old static train() methods.  We 
 must keep them for API stability, but deprecating will help with API 
 consistency, making it clear that everyone should use the builder pattern.  
 As we deprecate them, we should make sure that the builder pattern supports 
 all parameters.
 * Python: Keep static train methods.
 CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6868) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6868.
--
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.2
 Assignee: Dean Chen

 Container link broken on Spark UI Executors page when YARN is set to 
 HTTPS_ONLY
 ---

 Key: SPARK-6868
 URL: https://issues.apache.org/jira/browse/SPARK-6868
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.3.0
Reporter: Dean Chen
Assignee: Dean Chen
 Fix For: 1.3.2, 1.4.0

 Attachments: Screen Shot 2015-04-11 at 11.49.21 PM.png


 The stdout and stderr log links on the executor page use the http:// prefix 
 even if the node manager does not serve http and only serves https because 
 yarn.http.policy=HTTPS_ONLY is set.
 Unfortunately, the unencrypted http link in that case does not return a 404 
 but a binary file containing random binary characters. This causes a lot of 
 confusion for the end user, since it looks as if the log file exists and is 
 just filled with garbage (see the attached screenshot).
 The fix is to prefix container log links with https:// instead of http:// if 
 yarn.http.policy=HTTPS_ONLY. YARN's job page has this exact logic as seen 
 here: 
 https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/JobBlock.java#L108
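
A standalone sketch of the scheme-selection logic (illustrative only, not the actual 
Spark patch; the node address and container id below are placeholders):
{code}
import org.apache.hadoop.conf.Configuration

object ContainerLogLinkSketch {
  // Pick the scheme for container log links from yarn.http.policy.
  def logPrefix(yarnConf: Configuration): String = {
    val policy = yarnConf.get("yarn.http.policy", "HTTP_ONLY")
    if (policy.equalsIgnoreCase("HTTPS_ONLY")) "https://" else "http://"
  }

  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("yarn.http.policy", "HTTPS_ONLY")
    // Placeholder node address and container id, just to show the resulting link.
    println(logPrefix(conf) +
      "nm-host:8044/node/containerlogs/container_0001_01_000001/stderr")
  }
}
{code}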



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6868) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6868:
-
Priority: Minor  (was: Major)

 Container link broken on Spark UI Executors page when YARN is set to 
 HTTPS_ONLY
 ---

 Key: SPARK-6868
 URL: https://issues.apache.org/jira/browse/SPARK-6868
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.3.0
Reporter: Dean Chen
Assignee: Dean Chen
Priority: Minor
 Fix For: 1.3.2, 1.4.0

 Attachments: Screen Shot 2015-04-11 at 11.49.21 PM.png


 The stdout and stderr log links on the executor page use the http:// prefix 
 even if the node manager does not serve http and only serves https because 
 yarn.http.policy=HTTPS_ONLY is set.
 Unfortunately, the unencrypted http link in that case does not return a 404 
 but a binary file containing random binary characters. This causes a lot of 
 confusion for the end user, since it looks as if the log file exists and is 
 just filled with garbage (see the attached screenshot).
 The fix is to prefix container log links with https:// instead of http:// if 
 yarn.http.policy=HTTPS_ONLY. YARN's job page has this exact logic as seen 
 here: 
 https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/JobBlock.java#L108



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6860) Fix the possible inconsistency of StreamingPage

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6860:
-
Priority: Minor  (was: Major)
Assignee: Shixiong Zhu

 Fix the possible inconsistency of StreamingPage
 ---

 Key: SPARK-6860
 URL: https://issues.apache.org/jira/browse/SPARK-6860
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Web UI
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
Priority: Minor
 Fix For: 1.4.0


 Because StreamingPage.render doesn't hold the listener lock while generating 
 the content, different parts of the content may show inconsistent values if 
 the listener updates its state at the same time, which confuses people.
 We should add listener.synchronized to make sure we have a consistent view 
 of StreamingJobProgressListener when creating the content.
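
A standalone illustration of the snapshot-under-lock idea (not the Spark code):
{code}
// Every read of the listener's fields happens inside one synchronized block,
// so the rendered page reflects a single consistent snapshot.
class ProgressListener {
  private var completedBatches = 0L
  private var lastBatchTime = 0L

  def update(batchTime: Long): Unit = synchronized {
    completedBatches += 1
    lastBatchTime = batchTime
  }

  def snapshot(): (Long, Long) = synchronized {
    (completedBatches, lastBatchTime)
  }
}

object StreamingPageSketch {
  def render(listener: ProgressListener): String = {
    val (batches, last) = listener.snapshot() // both values come from one locked read
    s"completed batches: $batches, last batch time: $last"
  }
}
{code}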



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6870) Catch InterruptedException when yarn application state monitor thread been interrupted

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6870:
-
Priority: Trivial  (was: Minor)
Assignee: Weizhong

 Catch InterruptedException when yarn application state monitor thread been 
 interrupted
 --

 Key: SPARK-6870
 URL: https://issues.apache.org/jira/browse/SPARK-6870
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Weizhong
Assignee: Weizhong
Priority: Trivial
 Fix For: 1.4.0


 In PR #5305 we interrupt the monitor thread but forget to catch the resulting 
 InterruptedException, so its stack trace gets printed to the log; we need to 
 catch it.
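
A standalone sketch of the pattern (not the actual patch): the monitor loop treats 
InterruptedException as a normal shutdown signal rather than letting its stack trace 
reach the log.
{code}
object MonitorThreadSketch {
  def main(args: Array[String]): Unit = {
    val monitorThread = new Thread("yarn-app-state-monitor") {
      override def run(): Unit = {
        try {
          while (!Thread.currentThread().isInterrupted) {
            // poll the YARN application report here
            Thread.sleep(1000)
          }
        } catch {
          case _: InterruptedException =>
            // interrupted on purpose during shutdown; exit quietly, no stack trace
        }
      }
    }
    monitorThread.start()
    Thread.sleep(3000)
    monitorThread.interrupt()   // no InterruptedException stack trace in the log
    monitorThread.join()
  }
}
{code}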



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-04-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492215#comment-14492215
 ] 

Sean Owen commented on SPARK-1529:
--

(Sorry if this double-posts.)

Is there a good way to see the whole diff at the moment? I know there's a 
branch with individual commits. Maybe I am missing something basic.

This puts a new abstraction on top of a Hadoop FileSystem on top of the 
underlying file system abstraction. That's getting heavy. If it's only 
abstracting access to an InputStream / OutputStream, why is it needed? That's 
already directly available from, say, Hadoop's FileSystem.

What would be the performance gain if this is the bit being swapped out? This 
is my original question -- you shuffle to HDFS, then read it back to send it 
again via the existing shuffle? It kind of made sense when the idea was to swap 
the whole shuffle to replace its transport.

 Support setting spark.local.dirs to a hadoop FileSystem 
 

 Key: SPARK-1529
 URL: https://issues.apache.org/jira/browse/SPARK-1529
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Kannan Rajah
 Attachments: Spark Shuffle using HDFS.pdf


 In some environments, like with MapR, local volumes are accessed through the 
 Hadoop filesystem interface. We should allow setting spark.local.dir to a 
 Hadoop filesystem location. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Erik van Oosten (JIRA)
Erik van Oosten created SPARK-6878:
--

 Summary: Sum on empty RDD fails with exception
 Key: SPARK-6878
 URL: https://issues.apache.org/jira/browse/SPARK-6878
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Erik van Oosten
Priority: Minor


{{Sum}} on an empty RDD throws an exception. Expected result is {{0}}.

A simple fix is to replace

{noformat}
class DoubleRDDFunctions {
  def sum(): Double = self.reduce(_ + _)
{noformat} 

with:

{noformat}
class DoubleRDDFunctions {
  def sum(): Double = self.aggregate(0.0)(_ + _, _ + _)
{noformat}
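
A small local-mode check of the difference (illustrative; reduce has no zero element, 
while fold and aggregate do):
{code}
import org.apache.spark.{SparkConf, SparkContext}

object EmptySumSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("empty-sum").setMaster("local[1]"))
    val empty = sc.parallelize(Seq.empty[Double])

    // empty.reduce(_ + _) would throw UnsupportedOperationException here,
    // because reduce has no zero element to fall back on.
    println(empty.fold(0.0)(_ + _))              // 0.0
    println(empty.aggregate(0.0)(_ + _, _ + _))  // 0.0
    sc.stop()
  }
}
{code}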




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6762) Fix potential resource leaks in CheckPoint CheckpointWriter and CheckpointReader

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6762.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5407
[https://github.com/apache/spark/pull/5407]

 Fix potential resource leaks in CheckPoint CheckpointWriter and 
 CheckpointReader
 

 Key: SPARK-6762
 URL: https://issues.apache.org/jira/browse/SPARK-6762
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: zhichao-li
Priority: Minor
 Fix For: 1.4.0


 The close action should be placed within a finally block to avoid potential 
 resource leaks.
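
The pattern, shown on a plain stream for illustration (not the actual 
CheckpointWriter/CheckpointReader code):
{code}
import java.io.{FileInputStream, InputStream}

object CloseInFinallySketch {
  // Always close the stream in a finally block so that a failure mid-read
  // cannot leak the underlying file handle.
  def readFirstByte(path: String): Int = {
    val in: InputStream = new FileInputStream(path)
    try {
      in.read()
    } finally {
      in.close()
    }
  }
}
{code}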



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Erik van Oosten (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492302#comment-14492302
 ] 

Erik van Oosten commented on SPARK-6878:


Ah, yes. I now see that fold also first reduces per partition.

 Sum on empty RDD fails with exception
 -

 Key: SPARK-6878
 URL: https://issues.apache.org/jira/browse/SPARK-6878
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Erik van Oosten
Priority: Minor

 {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}.
 A simple fix is to replace
 {noformat}
 class DoubleRDDFunctions {
   def sum(): Double = self.reduce(_ + _)
 {noformat} 
 with:
 {noformat}
 class DoubleRDDFunctions {
   def sum(): Double = self.aggregate(0.0)(_ + _, _ + _)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6440) ipv6 URI for HttpServer

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6440:
-
Assignee: Arsenii Krasikov

 ipv6 URI for HttpServer
 ---

 Key: SPARK-6440
 URL: https://issues.apache.org/jira/browse/SPARK-6440
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
 Environment: java 7 hotspot, spark 1.3.0, ipv6 only cluster
Reporter: Arsenii Krasikov
Assignee: Arsenii Krasikov
Priority: Minor
 Fix For: 1.4.0


 In {{org.apache.spark.HttpServer}} the uri is generated as {code:java}"spark://" 
 + localHostname + ":" + masterPort{code}, where {{localHostname}} is 
 {code:java}org.apache.spark.util.Utils.localHostName() = 
 customHostname.getOrElse(localIpAddressHostname){code}. If the host has an 
 ipv6 address then it is interpolated into an invalid URI: 
 {{spark://fe80:0:0:0:200:f8ff:fe21:67cf:42}} instead of 
 {{spark://[fe80:0:0:0:200:f8ff:fe21:67cf]:42}}.
 The solution is to keep the uri and the hostname as separate entities.
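
A minimal sketch of the bracketing idea (illustrative only, not the actual fix):
{code}
object Ipv6UriSketch {
  // Wrap bare IPv6 literals in brackets before building a host:port URI.
  def hostPortUri(scheme: String, host: String, port: Int): String = {
    val safeHost =
      if (host.contains(":") && !host.startsWith("[")) s"[$host]" else host
    s"$scheme://$safeHost:$port"
  }

  def main(args: Array[String]): Unit = {
    println(hostPortUri("spark", "fe80:0:0:0:200:f8ff:fe21:67cf", 42))
    // prints spark://[fe80:0:0:0:200:f8ff:fe21:67cf]:42
  }
}
{code}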



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6440) ipv6 URI for HttpServer

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6440.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5424
[https://github.com/apache/spark/pull/5424]

 ipv6 URI for HttpServer
 ---

 Key: SPARK-6440
 URL: https://issues.apache.org/jira/browse/SPARK-6440
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
 Environment: java 7 hotspot, spark 1.3.0, ipv6 only cluster
Reporter: Arsenii Krasikov
Priority: Minor
 Fix For: 1.4.0


 In {{org.apache.spark.HttpServer}} the uri is generated as {code:java}"spark://" 
 + localHostname + ":" + masterPort{code}, where {{localHostname}} is 
 {code:java}org.apache.spark.util.Utils.localHostName() = 
 customHostname.getOrElse(localIpAddressHostname){code}. If the host has an 
 ipv6 address then it is interpolated into an invalid URI: 
 {{spark://fe80:0:0:0:200:f8ff:fe21:67cf:42}} instead of 
 {{spark://[fe80:0:0:0:200:f8ff:fe21:67cf]:42}}.
 The solution is to keep the uri and the hostname as separate entities.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6870) Catch InterruptedException when yarn application state monitor thread been interrupted

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6870.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5479
[https://github.com/apache/spark/pull/5479]

 Catch InterruptedException when yarn application state monitor thread been 
 interrupted
 --

 Key: SPARK-6870
 URL: https://issues.apache.org/jira/browse/SPARK-6870
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Weizhong
Priority: Minor
 Fix For: 1.4.0


 In PR #5305 we interrupt the monitor thread but forget to catch the resulting 
 InterruptedException, so its stack trace gets printed to the log; we need to 
 catch it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6738) EstimateSize is difference with spill file size

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6738.
--
Resolution: Not A Problem

We can reopen if there is more detail, but the problem report is focusing on 
the size of one spill file when there are lots of them. The in-memory size is 
also not necessarily the on-disk size. I haven't seen a report of a problem 
here either, like something that then fails.

 EstimateSize  is difference with spill file size
 

 Key: SPARK-6738
 URL: https://issues.apache.org/jira/browse/SPARK-6738
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Hong Shen

 ExternalAppendOnlyMap spills a 2.2 GB in-memory map to disk:
 {code}
 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling 
 in-memory map of 2.2 GB to disk (61 times so far)
 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: 
 /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
 {code}
 But the file size is only 2.2M.
 {code}
 ll -h 
 /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/
 total 2.2M
 -rw-r- 1 spark users 2.2M Apr  7 20:27 
 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
 {code}
 The GC log shows that the JVM memory is less than 1 GB.
 {code}
 2015-04-07T20:27:08.023+0800: [GC 981981K-55363K(3961344K), 0.0341720 secs]
 2015-04-07T20:27:14.483+0800: [GC 987523K-53737K(3961344K), 0.0252660 secs]
 2015-04-07T20:27:20.793+0800: [GC 985897K-56370K(3961344K), 0.0606460 secs]
 2015-04-07T20:27:27.553+0800: [GC 988530K-59089K(3961344K), 0.0651840 secs]
 2015-04-07T20:27:34.067+0800: [GC 991249K-62153K(3961344K), 0.0288460 secs]
 2015-04-07T20:27:40.180+0800: [GC 994313K-61344K(3961344K), 0.0388970 secs]
 2015-04-07T20:27:46.490+0800: [GC 993504K-59915K(3961344K), 0.0235150 secs]
 {code}
 The estimated size is hugely different from the spill file size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6762) Fix potential resource leaks in CheckPoint CheckpointWriter and CheckpointReader

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6762:
-
Assignee: zhichao-li

 Fix potential resource leaks in CheckPoint CheckpointWriter and 
 CheckpointReader
 

 Key: SPARK-6762
 URL: https://issues.apache.org/jira/browse/SPARK-6762
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: zhichao-li
Assignee: zhichao-li
Priority: Minor
 Fix For: 1.4.0


 The close action should be placed within a finally block to avoid potential 
 resource leaks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5281) Registering table on RDD is giving MissingRequirementError

2015-04-13 Thread Yajun Dong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492246#comment-14492246
 ] 

Yajun Dong commented on SPARK-5281:
---

I also have this issue with Eclipse Luna and Spark 1.3.0. Any ideas?

 Registering table on RDD is giving MissingRequirementError
 --

 Key: SPARK-5281
 URL: https://issues.apache.org/jira/browse/SPARK-5281
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: sarsol
Priority: Critical

 The application crashes on the line {{rdd.registerTempTable("temp")}} in version 
 1.2 when using sbt or the Eclipse Scala IDE.
 Stacktrace:
 {code}
 Exception in thread "main" scala.reflect.internal.MissingRequirementError: 
 class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with 
 primordial classloader with boot classpath 
 [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program
  Files\Java\jre7\lib\resources.jar;C:\Program 
 Files\Java\jre7\lib\rt.jar;C:\Program 
 Files\Java\jre7\lib\sunrsasign.jar;C:\Program 
 Files\Java\jre7\lib\jsse.jar;C:\Program 
 Files\Java\jre7\lib\jce.jar;C:\Program 
 Files\Java\jre7\lib\charsets.jar;C:\Program 
 Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found.
   at 
 scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
   at 
 scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
   at 
 scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
   at 
 scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
   at 
 scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
   at 
 scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
   at 
 scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115)
   at 
 scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
   at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
   at scala.reflect.api.Universe.typeOf(Universe.scala:59)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33)
   at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111)
   at 
 com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43)
   at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
   at 
 scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
   at scala.App$$anonfun$main$1.apply(App.scala:71)
   at scala.App$$anonfun$main$1.apply(App.scala:71)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at 
 scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
   at scala.App$class.main(App.scala:71)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6800:
---

Assignee: Apache Spark

 Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions 
 gives incorrect results.
 --

 Key: SPARK-6800
 URL: https://issues.apache.org/jira/browse/SPARK-6800
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: Windows 8.1, Apache Derby DB, Spark 1.3.0 CDH5.4.0, 
 Scala 2.10
Reporter: Micael Capitão
Assignee: Apache Spark

 Having a Derby table with people info (id, name, age) defined like this:
 {code}
 val jdbcUrl = "jdbc:derby:memory:PeopleDB;create=true"
 val conn = DriverManager.getConnection(jdbcUrl)
 val stmt = conn.createStatement()
 stmt.execute("CREATE TABLE Person (person_id INT NOT NULL GENERATED ALWAYS AS 
 IDENTITY CONSTRAINT person_pk PRIMARY KEY, name VARCHAR(50), age INT)")
 stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Carvalho', 50)")
 stmt.execute("INSERT INTO Person(name, age) VALUES('Lurdes Pereira', 23)")
 stmt.execute("INSERT INTO Person(name, age) VALUES('Ana Rita Costa', 12)")
 stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Pereira', 32)")
 stmt.execute("INSERT INTO Person(name, age) VALUES('Miguel Costa', 15)")
 stmt.execute("INSERT INTO Person(name, age) VALUES('Anabela Sintra', 13)")
 {code}
 If I try to read that table from Spark SQL with lower/upper bounds, like this:
 {code}
 val people = sqlContext.jdbc(url = jdbcUrl, table = "Person",
   columnName = "age", lowerBound = 0, upperBound = 40, numPartitions = 10)
 people.show()
 {code}
 I get this result:
 {noformat}
 PERSON_ID NAME AGE
 3 Ana Rita Costa   12 
 5 Miguel Costa 15 
 6 Anabela Sintra   13 
 2 Lurdes Pereira   23 
 4 Armando Pereira  32 
 1 Armando Carvalho 50 
 {noformat}
 Which is wrong, considering the defined upper bound has been ignored (I get a 
 person with age 50!).
 Digging the code, I've found that in {{JDBCRelation.columnPartition}} the 
 WHERE clauses it generates are the following:
 {code}
 (0) age < 4,0
 (1) age >= 4  AND age < 8,1
 (2) age >= 8  AND age < 12,2
 (3) age >= 12 AND age < 16,3
 (4) age >= 16 AND age < 20,4
 (5) age >= 20 AND age < 24,5
 (6) age >= 24 AND age < 28,6
 (7) age >= 28 AND age < 32,7
 (8) age >= 32 AND age < 36,8
 (9) age >= 36,9
 {code}
 The last condition ignores the upper bound and the other ones may result in 
 repeated rows being read.
 Using the JdbcRDD (and converting it to a DataFrame) I would have something 
 like this:
 {code}
 val jdbcRdd = new JdbcRDD(sc, () => DriverManager.getConnection(jdbcUrl),
   "SELECT * FROM Person WHERE age >= ? and age <= ?", 0, 40, 10,
   rs => (rs.getInt(1), rs.getString(2), rs.getInt(3)))
 val people = jdbcRdd.toDF("PERSON_ID", "NAME", "AGE")
 people.show()
 {code}
 Resulting in:
 {noformat}
 PERSON_ID NAMEAGE
 3 Ana Rita Costa  12 
 5 Miguel Costa15 
 6 Anabela Sintra  13 
 2 Lurdes Pereira  23 
 4 Armando Pereira 32 
 {noformat}
 Which is correct!
 Confirming the WHERE clauses generated by the JdbcRDD in the 
 {{getPartitions}} I've found it generates the following:
 {code}
 (0) age >= 0  AND age <= 3
 (1) age >= 4  AND age <= 7
 (2) age >= 8  AND age <= 11
 (3) age >= 12 AND age <= 15
 (4) age >= 16 AND age <= 19
 (5) age >= 20 AND age <= 23
 (6) age >= 24 AND age <= 27
 (7) age >= 28 AND age <= 31
 (8) age >= 32 AND age <= 35
 (9) age >= 36 AND age <= 40
 {code}
 This is the behaviour I was expecting from the Spark SQL version. Is the 
 Spark SQL version buggy or is this some weird expected behaviour?
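
For comparison, a simplified sketch of non-overlapping, bound-respecting predicates 
(illustrative only, not the Spark or JdbcRDD source; edge cases such as numPartitions 
exceeding the value range are ignored):
{code}
object JdbcPartitionSketch {
  // Split [lower, upper] into numPartitions inclusive ranges whose WHERE
  // clauses neither overlap nor drop rows at either bound.
  def partitionPredicates(column: String, lower: Long, upper: Long,
                          numPartitions: Int): Seq[String] = {
    val stride = math.max((upper - lower + 1) / numPartitions, 1L)
    (0 until numPartitions).map { i =>
      val lo = lower + i * stride
      val hi = if (i == numPartitions - 1) upper else lo + stride - 1
      s"$column >= $lo AND $column <= $hi"
    }
  }

  def main(args: Array[String]): Unit = {
    // Reproduces JdbcRDD-style clauses: age >= 0 AND age <= 3, ..., age >= 36 AND age <= 40
    partitionPredicates("age", 0, 40, 10).foreach(println)
  }
}
{code}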



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.

2015-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492244#comment-14492244
 ] 

Apache Spark commented on SPARK-6800:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/5488

 Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions 
 gives incorrect results.
 --

 Key: SPARK-6800
 URL: https://issues.apache.org/jira/browse/SPARK-6800
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: Windows 8.1, Apache Derby DB, Spark 1.3.0 CDH5.4.0, 
 Scala 2.10
Reporter: Micael Capitão

 Having a Derby table with people info (id, name, age) defined like this:
 {code}
 val jdbcUrl = "jdbc:derby:memory:PeopleDB;create=true"
 val conn = DriverManager.getConnection(jdbcUrl)
 val stmt = conn.createStatement()
 stmt.execute("CREATE TABLE Person (person_id INT NOT NULL GENERATED ALWAYS AS 
 IDENTITY CONSTRAINT person_pk PRIMARY KEY, name VARCHAR(50), age INT)")
 stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Carvalho', 50)")
 stmt.execute("INSERT INTO Person(name, age) VALUES('Lurdes Pereira', 23)")
 stmt.execute("INSERT INTO Person(name, age) VALUES('Ana Rita Costa', 12)")
 stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Pereira', 32)")
 stmt.execute("INSERT INTO Person(name, age) VALUES('Miguel Costa', 15)")
 stmt.execute("INSERT INTO Person(name, age) VALUES('Anabela Sintra', 13)")
 {code}
 If I try to read that table from Spark SQL with lower/upper bounds, like this:
 {code}
 val people = sqlContext.jdbc(url = jdbcUrl, table = "Person",
   columnName = "age", lowerBound = 0, upperBound = 40, numPartitions = 10)
 people.show()
 {code}
 I get this result:
 {noformat}
 PERSON_ID NAME AGE
 3 Ana Rita Costa   12 
 5 Miguel Costa 15 
 6 Anabela Sintra   13 
 2 Lurdes Pereira   23 
 4 Armando Pereira  32 
 1 Armando Carvalho 50 
 {noformat}
 Which is wrong, considering the defined upper bound has been ignored (I get a 
 person with age 50!).
 Digging the code, I've found that in {{JDBCRelation.columnPartition}} the 
 WHERE clauses it generates are the following:
 {code}
 (0) age < 4,0
 (1) age >= 4  AND age < 8,1
 (2) age >= 8  AND age < 12,2
 (3) age >= 12 AND age < 16,3
 (4) age >= 16 AND age < 20,4
 (5) age >= 20 AND age < 24,5
 (6) age >= 24 AND age < 28,6
 (7) age >= 28 AND age < 32,7
 (8) age >= 32 AND age < 36,8
 (9) age >= 36,9
 {code}
 The last condition ignores the upper bound and the other ones may result in 
 repeated rows being read.
 Using the JdbcRDD (and converting it to a DataFrame) I would have something 
 like this:
 {code}
 val jdbcRdd = new JdbcRDD(sc, () => DriverManager.getConnection(jdbcUrl),
   "SELECT * FROM Person WHERE age >= ? and age <= ?", 0, 40, 10,
   rs => (rs.getInt(1), rs.getString(2), rs.getInt(3)))
 val people = jdbcRdd.toDF("PERSON_ID", "NAME", "AGE")
 people.show()
 {code}
 Resulting in:
 {noformat}
 PERSON_ID NAMEAGE
 3 Ana Rita Costa  12 
 5 Miguel Costa15 
 6 Anabela Sintra  13 
 2 Lurdes Pereira  23 
 4 Armando Pereira 32 
 {noformat}
 Which is correct!
 Confirming the WHERE clauses generated by the JdbcRDD in the 
 {{getPartitions}} I've found it generates the following:
 {code}
 (0) age >= 0  AND age <= 3
 (1) age >= 4  AND age <= 7
 (2) age >= 8  AND age <= 11
 (3) age >= 12 AND age <= 15
 (4) age >= 16 AND age <= 19
 (5) age >= 20 AND age <= 23
 (6) age >= 24 AND age <= 27
 (7) age >= 28 AND age <= 31
 (8) age >= 32 AND age <= 35
 (9) age >= 36 AND age <= 40
 {code}
 This is the behaviour I was expecting from the Spark SQL version. Is the 
 Spark SQL version buggy or is this some weird expected behaviour?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6860) Fix the possible inconsistency of StreamingPage

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6860.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5470
[https://github.com/apache/spark/pull/5470]

 Fix the possible inconsistency of StreamingPage
 ---

 Key: SPARK-6860
 URL: https://issues.apache.org/jira/browse/SPARK-6860
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Web UI
Reporter: Shixiong Zhu
 Fix For: 1.4.0


 Because StreamingPage.render doesn't hold the listener lock while generating 
 the content, different parts of the content may show inconsistent values if 
 the listener updates its state at the same time, which confuses people.
 We should add listener.synchronized to make sure we have a consistent view 
 of StreamingJobProgressListener when creating the content.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492284#comment-14492284
 ] 

Sean Owen commented on SPARK-6878:
--

Yes, and I think it could even be a little simpler by calling {{fold(0.0)(_ + 
_)}} ?

 Sum on empty RDD fails with exception
 -

 Key: SPARK-6878
 URL: https://issues.apache.org/jira/browse/SPARK-6878
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Erik van Oosten
Priority: Minor

 {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}.
 A simple fix is to replace
 {noformat}
 class DoubleRDDFunctions {
   def sum(): Double = self.reduce(_ + _)
 {noformat} 
 with:
 {noformat}
 class DoubleRDDFunctions {
   def sum(): Double = self.aggregate(0.0)(_ + _, _ + _)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6878:
---

Assignee: Apache Spark

 Sum on empty RDD fails with exception
 -

 Key: SPARK-6878
 URL: https://issues.apache.org/jira/browse/SPARK-6878
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Erik van Oosten
Assignee: Apache Spark
Priority: Minor

 {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}.
 A simple fix is to replace
 {noformat}
 class DoubleRDDFunctions {
   def sum(): Double = self.reduce(_ + _)
 {noformat} 
 with:
 {noformat}
 class DoubleRDDFunctions {
   def sum(): Double = self.aggregate(0.0)(_ + _, _ + _)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Erik van Oosten (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492336#comment-14492336
 ] 

Erik van Oosten commented on SPARK-6878:


Pull request: https://github.com/apache/spark/pull/5489

 Sum on empty RDD fails with exception
 -

 Key: SPARK-6878
 URL: https://issues.apache.org/jira/browse/SPARK-6878
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Erik van Oosten
Priority: Minor

 {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}.
 A simple fix is to replace
 {noformat}
 class DoubleRDDFunctions {
   def sum(): Double = self.reduce(_ + _)
 {noformat} 
 with:
 {noformat}
 class DoubleRDDFunctions {
   def sum(): Double = self.aggregate(0.0)(_ + _, _ + _)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492271#comment-14492271
 ] 

Sean Owen commented on SPARK-6878:
--

Interesting question -- what's the expected sum of nothing at all? Although I 
can see the argument both ways, 0 is probably the better result, since 
{{Array[Double]().sum}} is 0, so {{sc.parallelize(Array[Double]()).sum}} should 
be 0 as well. Want to make a PR?

 Sum on empty RDD fails with exception
 -

 Key: SPARK-6878
 URL: https://issues.apache.org/jira/browse/SPARK-6878
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Erik van Oosten
Priority: Minor

 {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}.
 A simple fix is to replace
 {noformat}
 class DoubleRDDFunctions {
   def sum(): Double = self.reduce(_ + _)
 {noformat} 
 with:
 {noformat}
 class DoubleRDDFunctions {
   def sum(): Double = self.aggregate(0.0)(_ + _, _ + _)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Erik van Oosten (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492282#comment-14492282
 ] 

Erik van Oosten commented on SPARK-6878:


The answer is only defined because the RDD is an {{RDD[Double]}} :)

Sure, I'll make a PR.

 Sum on empty RDD fails with exception
 -

 Key: SPARK-6878
 URL: https://issues.apache.org/jira/browse/SPARK-6878
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Erik van Oosten
Priority: Minor

 {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}.
 A simple fix is to replace
 {noformat}
 class DoubleRDDFunctions {
   def sum(): Double = self.reduce(_ + _)
 {noformat} 
 with:
 {noformat}
 class DoubleRDDFunctions {
   def sum(): Double = self.aggregate(0.0)(_ + _, _ + _)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6800:
---

Assignee: (was: Apache Spark)

 Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions 
 gives incorrect results.
 --

 Key: SPARK-6800
 URL: https://issues.apache.org/jira/browse/SPARK-6800
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: Windows 8.1, Apache Derby DB, Spark 1.3.0 CDH5.4.0, 
 Scala 2.10
Reporter: Micael Capitão

 Having a Derby table with people info (id, name, age) defined like this:
 {code}
 val jdbcUrl = "jdbc:derby:memory:PeopleDB;create=true"
 val conn = DriverManager.getConnection(jdbcUrl)
 val stmt = conn.createStatement()
 stmt.execute("CREATE TABLE Person (person_id INT NOT NULL GENERATED ALWAYS AS 
 IDENTITY CONSTRAINT person_pk PRIMARY KEY, name VARCHAR(50), age INT)")
 stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Carvalho', 50)")
 stmt.execute("INSERT INTO Person(name, age) VALUES('Lurdes Pereira', 23)")
 stmt.execute("INSERT INTO Person(name, age) VALUES('Ana Rita Costa', 12)")
 stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Pereira', 32)")
 stmt.execute("INSERT INTO Person(name, age) VALUES('Miguel Costa', 15)")
 stmt.execute("INSERT INTO Person(name, age) VALUES('Anabela Sintra', 13)")
 {code}
 If I try to read that table from Spark SQL with lower/upper bounds, like this:
 {code}
 val people = sqlContext.jdbc(url = jdbcUrl, table = "Person",
   columnName = "age", lowerBound = 0, upperBound = 40, numPartitions = 10)
 people.show()
 {code}
 I get this result:
 {noformat}
 PERSON_ID NAME AGE
 3 Ana Rita Costa   12 
 5 Miguel Costa 15 
 6 Anabela Sintra   13 
 2 Lurdes Pereira   23 
 4 Armando Pereira  32 
 1 Armando Carvalho 50 
 {noformat}
 Which is wrong, considering the defined upper bound has been ignored (I get a 
 person with age 50!).
 Digging the code, I've found that in {{JDBCRelation.columnPartition}} the 
 WHERE clauses it generates are the following:
 {code}
 (0) age < 4,0
 (1) age >= 4  AND age < 8,1
 (2) age >= 8  AND age < 12,2
 (3) age >= 12 AND age < 16,3
 (4) age >= 16 AND age < 20,4
 (5) age >= 20 AND age < 24,5
 (6) age >= 24 AND age < 28,6
 (7) age >= 28 AND age < 32,7
 (8) age >= 32 AND age < 36,8
 (9) age >= 36,9
 {code}
 The last condition ignores the upper bound and the other ones may result in 
 repeated rows being read.
 Using the JdbcRDD (and converting it to a DataFrame) I would have something 
 like this:
 {code}
 val jdbcRdd = new JdbcRDD(sc, () => DriverManager.getConnection(jdbcUrl),
   "SELECT * FROM Person WHERE age >= ? and age <= ?", 0, 40, 10,
   rs => (rs.getInt(1), rs.getString(2), rs.getInt(3)))
 val people = jdbcRdd.toDF("PERSON_ID", "NAME", "AGE")
 people.show()
 {code}
 Resulting in:
 {noformat}
 PERSON_ID NAMEAGE
 3 Ana Rita Costa  12 
 5 Miguel Costa15 
 6 Anabela Sintra  13 
 2 Lurdes Pereira  23 
 4 Armando Pereira 32 
 {noformat}
 Which is correct!
 Confirming the WHERE clauses generated by the JdbcRDD in the 
 {{getPartitions}} I've found it generates the following:
 {code}
 (0) age >= 0  AND age <= 3
 (1) age >= 4  AND age <= 7
 (2) age >= 8  AND age <= 11
 (3) age >= 12 AND age <= 15
 (4) age >= 16 AND age <= 19
 (5) age >= 20 AND age <= 23
 (6) age >= 24 AND age <= 27
 (7) age >= 28 AND age <= 31
 (8) age >= 32 AND age <= 35
 (9) age >= 36 AND age <= 40
 {code}
 This is the behaviour I was expecting from the Spark SQL version. Is the 
 Spark SQL version buggy or is this some weird expected behaviour?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-04-13 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492303#comment-14492303
 ] 

Steve Loughran commented on SPARK-1537:
---

HADOOP-11826 patches the hadoop compatibility document to add timeline server 
to the list of stable APIs.

 Add integration with Yarn's Application Timeline Server
 ---

 Key: SPARK-1537
 URL: https://issues.apache.org/jira/browse/SPARK-1537
 Project: Spark
  Issue Type: New Feature
  Components: YARN
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
 Attachments: SPARK-1537.txt, spark-1573.patch


 It would be nice to have Spark integrate with Yarn's Application Timeline 
 Server (see YARN-321, YARN-1530). This would allow users running Spark on 
 Yarn to have a single place to go for all their history needs, and avoid 
 having to manage a separate service (Spark's built-in server).
 At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
 although there is still some ongoing work. But the basics are there, and I 
 wouldn't expect them to change (much) at this point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6671) Add status command for spark daemons

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6671.
--
Resolution: Fixed

Issue resolved by pull request 5327
[https://github.com/apache/spark/pull/5327]

 Add status command for spark daemons
 

 Key: SPARK-6671
 URL: https://issues.apache.org/jira/browse/SPARK-6671
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Reporter: PRADEEP CHANUMOLU
  Labels: easyfix
 Fix For: 1.4.0

   Original Estimate: 24h
  Remaining Estimate: 24h

 Currently, using the spark-daemon.sh script we can start and stop the Spark 
 daemons, but we cannot get their status. It would be nice to include a status 
 command in the spark-daemon.sh script, through which we can tell whether a 
 Spark daemon is alive or not. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6671) Add status command for spark daemons

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6671:
-
Priority: Minor  (was: Major)
Assignee: PRADEEP CHANUMOLU

 Add status command for spark daemons
 

 Key: SPARK-6671
 URL: https://issues.apache.org/jira/browse/SPARK-6671
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Reporter: PRADEEP CHANUMOLU
Assignee: PRADEEP CHANUMOLU
Priority: Minor
  Labels: easyfix
 Fix For: 1.4.0

   Original Estimate: 24h
  Remaining Estimate: 24h

 Currently, using the spark-daemon.sh script we can start and stop the Spark 
 daemons, but we cannot get their status. It would be nice to include a status 
 command in the spark-daemon.sh script, through which we can tell whether a 
 Spark daemon is alive or not. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6879) Check if the app is completed before clean it up

2015-04-13 Thread Tao Wang (JIRA)
Tao Wang created SPARK-6879:
---

 Summary: Check if the app is completed before clean it up
 Key: SPARK-6879
 URL: https://issues.apache.org/jira/browse/SPARK-6879
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Reporter: Tao Wang


Now the history server deletes directories that have expired according to their 
modification time. This is not good for long-running applications, as they might 
be deleted before they have finished.
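
A minimal sketch of the intended check (illustrative only, names made up):
{code}
object HistoryCleanerSketch {
  case class AppLog(dir: String, lastModifiedMs: Long, completed: Boolean)

  // Only completed applications are eligible for cleanup, no matter how old
  // their event-log directory looks.
  def shouldDelete(log: AppLog, nowMs: Long, maxAgeMs: Long): Boolean =
    log.completed && (nowMs - log.lastModifiedMs > maxAgeMs)
}
{code}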



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Erik van Oosten (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik van Oosten updated SPARK-6878:
---
Flags: Patch

 Sum on empty RDD fails with exception
 -

 Key: SPARK-6878
 URL: https://issues.apache.org/jira/browse/SPARK-6878
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Erik van Oosten
Priority: Minor

 {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}.
 A simple fix is to replace
 {noformat}
 class DoubleRDDFunctions {
   def sum(): Double = self.reduce(_ + _)
 {noformat} 
 with:
 {noformat}
 class DoubleRDDFunctions {
   def sum(): Double = self.aggregate(0.0)(_ + _, _ + _)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6878:
---

Assignee: (was: Apache Spark)

 Sum on empty RDD fails with exception
 -

 Key: SPARK-6878
 URL: https://issues.apache.org/jira/browse/SPARK-6878
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Erik van Oosten
Priority: Minor

 {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}.
 A simple fix is to replace
 {noformat}
 class DoubleRDDFunctions {
   def sum(): Double = self.reduce(_ + _)
 {noformat} 
 with:
 {noformat}
 class DoubleRDDFunctions {
   def sum(): Double = self.aggregate(0.0)(_ + _, _ + _)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492335#comment-14492335
 ] 

Apache Spark commented on SPARK-6878:
-

User 'erikvanoosten' has created a pull request for this issue:
https://github.com/apache/spark/pull/5489

 Sum on empty RDD fails with exception
 -

 Key: SPARK-6878
 URL: https://issues.apache.org/jira/browse/SPARK-6878
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Erik van Oosten
Priority: Minor

 {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}.
 A simple fix is to replace
 {noformat}
 class DoubleRDDFunctions {
   def sum(): Double = self.reduce(_ + _)
 {noformat} 
 with:
 {noformat}
 class DoubleRDDFunctions {
   def sum(): Double = self.aggregate(0.0)(_ + _, _ + _)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6875) Add support for Joda-time types

2015-04-13 Thread Patrick Grandjean (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Grandjean updated SPARK-6875:
-
Description: 
The need comes from the following use case:

val objs: RDD[MyClass] = [...]

val sqlC = new org.apache.spark.sql.SQLContext(sc)
import sqlC._

objs.saveAsParquetFile("parquet")

MyClass contains joda-time fields. When saving to parquet file, an exception is 
thrown (matchError in ScalaReflection.scala).

Spark SQL supports java SQL date/time types. This request is to add support for 
Joda-time types. 

It is possible to define UDTs using the @SQLUserDefinedType annotation. 
However, in addition to annotations, it would be nice to be able to add UDTs 
programmatically/dynamically.

  was:
The need comes from the following use case:

val objs: RDD[MyClass] = [...]

val sqlC = new org.apache.spark.sql.SQLContext(sc)
import sqlC._

objs.saveAsParquetFile("parquet")

MyClass contains joda-time fields. When saving to parquet file, an exception is 
thrown (matchError in ScalaReflection.scala).

Spark SQL supports java SQL date/time types. This request is to add support for 
Joda-time types. 

Another alternative would be, in addition to annotations, to be able to 
programmatically and dynamically add UDTs.


 Add support for Joda-time types
 ---

 Key: SPARK-6875
 URL: https://issues.apache.org/jira/browse/SPARK-6875
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Patrick Grandjean

 The need comes from the following use case:
 val objs: RDD[MyClass] = [...]
 val sqlC = new org.apache.spark.sql.SQLContext(sc)
 import sqlC._
 objs.saveAsParquetFile("parquet")
 MyClass contains joda-time fields. When saving to parquet file, an exception 
 is thrown (matchError in ScalaReflection.scala).
 Spark SQL supports java SQL date/time types. This request is to add support 
 for Joda-time types. 
 It is possible to define UDTs using the @SQLUserDefinedType annotation. 
 However, in addition to annotations, it would be nice to be able to add UDTs 
 programmatically/dynamically.
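
Until such support exists, one workaround is to convert the Joda field to a type 
Spark SQL already maps; a sketch with made-up class and field names:
{code}
import java.sql.Timestamp
import org.joda.time.DateTime

// Hypothetical case class standing in for MyClass, with its Joda field converted.
case class EventForSql(name: String, created: Timestamp)

object JodaWorkaroundSketch {
  def toSqlFriendly(name: String, created: DateTime): EventForSql =
    EventForSql(name, new Timestamp(created.getMillis))
}
{code}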



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6352) Supporting non-default OutputCommitter when using saveAsParquetFile

2015-04-13 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6352.
---
  Resolution: Fixed
   Fix Version/s: 1.4.0
Target Version/s: 1.4.0

Resolved by https://github.com/apache/spark/pull/5042

[~pwendell] Tried to assign this ticket to [~pllee], but couldn't put his name 
in the Assignee field. Do we need to set some privilege stuff?

 Supporting non-default OutputCommitter when using saveAsParquetFile
 ---

 Key: SPARK-6352
 URL: https://issues.apache.org/jira/browse/SPARK-6352
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.1, 1.2.1, 1.3.0
Reporter: Pei-Lun Lee
 Fix For: 1.4.0


 SPARK-3595 only handles a custom OutputCommitter for saveAsHadoopFile; it would 
 be nice to have similar behavior in saveAsParquetFile. It may be difficult to 
 have a fully customizable OutputCommitter solution, but at least adding something 
 like a DirectParquetOutputCommitter and letting users choose between it and 
 the default should be enough.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6875) Add support for Joda-time types

2015-04-13 Thread Patrick Grandjean (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Grandjean updated SPARK-6875:
-
Description: 
The need comes from the following use case:

val objs: RDD[MyClass] = [...]

val sqlC = new org.apache.spark.sql.SQLContext(sc)
import sqlC._

objs.saveAsParquetFile("parquet")

MyClass contains joda-time fields. When saving to parquet file, an exception is 
thrown (matchError in ScalaReflection.scala).

Spark SQL supports java SQL date/time types. This request is to add support for 
Joda-time types. 

Another alternative would be, in addition to annotations, to be able to 
programmatically and dynamically add UDTs.

  was:
The need comes from the following use case:

val objs: RDD[MyClass] = [...]

val sqlC = new org.apache.spark.sql.SQLContext(sc)
import sqlC._

objs.saveAsParquetFile("parquet")

MyClass contains joda-time fields. When saving to parquet file, an exception is 
thrown (matchError in ScalaReflection.scala).

Spark SQL supports java SQL date/time types. This request is to add support for 
Joda-time types.


 Add support for Joda-time types
 ---

 Key: SPARK-6875
 URL: https://issues.apache.org/jira/browse/SPARK-6875
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Patrick Grandjean

 The need comes from the following use case:
 val objs: RDD[MyClass] = [...]
 val sqlC = new org.apache.spark.sql.SQLContext(sc)
 import sqlC._
 objs.saveAsParquetFile("parquet")
 MyClass contains joda-time fields. When saving to parquet file, an exception 
 is thrown (matchError in ScalaReflection.scala).
 Spark SQL supports java SQL date/time types. This request is to add support 
 for Joda-time types. 
 Another alternative would be, in addition to annotations, to be able to 
 programmatically and dynamically add UDTs.
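
For illustration, a minimal sketch of what a Joda DateTime UDT could look like 
against the existing UserDefinedType developer API, storing the value as epoch 
milliseconds. This is a sketch under that assumption, not something Spark ships; 
today it could only be wired up via the @SQLUserDefinedType annotation, which 
cannot be placed on the third-party DateTime class, which is exactly why 
programmatic registration is being requested.

{code}
import org.apache.spark.sql.types.{DataType, LongType, UserDefinedType}
import org.joda.time.DateTime

// Sketch only: persist a Joda DateTime as a LongType (epoch milliseconds).
class JodaDateTimeUDT extends UserDefinedType[DateTime] {
  override def sqlType: DataType = LongType

  override def serialize(obj: Any): Any = obj match {
    case dt: DateTime => dt.getMillis
  }

  override def deserialize(datum: Any): DateTime = datum match {
    case millis: Long => new DateTime(millis)
  }

  override def userClass: Class[DateTime] = classOf[DateTime]
}
{code}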



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6207) YARN secure cluster mode doesn't obtain a hive-metastore token

2015-04-13 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-6207.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

 YARN secure cluster mode doesn't obtain a hive-metastore token 
 ---

 Key: SPARK-6207
 URL: https://issues.apache.org/jira/browse/SPARK-6207
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit, SQL, YARN
Affects Versions: 1.2.0, 1.2.1, 1.3.0
 Environment: YARN
Reporter: Doug Balog
 Fix For: 1.4.0


 When running a spark job, on YARN in secure mode, with --deploy-mode 
 cluster,  org.apache.spark.deploy.yarn.Client() does not obtain a delegation 
 token to the hive-metastore. Therefore any attempts to talk to the 
 hive-metastore fail with a GSSException: No valid credentials provided...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5689) Document what can be run in different YARN modes

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5689:
---

Assignee: (was: Apache Spark)

 Document what can be run in different YARN modes
 

 Key: SPARK-5689
 URL: https://issues.apache.org/jira/browse/SPARK-5689
 Project: Spark
  Issue Type: Documentation
  Components: YARN
Affects Versions: 1.1.0
Reporter: Thomas Graves

 We should document what can be run in the different YARN modes. For 
 instance, the interactive shell only works in yarn-client mode; recently, with 
 https://github.com/apache/spark/pull/3976, users can run python scripts in 
 cluster mode, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6879) Check if the app is completed before clean it up

2015-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492348#comment-14492348
 ] 

Apache Spark commented on SPARK-6879:
-

User 'WangTaoTheTonic' has created a pull request for this issue:
https://github.com/apache/spark/pull/5491

 Check if the app is completed before clean it up
 

 Key: SPARK-6879
 URL: https://issues.apache.org/jira/browse/SPARK-6879
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Reporter: Tao Wang

 Now the history server deletes directories that expire according to their 
 modification time. This is not good for long-running applications, as 
 they might be deleted before they finish.
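
A rough sketch of the proposed check, with hypothetical names (this is not the 
actual history-server code), shown only to make the intent concrete:

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// Only delete an event-log directory once the application has marked itself
// completed AND the log has aged past the retention window.
case class LogInfo(path: Path, completed: Boolean, lastModifiedMs: Long)

def cleanExpiredLogs(fs: FileSystem, logs: Seq[LogInfo], maxAgeMs: Long): Unit = {
  val now = System.currentTimeMillis()
  logs.filter(log => log.completed && (now - log.lastModifiedMs) > maxAgeMs)
      .foreach(log => fs.delete(log.path, true))
}
{code}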



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6879) Check if the app is completed before clean it up

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6879:
---

Assignee: Apache Spark

 Check if the app is completed before clean it up
 

 Key: SPARK-6879
 URL: https://issues.apache.org/jira/browse/SPARK-6879
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Reporter: Tao Wang
Assignee: Apache Spark

 Now the history server deletes directories that expire according to their 
 modification time. This is not good for long-running applications, as 
 they might be deleted before they finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6879) Check if the app is completed before clean it up

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6879:
---

Assignee: (was: Apache Spark)

 Check if the app is completed before clean it up
 

 Key: SPARK-6879
 URL: https://issues.apache.org/jira/browse/SPARK-6879
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Reporter: Tao Wang

 Now the history server deletes directories that expire according to their 
 modification time. This is not good for long-running applications, as 
 they might be deleted before they finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5689) Document what can be run in different YARN modes

2015-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492349#comment-14492349
 ] 

Apache Spark commented on SPARK-5689:
-

User 'Sephiroth-Lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5490

 Document what can be run in different YARN modes
 

 Key: SPARK-5689
 URL: https://issues.apache.org/jira/browse/SPARK-5689
 Project: Spark
  Issue Type: Documentation
  Components: YARN
Affects Versions: 1.1.0
Reporter: Thomas Graves

 We should document what can be run in the different YARN modes. For 
 instance, the interactive shell only works in yarn-client mode; recently, with 
 https://github.com/apache/spark/pull/3976, users can run python scripts in 
 cluster mode, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5689) Document what can be run in different YARN modes

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5689:
---

Assignee: Apache Spark

 Document what can be run in different YARN modes
 

 Key: SPARK-5689
 URL: https://issues.apache.org/jira/browse/SPARK-5689
 Project: Spark
  Issue Type: Documentation
  Components: YARN
Affects Versions: 1.1.0
Reporter: Thomas Graves
Assignee: Apache Spark

 We should document what can be run in the different YARN modes. For 
 instance, the interactive shell only works in yarn-client mode; recently, with 
 https://github.com/apache/spark/pull/3976, users can run python scripts in 
 cluster mode, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.

2015-04-13 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492364#comment-14492364
 ] 

Micael Capitão commented on SPARK-6800:
---

The above pull request seems to only fix the upper and lower bounds issue. There 
is still the intermediate-queries issue, which may result in repeated rows being 
fetched from the DB.

 Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions 
 gives incorrect results.
 --

 Key: SPARK-6800
 URL: https://issues.apache.org/jira/browse/SPARK-6800
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: Windows 8.1, Apache Derby DB, Spark 1.3.0 CDH5.4.0, 
 Scala 2.10
Reporter: Micael Capitão

 Having a Derby table with people info (id, name, age) defined like this:
 {code}
 val jdbcUrl = "jdbc:derby:memory:PeopleDB;create=true"
 val conn = DriverManager.getConnection(jdbcUrl)
 val stmt = conn.createStatement()
 stmt.execute("CREATE TABLE Person (person_id INT NOT NULL GENERATED ALWAYS AS 
 IDENTITY CONSTRAINT person_pk PRIMARY KEY, name VARCHAR(50), age INT)")
 stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Carvalho', 50)")
 stmt.execute("INSERT INTO Person(name, age) VALUES('Lurdes Pereira', 23)")
 stmt.execute("INSERT INTO Person(name, age) VALUES('Ana Rita Costa', 12)")
 stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Pereira', 32)")
 stmt.execute("INSERT INTO Person(name, age) VALUES('Miguel Costa', 15)")
 stmt.execute("INSERT INTO Person(name, age) VALUES('Anabela Sintra', 13)")
 {code}
 If I try to read that table from Spark SQL with lower/upper bounds, like this:
 {code}
 val people = sqlContext.jdbc(url = jdbcUrl, table = "Person",
   columnName = "age", lowerBound = 0, upperBound = 40, numPartitions = 10)
 people.show()
 {code}
 I get this result:
 {noformat}
 PERSON_ID NAME AGE
 3 Ana Rita Costa   12 
 5 Miguel Costa 15 
 6 Anabela Sintra   13 
 2 Lurdes Pereira   23 
 4 Armando Pereira  32 
 1 Armando Carvalho 50 
 {noformat}
 Which is wrong, considering the defined upper bound has been ignored (I get a 
 person with age 50!).
 Digging the code, I've found that in {{JDBCRelation.columnPartition}} the 
 WHERE clauses it generates are the following:
 {code}
 (0) age < 4,0
 (1) age >= 4  AND age < 8,1
 (2) age >= 8  AND age < 12,2
 (3) age >= 12 AND age < 16,3
 (4) age >= 16 AND age < 20,4
 (5) age >= 20 AND age < 24,5
 (6) age >= 24 AND age < 28,6
 (7) age >= 28 AND age < 32,7
 (8) age >= 32 AND age < 36,8
 (9) age >= 36,9
 {code}
 The last condition ignores the upper bound and the other ones may result in 
 repeated rows being read.
 Using the JdbcRDD (and converting it to a DataFrame) I would have something 
 like this:
 {code}
 val jdbcRdd = new JdbcRDD(sc, () => DriverManager.getConnection(jdbcUrl),
   "SELECT * FROM Person WHERE age >= ? and age <= ?", 0, 40, 10,
   rs => (rs.getInt(1), rs.getString(2), rs.getInt(3)))
 val people = jdbcRdd.toDF("PERSON_ID", "NAME", "AGE")
 people.show()
 {code}
 Resulting in:
 {noformat}
 PERSON_ID NAMEAGE
 3 Ana Rita Costa  12 
 5 Miguel Costa15 
 6 Anabela Sintra  13 
 2 Lurdes Pereira  23 
 4 Armando Pereira 32 
 {noformat}
 Which is correct!
 Confirming the WHERE clauses generated by the JdbcRDD in the 
 {{getPartitions}} I've found it generates the following:
 {code}
 (0) age >= 0  AND age <= 3
 (1) age >= 4  AND age <= 7
 (2) age >= 8  AND age <= 11
 (3) age >= 12 AND age <= 15
 (4) age >= 16 AND age <= 19
 (5) age >= 20 AND age <= 23
 (6) age >= 24 AND age <= 27
 (7) age >= 28 AND age <= 31
 (8) age >= 32 AND age <= 35
 (9) age >= 36 AND age <= 40
 {code}
 This is the behaviour I was expecting from the Spark SQL version. Is the 
 Spark SQL version buggy or is this some weird expected behaviour?
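
For reference, a sketch of one way to generate stride-based predicates that 
respect both bounds and reproduce the JdbcRDD clauses above. This is an 
illustration only, not the JDBCRelation.columnPartition code, and it assumes 
integral column values:

{code}
// Build inclusive, non-overlapping predicates clamped to both bounds.
def partitionWhereClauses(column: String, lower: Long, upper: Long,
                          numPartitions: Int): Seq[String] = {
  val stride = (upper - lower + 1) / numPartitions
  (0 until numPartitions).map { i =>
    val lo = lower + i * stride
    val hi = if (i == numPartitions - 1) upper else lo + stride - 1
    s"$column >= $lo AND $column <= $hi"
  }
}

// partitionWhereClauses("age", 0, 40, 10) yields
// "age >= 0 AND age <= 3", "age >= 4 AND age <= 7", ..., "age >= 36 AND age <= 40"
{code}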



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark

2015-04-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492346#comment-14492346
 ] 

Sean Owen commented on SPARK-4783:
--

I have a PR ready, but am testing it. I am seeing test failures but am not sure 
if they're related. You are also welcome to go ahead with a PR if you think you 
have a handle on it and I can chime in with what I know.

 System.exit() calls in SparkContext disrupt applications embedding Spark
 

 Key: SPARK-4783
 URL: https://issues.apache.org/jira/browse/SPARK-4783
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: David Semeria

 A common architectural choice for integrating Spark within a larger 
 application is to employ a gateway to handle Spark jobs. The gateway is a 
 server which contains one or more long-running sparkcontexts.
 A typical server is created with the following pseudo code:
 var continue = true
 while (continue) {
   try {
     server.run()
   } catch (e) {
     continue = log_and_examine_error(e)
   }
 }
 The problem is that sparkcontext frequently calls System.exit when it 
 encounters a problem which means the server can only be re-spawned at the 
 process level, which is much more messy than the simple code above.
 Therefore, I believe it makes sense to replace all System.exit calls in 
 sparkcontext with the throwing of a fatal error. 
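
As a concrete illustration of that gateway pattern, here is the loop above as 
self-contained Scala; GatewayServer and logAndExamineError are placeholders from 
the description, not real Spark APIs. The loop only works if server.run() 
surfaces failures as exceptions instead of SparkContext calling System.exit:

{code}
// Hypothetical sketch: nothing here is the actual Spark or gateway code.
trait GatewayServer { def run(): Unit }

def logAndExamineError(e: Throwable): Boolean = {
  System.err.println(s"gateway job failed: ${e.getMessage}")
  true  // decide here whether the error is recoverable and the loop should continue
}

def serveForever(server: GatewayServer): Unit = {
  var keepRunning = true
  while (keepRunning) {
    try {
      server.run()          // embeds one or more long-running SparkContexts
      keepRunning = false   // clean shutdown
    } catch {
      case e: Throwable => keepRunning = logAndExamineError(e)
    }
  }
}
{code}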



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4783:
---

Assignee: Apache Spark

 System.exit() calls in SparkContext disrupt applications embedding Spark
 

 Key: SPARK-4783
 URL: https://issues.apache.org/jira/browse/SPARK-4783
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: David Semeria
Assignee: Apache Spark

 A common architectural choice for integrating Spark within a larger 
 application is to employ a gateway to handle Spark jobs. The gateway is a 
 server which contains one or more long-running sparkcontexts.
 A typical server is created with the following pseudo code:
 var continue = true
 while (continue) {
   try {
     server.run()
   } catch (e) {
     continue = log_and_examine_error(e)
   }
 }
 The problem is that sparkcontext frequently calls System.exit when it 
 encounters a problem which means the server can only be re-spawned at the 
 process level, which is much more messy than the simple code above.
 Therefore, I believe it makes sense to replace all System.exit calls in 
 sparkcontext with the throwing of a fatal error. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4783:
---

Assignee: (was: Apache Spark)

 System.exit() calls in SparkContext disrupt applications embedding Spark
 

 Key: SPARK-4783
 URL: https://issues.apache.org/jira/browse/SPARK-4783
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: David Semeria

 A common architectural choice for integrating Spark within a larger 
 application is to employ a gateway to handle Spark jobs. The gateway is a 
 server which contains one or more long-running sparkcontexts.
 A typical server is created with the following pseudo code:
 var continue = true
 while (continue) {
   try {
     server.run()
   } catch (e) {
     continue = log_and_examine_error(e)
   }
 }
 The problem is that sparkcontext frequently calls System.exit when it 
 encounters a problem which means the server can only be re-spawned at the 
 process level, which is much more messy than the simple code above.
 Therefore, I believe it makes sense to replace all System.exit calls in 
 sparkcontext with the throwing of a fatal error. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark

2015-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492393#comment-14492393
 ] 

Apache Spark commented on SPARK-4783:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/5492

 System.exit() calls in SparkContext disrupt applications embedding Spark
 

 Key: SPARK-4783
 URL: https://issues.apache.org/jira/browse/SPARK-4783
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: David Semeria

 A common architectural choice for integrating Spark within a larger 
 application is to employ a gateway to handle Spark jobs. The gateway is a 
 server which contains one or more long-running sparkcontexts.
 A typical server is created with the following pseudo code:
 var continue = true
 while (continue) {
   try {
     server.run()
   } catch (e) {
     continue = log_and_examine_error(e)
   }
 }
 The problem is that sparkcontext frequently calls System.exit when it 
 encounters a problem which means the server can only be re-spawned at the 
 process level, which is much more messy than the simple code above.
 Therefore, I believe it makes sense to replace all System.exit calls in 
 sparkcontext with the throwing of a fatal error. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation

2015-04-13 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492618#comment-14492618
 ] 

Yin Huai commented on SPARK-5791:
-

[~jameszhouyi] Thank you for the update :) For Hive, it also used Parquet in 
your last run, right?

 [Spark SQL] show poor performance when multiple table do join operation
 ---

 Key: SPARK-5791
 URL: https://issues.apache.org/jira/browse/SPARK-5791
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yi Zhou
 Attachments: Physcial_Plan_Hive.txt, 
 Physcial_Plan_SparkSQL_Updated.txt, Physical_Plan.txt


 Spark SQL shows poor performance when multiple tables do join operation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6880) Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD

2015-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492622#comment-14492622
 ] 

Apache Spark commented on SPARK-6880:
-

User 'pankajarora12' has created a pull request for this issue:
https://github.com/apache/spark/pull/5494

 Spark Shutdowns with NoSuchElementException when running parallel collect on 
 cachedRDD
 --

 Key: SPARK-6880
 URL: https://issues.apache.org/jira/browse/SPARK-6880
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
 Environment: CentOs6.0, java7
Reporter: pankaj arora

 Spark Shutdowns with NoSuchElementException when running parallel collect on 
 cachedRDDs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6882) Spark ThriftServer2 Kerberos failed encountering java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]

2015-04-13 Thread Andrew Lee (JIRA)
Andrew Lee created SPARK-6882:
-

 Summary: Spark ThriftServer2 Kerberos failed encountering 
java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: 
[auth-int, auth-conf, auth]
 Key: SPARK-6882
 URL: https://issues.apache.org/jira/browse/SPARK-6882
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.3.0, 1.2.1
 Environment: * Apache Hadoop 2.4.1 with Kerberos Enabled
* Apache Hive 0.13.1
* Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97
* Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851
Reporter: Andrew Lee


When Kerberos is enabled, I get the following exceptions. 
{code}
2015-03-13 18:26:05,363 ERROR 
org.apache.hive.service.cli.thrift.ThriftCLIService 
(ThriftBinaryCLIService.java:run(93)) - Error: 
java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: 
[auth-int, auth-conf, auth]
{code}

I tried it in
* Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97
* Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851

with
* Apache Hive 0.13.1
* Apache Hadoop 2.4.1

Build command
{code}
mvn -U -X -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Phive-thriftserver 
-Dhadoop.version=2.4.1 -Dyarn.version=2.4.1 -Dhive.version=0.13.1 -DskipTests 
install
{code}

When starting Spark ThriftServer in {{yarn-client}} mode, the command to start 
thriftserver looks like this

{code}
./start-thriftserver.sh --hiveconf hive.server2.thrift.port=2 --hiveconf 
hive.server2.thrift.bind.host=$(hostname) --master yarn-client
{code}

{{hostname}} points to the current hostname of the machine I'm using.

Error message in {{spark.log}} from Spark 1.2.1 (1.2 rc1)
{code}
2015-03-13 18:26:05,363 ERROR 
org.apache.hive.service.cli.thrift.ThriftCLIService 
(ThriftBinaryCLIService.java:run(93)) - Error: 
java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: 
[auth-int, auth-conf, auth]
at org.apache.hive.service.auth.SaslQOP.fromString(SaslQOP.java:56)
at 
org.apache.hive.service.auth.HiveAuthFactory.getSaslProperties(HiveAuthFactory.java:118)
at 
org.apache.hive.service.auth.HiveAuthFactory.getAuthTransFactory(HiveAuthFactory.java:133)
at 
org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:43)
at java.lang.Thread.run(Thread.java:744)
{code}

I'm wondering if this is due to the same problem described in HIVE-8154 and 
HIVE-7620, caused by an older code base for the Spark ThriftServer?

Any insights are appreciated. Currently, I can't get Spark ThriftServer2 to run 
against a Kerberos cluster (Apache 2.4.1).

My hive-site.xml looks like the following for spark/conf.
The kerberos keytab and tgt are configured correctly, I'm able to connect to 
metastore, but the subsequent steps failed due to the exception.
{code}
<property>
  <name>hive.semantic.analyzer.factory.impl</name>
  <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
</property>
<property>
  <name>hive.metastore.execute.setugi</name>
  <value>true</value>
</property>
<property>
  <name>hive.stats.autogather</name>
  <value>false</value>
</property>
<property>
  <name>hive.session.history.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.querylog.location</name>
  <value>/tmp/home/hive/log/${user.name}</value>
</property>
<property>
  <name>hive.exec.local.scratchdir</name>
  <value>/tmp/hive/scratch/${user.name}</value>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://somehostname:9083</value>
</property>
<!-- HIVE SERVER 2 -->
<property>
  <name>hive.server2.authentication</name>
  <value>KERBEROS</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.principal</name>
  <value>***</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.keytab</name>
  <value>***</value>
</property>
<property>
  <name>hive.server2.thrift.sasl.qop</name>
  <value>auth</value>
  <description>Sasl QOP value; one of 'auth', 'auth-int' and 'auth-conf'</description>
</property>
<property>
  <name>hive.server2.enable.impersonation</name>
  <description>Enable user impersonation for HiveServer2</description>
  <value>true</value>
</property>
<!-- HIVE METASTORE -->
<property>
  <name>hive.metastore.sasl.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.metastore.kerberos.keytab.file</name>
  <value>***</value>
</property>
<property>
  <name>hive.metastore.kerberos.principal</name>
  <value>***</value>
</property>
<property>
  <name>hive.metastore.cache.pinobjtypes</name>
  <value>Table,Database,Type,FieldSchema,Order</value>
</property>
<property>
  <name>hdfs_sentinel_file</name>
  <value>***</value>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/hive</value>
</property>
<property>
  <name>hive.metastore.client.socket.timeout</name>
  <value>600</value>
</property>
<property>
  <name>hive.warehouse.subdir.inherit.perms</name>
  <value>true</value>
</property>
{code}

Here, I'm attaching more detailed logs from Spark 1.3 rc1.
{code}
2015-04-13 16:37:20,688 INFO  

[jira] [Assigned] (SPARK-6880) Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6880:
---

Assignee: Apache Spark

 Spark Shutdowns with NoSuchElementException when running parallel collect on 
 cachedRDD
 --

 Key: SPARK-6880
 URL: https://issues.apache.org/jira/browse/SPARK-6880
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
 Environment: CentOs6.0, java7
Reporter: pankaj arora
Assignee: Apache Spark

 Spark Shutdowns with NoSuchElementException when running parallel collect on 
 cachedRDDs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6880) Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6880:
---

Assignee: (was: Apache Spark)

 Spark Shutdowns with NoSuchElementException when running parallel collect on 
 cachedRDD
 --

 Key: SPARK-6880
 URL: https://issues.apache.org/jira/browse/SPARK-6880
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
 Environment: CentOs6.0, java7
Reporter: pankaj arora

 Spark Shutdowns with NoSuchElementException when running parallel collect on 
 cachedRDDs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6880) Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD

2015-04-13 Thread pankaj arora (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

pankaj arora updated SPARK-6880:

Description: 
Spark Shutdowns with NoSuchElementException when running parallel collect on 
cachedRDDs

Below is the stack trace

15/03/27 11:12:43 ERROR DAGSchedulerActorSupervisor: eventProcesserActor 
failed; shutting down SparkContext
java.util.NoSuchElementException: key not found: 28
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:808)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
at 
org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:762)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1389)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)


  was:Spark Shutdowns with NoSuchElementException when running parallel collect 
on cachedRDDs


 Spark Shutdowns with NoSuchElementException when running parallel collect on 
 cachedRDD
 --

 Key: SPARK-6880
 URL: https://issues.apache.org/jira/browse/SPARK-6880
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
 Environment: CentOs6.0, java7
Reporter: pankaj arora

 Spark Shutdowns with NoSuchElementException when running parallel collect on 
 cachedRDDs
 Below is the stack trace
 15/03/27 11:12:43 ERROR DAGSchedulerActorSupervisor: eventProcesserActor 
 failed; shutting down SparkContext
 java.util.NoSuchElementException: key not found: 28
 at scala.collection.MapLike$class.default(MapLike.scala:228)
 at scala.collection.AbstractMap.default(Map.scala:58)
 at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:808)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
 at 
 

[jira] [Commented] (SPARK-6880) Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD

2015-04-13 Thread pankaj arora (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492642#comment-14492642
 ] 

pankaj arora commented on SPARK-6880:
-

Sean,
Sorry for the missing stack trace. I have added it to the description.

 Spark Shutdowns with NoSuchElementException when running parallel collect on 
 cachedRDD
 --

 Key: SPARK-6880
 URL: https://issues.apache.org/jira/browse/SPARK-6880
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
 Environment: CentOs6.0, java7
Reporter: pankaj arora

 Spark Shutdowns with NoSuchElementException when running parallel collect on 
 cachedRDDs
 Below is the stack trace
 15/03/27 11:12:43 ERROR DAGSchedulerActorSupervisor: eventProcesserActor 
 failed; shutting down SparkContext
 java.util.NoSuchElementException: key not found: 28
 at scala.collection.MapLike$class.default(MapLike.scala:228)
 at scala.collection.AbstractMap.default(Map.scala:58)
 at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:808)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
 at 
 org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:762)
 at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1389)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
 at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at akka.actor.ActorCell.invoke(ActorCell.scala:487)
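
For context, a minimal sketch of the usage pattern named in the title. Names and 
sizes are illustrative, and an existing SparkContext sc is assumed:

{code}
import java.util.concurrent.Executors

// Several threads call collect() on the same cached RDD concurrently; this is
// the pattern under which the DAGScheduler failure above is reported.
val cached = sc.parallelize(1 to 1000000).cache()
val pool = Executors.newFixedThreadPool(4)
(1 to 4).foreach { _ =>
  pool.submit(new Runnable {
    override def run(): Unit = cached.collect()
  })
}
pool.shutdown()
{code}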



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6823) Add a model.matrix like capability to DataFrames (modelDataFrame)

2015-04-13 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492661#comment-14492661
 ] 

Shivaram Venkataraman commented on SPARK-6823:
--

I think the goal of the original JIRA on SparkR was to have a high-level API 
that'll allow users to express this. We could have this higher-level API in a 
DataFrame, or just provide a wrapper around OneHotEncoder + VectorAssembler in 
the SparkR ML integration work. The second one sounds better to me, but 
[~cafreeman] and Dan Putler have been looking at this and might be able to add 
more.
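
A rough sketch of that wrapper idea, assuming the ML feature transformers being 
added around this time (StringIndexer, OneHotEncoder, VectorAssembler); the 
column names and the input DataFrame df are illustrative:

{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// Index a categorical column, expand it into 0/1 indicator columns, then
// assemble all predictors into a single numeric feature vector.
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
val assembler = new VectorAssembler()
  .setInputCols(Array("categoryVec", "numericCol"))
  .setOutputCol("features")

val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler))
val modelMatrix = pipeline.fit(df).transform(df)
{code}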

 Add a model.matrix like capability to DataFrames (modelDataFrame)
 -

 Key: SPARK-6823
 URL: https://issues.apache.org/jira/browse/SPARK-6823
 Project: Spark
  Issue Type: New Feature
  Components: ML, SparkR
Reporter: Shivaram Venkataraman

 Currently MLlib modeling tools work only with double data. However, data 
 tables in practice often have a set of categorical fields (factors in R), 
 that need to be converted to a set of 0/1 indicator variables (making the 
 data actually used in a modeling algorithm completely numeric). In R, this is 
 handled in modeling functions using the model.matrix function. Similar 
 functionality needs to be available within Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows

2015-04-13 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492667#comment-14492667
 ] 

Cheng Lian commented on SPARK-6859:
---

[~rdblue] pointed out a fact that I missed in PARQUET-251: we need to work out 
a way to ignore (binary) min/max stats for all existing data.

So from the Spark SQL side, we have to disable filter push-down for binary columns.

 Parquet File Binary column statistics error when reuse byte[] among rows
 

 Key: SPARK-6859
 URL: https://issues.apache.org/jira/browse/SPARK-6859
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.0, 1.4.0
Reporter: Yijie Shen
Priority: Minor

 Suppose I create a dataRDD which extends RDD\[Row\], and each row is a 
 GenericMutableRow(Array(Int, Array\[Byte\])). The same Array\[Byte\] object is 
 reused among rows but has different content each time. When I convert it to a 
 dataFrame and save it as a Parquet file, the file's row group statistics (max & 
 min) for the Binary column are wrong.
 \\
 \\
 Here is the reason: in Parquet, BinaryStatistics just keeps max & min as 
 parquet.io.api.Binary references, and Spark SQL generates a new Binary 
 backed by the same Array\[Byte\] passed from the row.
 | |reference| |backed| |
 |max: Binary|-->|ByteArrayBackedBinary|-->|Array\[Byte\]|
 Therefore, each time Parquet updates the row group's statistics, max & min 
 always refer to the same Array\[Byte\], which has new content each time. When 
 Parquet decides to save them to the file, the last row's content is saved 
 as both max & min.
 \\
 \\
 It seems to be a Parquet bug, because it is Parquet's responsibility to update 
 statistics correctly. But I'm not quite sure. Should I report it as a bug in the 
 Parquet JIRA? 
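
A minimal sketch of the reuse pattern being described, assuming an existing 
SparkContext sc (illustrative only):

{code}
import java.nio.ByteBuffer
import org.apache.spark.sql.Row

// Within each partition the very same Array[Byte] instance is mutated for
// every row. Anything that keeps the reference (such as Parquet's min/max
// statistics) ends up seeing only the last row's bytes.
val buffer = new Array[Byte](8)
val rows = sc.parallelize(1 to 1000).map { i =>
  ByteBuffer.wrap(buffer).putLong(i.toLong)  // overwrite the shared buffer in place
  Row(i, buffer)                             // hand out the same reference each time
}
{code}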



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6882) Spark ThriftServer2 Kerberos failed encountering java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6882:
-
Component/s: SQL

 Spark ThriftServer2 Kerberos failed encountering 
 java.lang.IllegalArgumentException: Unknown auth type: null Allowed values 
 are: [auth-int, auth-conf, auth]
 

 Key: SPARK-6882
 URL: https://issues.apache.org/jira/browse/SPARK-6882
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1, 1.3.0
 Environment: * Apache Hadoop 2.4.1 with Kerberos Enabled
 * Apache Hive 0.13.1
 * Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97
 * Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851
Reporter: Andrew Lee

 When Kerberos is enabled, I get the following exceptions. 
 {code}
 2015-03-13 18:26:05,363 ERROR 
 org.apache.hive.service.cli.thrift.ThriftCLIService 
 (ThriftBinaryCLIService.java:run(93)) - Error: 
 java.lang.IllegalArgumentException: Unknown auth type: null Allowed values 
 are: [auth-int, auth-conf, auth]
 {code}
 I tried it in
 * Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97
 * Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851
 with
 * Apache Hive 0.13.1
 * Apache Hadoop 2.4.1
 Build command
 {code}
 mvn -U -X -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Phive-thriftserver 
 -Dhadoop.version=2.4.1 -Dyarn.version=2.4.1 -Dhive.version=0.13.1 -DskipTests 
 install
 {code}
 When starting Spark ThriftServer in {{yarn-client}} mode, the command to 
 start thriftserver looks like this
 {code}
 ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=2 --hiveconf 
 hive.server2.thrift.bind.host=$(hostname) --master yarn-client
 {code}
 {{hostname}} points to the current hostname of the machine I'm using.
 Error message in {{spark.log}} from Spark 1.2.1 (1.2 rc1)
 {code}
 2015-03-13 18:26:05,363 ERROR 
 org.apache.hive.service.cli.thrift.ThriftCLIService 
 (ThriftBinaryCLIService.java:run(93)) - Error: 
 java.lang.IllegalArgumentException: Unknown auth type: null Allowed values 
 are: [auth-int, auth-conf, auth]
 at org.apache.hive.service.auth.SaslQOP.fromString(SaslQOP.java:56)
 at 
 org.apache.hive.service.auth.HiveAuthFactory.getSaslProperties(HiveAuthFactory.java:118)
 at 
 org.apache.hive.service.auth.HiveAuthFactory.getAuthTransFactory(HiveAuthFactory.java:133)
 at 
 org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:43)
 at java.lang.Thread.run(Thread.java:744)
 {code}
 I'm wondering if this is due to the same problem described in HIVE-8154 and 
 HIVE-7620, caused by an older code base for the Spark ThriftServer?
 Any insights are appreciated. Currently, I can't get Spark ThriftServer2 to 
 run against a Kerberos cluster (Apache 2.4.1).
 My hive-site.xml looks like the following for spark/conf.
 The kerberos keytab and tgt are configured correctly, I'm able to connect to 
 metastore, but the subsequent steps failed due to the exception.
 {code}
 <property>
   <name>hive.semantic.analyzer.factory.impl</name>
   <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
 </property>
 <property>
   <name>hive.metastore.execute.setugi</name>
   <value>true</value>
 </property>
 <property>
   <name>hive.stats.autogather</name>
   <value>false</value>
 </property>
 <property>
   <name>hive.session.history.enabled</name>
   <value>true</value>
 </property>
 <property>
   <name>hive.querylog.location</name>
   <value>/tmp/home/hive/log/${user.name}</value>
 </property>
 <property>
   <name>hive.exec.local.scratchdir</name>
   <value>/tmp/hive/scratch/${user.name}</value>
 </property>
 <property>
   <name>hive.metastore.uris</name>
   <value>thrift://somehostname:9083</value>
 </property>
 <!-- HIVE SERVER 2 -->
 <property>
   <name>hive.server2.authentication</name>
   <value>KERBEROS</value>
 </property>
 <property>
   <name>hive.server2.authentication.kerberos.principal</name>
   <value>***</value>
 </property>
 <property>
   <name>hive.server2.authentication.kerberos.keytab</name>
   <value>***</value>
 </property>
 <property>
   <name>hive.server2.thrift.sasl.qop</name>
   <value>auth</value>
   <description>Sasl QOP value; one of 'auth', 'auth-int' and 'auth-conf'</description>
 </property>
 <property>
   <name>hive.server2.enable.impersonation</name>
   <description>Enable user impersonation for HiveServer2</description>
   <value>true</value>
 </property>
 <!-- HIVE METASTORE -->
 <property>
   <name>hive.metastore.sasl.enabled</name>
   <value>true</value>
 </property>
 <property>
   <name>hive.metastore.kerberos.keytab.file</name>
   <value>***</value>
 </property>
 <property>
   <name>hive.metastore.kerberos.principal</name>
   <value>***</value>
 </property>
 <property>
   <name>hive.metastore.cache.pinobjtypes</name>
   <value>Table,Database,Type,FieldSchema,Order</value>
 

[jira] [Resolved] (SPARK-6765) Turn scalastyle on for test code

2015-04-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-6765.

   Resolution: Fixed
Fix Version/s: 1.4.0

 Turn scalastyle on for test code
 

 Key: SPARK-6765
 URL: https://issues.apache.org/jira/browse/SPARK-6765
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra, Tests
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.4.0


 We should turn scalastyle on for test code. Test code should be as important 
 as main code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6880) Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD

2015-04-13 Thread pankaj arora (JIRA)
pankaj arora created SPARK-6880:
---

 Summary: Spark Shutdowns with NoSuchElementException when running 
parallel collect on cachedRDD
 Key: SPARK-6880
 URL: https://issues.apache.org/jira/browse/SPARK-6880
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
 Environment: CentOs6.0, java7
Reporter: pankaj arora
 Fix For: 1.3.2


Spark Shutdowns with NoSuchElementException when running parallel collect on 
cachedRDDs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6881) Change the checkpoint directory name from checkpoints to checkpoint

2015-04-13 Thread Hao (JIRA)
Hao created SPARK-6881:
--

 Summary: Change the checkpoint directory name from checkpoints to 
checkpoint
 Key: SPARK-6881
 URL: https://issues.apache.org/jira/browse/SPARK-6881
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 1.4.0
Reporter: Hao
Priority: Trivial


The name checkpoint (rather than checkpoints) is the one included in .gitignore



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6881) Change the checkpoint directory name from checkpoints to checkpoint

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6881:
---

Assignee: Apache Spark

 Change the checkpoint directory name from checkpoints to checkpoint
 ---

 Key: SPARK-6881
 URL: https://issues.apache.org/jira/browse/SPARK-6881
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 1.4.0
Reporter: Hao
Assignee: Apache Spark
Priority: Trivial

 The name checkpoint (rather than checkpoints) is the one included in .gitignore



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6881) Change the checkpoint directory name from checkpoints to checkpoint

2015-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492548#comment-14492548
 ] 

Apache Spark commented on SPARK-6881:
-

User 'hlin09' has created a pull request for this issue:
https://github.com/apache/spark/pull/5493

 Change the checkpoint directory name from checkpoints to checkpoint
 ---

 Key: SPARK-6881
 URL: https://issues.apache.org/jira/browse/SPARK-6881
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 1.4.0
Reporter: Hao
Priority: Trivial

 The name checkpoint (rather than checkpoints) is the one included in .gitignore



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6881) Change the checkpoint directory name from checkpoints to checkpoint

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6881:
---

Assignee: (was: Apache Spark)

 Change the checkpoint directory name from checkpoints to checkpoint
 ---

 Key: SPARK-6881
 URL: https://issues.apache.org/jira/browse/SPARK-6881
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 1.4.0
Reporter: Hao
Priority: Trivial

 The name checkpoint (rather than checkpoints) is the one included in .gitignore



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6880) Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6880:
-
Target Version/s:   (was: 1.3.2)
   Fix Version/s: (was: 1.3.2)

(Don't assign Target / Fix Version)

This is not a valid JIRA, as there is no detail. If you intend to add detail 
later, OK, but please next time wait until you have all of that information 
ready before opening a JIRA. Otherwise I'm going to close this.

 Spark Shutdowns with NoSuchElementException when running parallel collect on 
 cachedRDD
 --

 Key: SPARK-6880
 URL: https://issues.apache.org/jira/browse/SPARK-6880
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
 Environment: CentOs6.0, java7
Reporter: pankaj arora

 Spark Shutdowns with NoSuchElementException when running parallel collect on 
 cachedRDDs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



<    1   2