[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493350#comment-14493350 ] Yi Zhou commented on SPARK-5791: [~yhuai], yes, both used Parquet. [Spark SQL] show poor performance when multiple table do join operation --- Key: SPARK-5791 URL: https://issues.apache.org/jira/browse/SPARK-5791 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yi Zhou Attachments: Physcial_Plan_Hive.txt, Physcial_Plan_SparkSQL_Updated.txt, Physical_Plan.txt Spark SQL shows poor performance when multiple tables do join operations -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4638) Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries
[ https://issues.apache.org/jira/browse/SPARK-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4638: --- Assignee: Apache Spark Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries --- Key: SPARK-4638 URL: https://issues.apache.org/jira/browse/SPARK-4638 Project: Spark Issue Type: New Feature Components: MLlib Reporter: madankumar s Assignee: Apache Spark Labels: Gaussian, Kernels, SVM Attachments: kernels-1.3.patch SPARK MLlib Classification Module: Add Kernel functionalities to SVM Classifier to find non linear patterns -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
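For context on what such a kernel looks like, here is a minimal Scala sketch of an RBF (Gaussian) kernel. This is illustrative only and not taken from the attached kernels-1.3.patch; the parameter name gamma is an assumption:
{code}
import org.apache.spark.mllib.linalg.Vector

// Gaussian (RBF) kernel: K(x, y) = exp(-gamma * ||x - y||^2).
// A non-linear SVM replaces plain dot products with this similarity.
def rbfKernel(x: Vector, y: Vector, gamma: Double): Double = {
  var sqDist = 0.0
  var i = 0
  while (i < x.size) {
    val d = x(i) - y(i)
    sqDist += d * d
    i += 1
  }
  math.exp(-gamma * sqDist)
}
{code}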
[jira] [Commented] (SPARK-4638) Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries
[ https://issues.apache.org/jira/browse/SPARK-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493347#comment-14493347 ] Apache Spark commented on SPARK-4638: - User 'mandar2812' has created a pull request for this issue: https://github.com/apache/spark/pull/5503 Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries --- Key: SPARK-4638 URL: https://issues.apache.org/jira/browse/SPARK-4638 Project: Spark Issue Type: New Feature Components: MLlib Reporter: madankumar s Labels: Gaussian, Kernels, SVM Attachments: kernels-1.3.patch SPARK MLlib Classification Module: Add Kernel functionalities to SVM Classifier to find non linear patterns -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4638) Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries
[ https://issues.apache.org/jira/browse/SPARK-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4638: --- Assignee: (was: Apache Spark) Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries --- Key: SPARK-4638 URL: https://issues.apache.org/jira/browse/SPARK-4638 Project: Spark Issue Type: New Feature Components: MLlib Reporter: madankumar s Labels: Gaussian, Kernels, SVM Attachments: kernels-1.3.patch SPARK MLlib Classification Module: Add Kernel functionalities to SVM Classifier to find non linear patterns -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4766) ML Estimator Params should subclass Transformer Params
[ https://issues.apache.org/jira/browse/SPARK-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-4766: - Description: Currently, in spark.ml, both Transformers and Estimators extend the same Params classes. There should be one Params class for the Transformer and one for the Estimator, where the Estimator params class extends the Transformer one. E.g., it is weird to be able to do:
{code}
val model: LogisticRegressionModel = ...
model.getMaxIter()
{code}
It's also weird to be able to: * Wrap LogisticRegressionModel (a Transformer) with CrossValidator * Pass a set of ParamMaps to CrossValidator which includes parameter LogisticRegressionModel.maxIter * (CrossValidator would try to set that parameter.) * I'm not sure if this would cause a failure or just be a noop. was: Currently, in spark.ml, both Transformers and Estimators extend the same Params classes. There should be one Params class for the Transformer and one for the Estimator, where the Estimator params class extends the Transformer one. E.g., it is weird to be able to do:
{code}
val model: LogisticRegressionModel = ...
model.getMaxIter()
{code}
(This is the only case where this happens currently, but it is worth setting a precedent.) ML Estimator Params should subclass Transformer Params -- Key: SPARK-4766 URL: https://issues.apache.org/jira/browse/SPARK-4766 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Currently, in spark.ml, both Transformers and Estimators extend the same Params classes. There should be one Params class for the Transformer and one for the Estimator, where the Estimator params class extends the Transformer one. E.g., it is weird to be able to do:
{code}
val model: LogisticRegressionModel = ...
model.getMaxIter()
{code}
It's also weird to be able to: * Wrap LogisticRegressionModel (a Transformer) with CrossValidator * Pass a set of ParamMaps to CrossValidator which includes parameter LogisticRegressionModel.maxIter * (CrossValidator would try to set that parameter.) * I'm not sure if this would cause a failure or just be a noop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
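A minimal sketch of the hierarchy the description argues for, with illustrative trait names (not the actual spark.ml code):
{code}
// Transformer-side params: only what a fitted model needs at prediction time.
trait LogisticRegressionModelParams {
  def getThreshold: Double
}

// Estimator-side params extend the transformer-side ones and add
// training-only parameters such as maxIter.
trait LogisticRegressionTrainParams extends LogisticRegressionModelParams {
  def getMaxIter: Int
}

// The model mixes in only the transformer-side params, so
// model.getMaxIter() no longer compiles -- which is the point.
class LogisticRegressionModel extends LogisticRegressionModelParams {
  override def getThreshold: Double = 0.5
}
{code}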
[jira] [Created] (SPARK-6892) Recovery from checkpoint will also reuse the application id when write eventLog in yarn-cluster mode
yangping wu created SPARK-6892: -- Summary: Recovery from checkpoint will also reuse the application id when write eventLog in yarn-cluster mode Key: SPARK-6892 URL: https://issues.apache.org/jira/browse/SPARK-6892 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: yangping wu Priority: Critical When I recover from a checkpoint in yarn-cluster mode using Spark Streaming, I found that it reuses the application id from before the recovery (in my case application_1428664056212_0016) when writing the Spark event log. But the new application id is application_1428664056212_0017, so writing the event log fails with the following stack trace:
{code}
15/04/14 10:14:01 WARN util.ShutdownHookManager: ShutdownHook '$anon$3' failed, java.io.IOException: Target log file already exists (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
java.io.IOException: Target log file already exists (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
 at org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:201)
 at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
 at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
 at scala.Option.foreach(Option.scala:236)
 at org.apache.spark.SparkContext.stop(SparkContext.scala:1388)
 at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:107)
 at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6893) Better handling of pipeline parameters in PySpark
Xiangrui Meng created SPARK-6893: Summary: Better handling of pipeline parameters in PySpark Key: SPARK-6893 URL: https://issues.apache.org/jira/browse/SPARK-6893 Project: Spark Issue Type: Sub-task Components: PySpark Reporter: Xiangrui Meng Assignee: Xiangrui Meng This is SPARK-5957 for Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6511) Publish hadoop provided build with instructions for different distros
[ https://issues.apache.org/jira/browse/SPARK-6511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493585#comment-14493585 ] Kannan Rajah commented on SPARK-6511: - As requested by Patrick, here is an example of what we use in spark-env.sh for the MapR distribution:
{code}
MAPR_HADOOP_CLASSPATH=`hadoop classpath`
MAPR_SPARK_CLASSPATH=$MAPR_HADOOP_CLASSPATH:$MAPR_HADOOP_HBASE_VERSION
MAPR_HADOOP_JNI_PATH=`hadoop jnipath`
export SPARK_LIBRARY_PATH=$MAPR_HADOOP_JNI_PATH
SPARK_SUBMIT_CLASSPATH=$SPARK_SUBMIT_CLASSPATH:$MAPR_SPARK_CLASSPATH
SPARK_SUBMIT_LIBRARY_PATH=$SPARK_SUBMIT_LIBRARY_PATH:$MAPR_HADOOP_JNI_PATH
export SPARK_SUBMIT_CLASSPATH
export SPARK_SUBMIT_LIBRARY_PATH
{code}
Publish hadoop provided build with instructions for different distros --- Key: SPARK-6511 URL: https://issues.apache.org/jira/browse/SPARK-6511 Project: Spark Issue Type: Improvement Components: Build Reporter: Patrick Wendell Currently we publish a series of binaries with different Hadoop client jars. This mostly works, but some users have reported compatibility issues with different distributions. One improvement moving forward might be to publish a binary build that simply asks you to set HADOOP_HOME to pick up the Hadoop client location. That way it would work across multiple distributions, even if they have subtle incompatibilities with upstream Hadoop. I think a first step for this would be to produce such a build for the community and see how well it works. One potential issue is that our fancy excludes and dependency re-writing won't work with the simpler "append Hadoop's classpath to Spark" approach. Also, how we deal with the Hive dependency is unclear, i.e. should we continue to bundle Spark's Hive (which has some fixes for dependency conflicts) or do we allow for linking against vanilla Hive at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6511) Publish hadoop provided build with instructions for different distros
[ https://issues.apache.org/jira/browse/SPARK-6511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493592#comment-14493592 ] Kannan Rajah commented on SPARK-6511: - [~pwendell] Just wanted to let you know that we also have a way to add hive and hbase jars to the classpath. This is useful when a setup has multiple versions of hive and hbase installed, but a given Spark version will only work with a specific version. We have some utility scripts to generate the right classpath entries based on a supported version of hive and hbase. If you think this will be useful in the Apache distribution, I can create a JIRA and share the code. At a high level, there are 3 files:
- compatibility.version: File that holds the supported versions for each ecosystem component, e.g.:
  hive_versions=0.13,0.12
  hbase_versions=0.98
- compatible_version.sh: Returns the compatible version for a component by looking up the compatibility.version file. The first version that is available on the node is used.
- generate_classpath.sh: Uses the above 2 files to generate the classpath. This script is used in spark-env.sh to generate the classpath based on hive and hbase.
Publish hadoop provided build with instructions for different distros --- Key: SPARK-6511 URL: https://issues.apache.org/jira/browse/SPARK-6511 Project: Spark Issue Type: Improvement Components: Build Reporter: Patrick Wendell Currently we publish a series of binaries with different Hadoop client jars. This mostly works, but some users have reported compatibility issues with different distributions. One improvement moving forward might be to publish a binary build that simply asks you to set HADOOP_HOME to pick up the Hadoop client location. That way it would work across multiple distributions, even if they have subtle incompatibilities with upstream Hadoop. I think a first step for this would be to produce such a build for the community and see how well it works. One potential issue is that our fancy excludes and dependency re-writing won't work with the simpler "append Hadoop's classpath to Spark" approach. Also, how we deal with the Hive dependency is unclear, i.e. should we continue to bundle Spark's Hive (which has some fixes for dependency conflicts) or do we allow for linking against vanilla Hive at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5924) Add the ability to specify withMean or withStd parameters with StandardScaler
[ https://issues.apache.org/jira/browse/SPARK-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5924: --- Assignee: (was: Apache Spark) Add the ability to specify withMean or withStd parameters with StandardScaler Key: SPARK-5924 URL: https://issues.apache.org/jira/browse/SPARK-5924 Project: Spark Issue Type: Improvement Components: ML Reporter: Jao Rabary Priority: Trivial The current implementation of StandardScaler calls the mllib.feature.StandardScaler default constructor directly, without offering the ability to specify the withMean or withStd parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
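A sketch of what exposing those parameters could look like. The setter names below (setWithMean, setWithStd) are assumptions, not the final API; only the two-argument mllib.feature.StandardScaler constructor is existing API:
{code}
import org.apache.spark.mllib.feature.{StandardScaler => MLlibStandardScaler}

// Hypothetical builder that forwards withMean/withStd to the underlying
// mllib StandardScaler instead of always using its defaults.
class ConfigurableStandardScaler {
  private var withMean: Boolean = false
  private var withStd: Boolean = true

  def setWithMean(value: Boolean): this.type = { withMean = value; this }
  def setWithStd(value: Boolean): this.type = { withStd = value; this }

  def build(): MLlibStandardScaler = new MLlibStandardScaler(withMean, withStd)
}
{code}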
[jira] [Commented] (SPARK-6847) Stack overflow on updateStateByKey which followed by a dstream with checkpoint set
[ https://issues.apache.org/jira/browse/SPARK-6847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493517#comment-14493517 ] Jack Hu commented on SPARK-6847: Here is part of the stack (full stack at: https://gist.github.com/jhu-chang/38a6c052aff1d666b785):
{quote}
15/04/14 11:28:20 [Executor task launch worker-1] ERROR org.apache.spark.executor.Executor: Exception in task 1.0 in stage 27554.0 (TID 3801)
java.lang.StackOverflowError
 at java.io.ObjectStreamClass.setPrimFieldValues(ObjectStreamClass.java:1243)
 at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1984)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
 at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
 at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
 at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
 at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
 at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:483)
 at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896)
 at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
 at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
 at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
 at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
 at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
 at scala.collection.immutable.$colon$colon.readObject(List.scala:366)
 at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:483)
 at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896)
 at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
 at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
 at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
 at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
 at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
 at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
 at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:483)
{quote}
Stack overflow on updateStateByKey which followed by a dstream with checkpoint set -- Key: SPARK-6847 URL: https://issues.apache.org/jira/browse/SPARK-6847 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Jack Hu Labels: StackOverflowError, Streaming The issue happens with the following sample code: uses {{updateStateByKey}} followed by a {{map}} with checkpoint interval 10 seconds
[jira] [Updated] (SPARK-6892) Recovery from checkpoint will also reuse the application id when write eventLog in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yangping wu updated SPARK-6892: --- Description: When I recover from a checkpoint in yarn-cluster mode using Spark Streaming, I found that it reuses the application id from before the recovery (in my case application_1428664056212_0016) when writing the Spark event log. But the new application id is application_1428664056212_0017, so writing the event log fails with the following stack trace:
{code}
15/04/14 10:14:01 WARN util.ShutdownHookManager: ShutdownHook '$anon$3' failed, java.io.IOException: Target log file already exists (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
java.io.IOException: Target log file already exists (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
 at org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:201)
 at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
 at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
 at scala.Option.foreach(Option.scala:236)
 at org.apache.spark.SparkContext.stop(SparkContext.scala:1388)
 at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:107)
 at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
{code}
This exception will cause the job to fail. was: When I recover from a checkpoint in yarn-cluster mode using Spark Streaming, I found that it reuses the application id from before the recovery (in my case application_1428664056212_0016) when writing the Spark event log. But the new application id is application_1428664056212_0017, so writing the event log fails with the following stack trace:
{code}
15/04/14 10:14:01 WARN util.ShutdownHookManager: ShutdownHook '$anon$3' failed, java.io.IOException: Target log file already exists (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
java.io.IOException: Target log file already exists (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
 at org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:201)
 at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
 at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
 at scala.Option.foreach(Option.scala:236)
 at org.apache.spark.SparkContext.stop(SparkContext.scala:1388)
 at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:107)
 at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
{code}
Recovery from checkpoint will also reuse the application id when write eventLog in yarn-cluster mode Key: SPARK-6892 URL: https://issues.apache.org/jira/browse/SPARK-6892 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: yangping wu Priority: Critical When I recover from a checkpoint in yarn-cluster mode using Spark Streaming, I found that it reuses the application id from before the recovery (in my case application_1428664056212_0016) when writing the Spark event log. But the new application id is application_1428664056212_0017, so writing the event log fails with the following stack trace:
{code}
15/04/14 10:14:01 WARN util.ShutdownHookManager: ShutdownHook '$anon$3' failed, java.io.IOException: Target log file already exists (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
java.io.IOException: Target log file already exists (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
 at org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:201)
 at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
 at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
 at scala.Option.foreach(Option.scala:236)
 at org.apache.spark.SparkContext.stop(SparkContext.scala:1388)
 at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:107)
 at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
{code}
This exception will cause the job to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5957) Better handling of default parameter values.
[ https://issues.apache.org/jira/browse/SPARK-5957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5957. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5431 [https://github.com/apache/spark/pull/5431] Better handling of default parameter values. Key: SPARK-5957 URL: https://issues.apache.org/jira/browse/SPARK-5957 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.4.0 We store the default value of a parameter in the Param instance. In many cases, the default value depends on the algorithm and other parameters defined in the same algorithm. We need to think of a better approach to handle default parameter values. The design doc was posted in the parent JIRA: https://issues.apache.org/jira/browse/SPARK-5874 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6884) random forest predict probabilities functionality (like in sklearn)
[ https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492868#comment-14492868 ] Max Kaznady commented on SPARK-6884: Implemented a prototype, testing mapReduce code. random forest predict probabilities functionality (like in sklearn) --- Key: SPARK-6884 URL: https://issues.apache.org/jira/browse/SPARK-6884 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.4.0 Environment: cross-platform Reporter: Max Kaznady Labels: prediction, probability, randomforest, tree Original Estimate: 72h Remaining Estimate: 72h Currently, there is no way to extract the class probabilities from the RandomForest classifier. I implemented a probability predictor by counting votes from individual trees, adding up their votes for class 1, and then dividing by the total number of votes. I opened this ticket to keep track of changes. Will update once I push my code to master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
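The vote-counting scheme the reporter describes, as a rough sketch over the existing mllib RandomForestModel (the trees field and per-tree predict are real API; this aggregation is the reporter's described approach, not committed code):
{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.RandomForestModel

// Estimate P(label = 1) for binary classification: each tree votes 0.0 or
// 1.0, and the fraction of 1-votes is the class probability.
def predictProbability(model: RandomForestModel, features: Vector): Double = {
  val votesForOne = model.trees.count(_.predict(features) == 1.0)
  votesForOne.toDouble / model.trees.length
}
{code}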
[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality
[ https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492871#comment-14492871 ] Max Kaznady commented on SPARK-3727: I thought it would be more fitting to separate this: https://issues.apache.org/jira/browse/SPARK-6884 DecisionTree, RandomForest: More prediction functionality - Key: SPARK-3727 URL: https://issues.apache.org/jira/browse/SPARK-3727 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley DecisionTree and RandomForest currently predict the most likely label for classification and the mean for regression. Other info about predictions would be useful. For classification: estimated probability of each possible label For regression: variance of estimate RandomForest could also create aggregate predictions in multiple ways: * Predict mean or median value for regression. * Compute variance of estimates (across all trees) for both classification and regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
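For the regression items above, variance of the estimate across trees could be computed along these lines (a sketch against mllib's RandomForestModel, not committed API):
{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.RandomForestModel

// Return (mean, variance) of the per-tree regression predictions; the
// variance is a rough measure of the ensemble's uncertainty.
def predictWithVariance(model: RandomForestModel, features: Vector): (Double, Double) = {
  val preds = model.trees.map(_.predict(features))
  val mean = preds.sum / preds.length
  val variance = preds.map(p => (p - mean) * (p - mean)).sum / preds.length
  (mean, variance)
}
{code}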
[jira] [Updated] (SPARK-6884) Random forest: predict class probabilities
[ https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6884: - Summary: Random forest: predict class probabilities (was: random forest predict probabilities functionality (like in sklearn)) Random forest: predict class probabilities -- Key: SPARK-6884 URL: https://issues.apache.org/jira/browse/SPARK-6884 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Environment: cross-platform Reporter: Max Kaznady Labels: prediction, probability, randomforest, tree Original Estimate: 72h Remaining Estimate: 72h Currently, there is no way to extract the class probabilities from the RandomForest classifier. I implemented a probability predictor by counting votes from individual trees, adding up their votes for class 1, and then dividing by the total number of votes. I opened this ticket to keep track of changes. Will update once I push my code to master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6884) random forest predict probabilities functionality (like in sklearn)
[ https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6884: - Issue Type: Sub-task (was: New Feature) Parent: SPARK-3727 random forest predict probabilities functionality (like in sklearn) --- Key: SPARK-6884 URL: https://issues.apache.org/jira/browse/SPARK-6884 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Environment: cross-platform Reporter: Max Kaznady Labels: prediction, probability, randomforest, tree Original Estimate: 72h Remaining Estimate: 72h Currently, there is no way to extract the class probabilities from the RandomForest classifier. I implemented a probability predictor by counting votes from individual trees, adding up their votes for class 1, and then dividing by the total number of votes. I opened this ticket to keep track of changes. Will update once I push my code to master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6883) Fork pyspark's cloudpickle as a separate dependency
Kyle Kelley created SPARK-6883: -- Summary: Fork pyspark's cloudpickle as a separate dependency Key: SPARK-6883 URL: https://issues.apache.org/jira/browse/SPARK-6883 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Kyle Kelley IPython, pyspark, and picloud/multyvac/cloudpipe all rely on cloudpickle from various sources (cloud, pyspark, and multyvac respectively). It would be great to have this as a separately maintained project that can: * Work with Python3 * Add tests! * Use higher order pickling (when on Python3) * Be installed with pip We're starting this off at the PyCon sprints under https://github.com/cloudpipe/cloudpickle. We'd like to coordinate with PySpark to make it work across all the above mentioned projects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6352) Supporting non-default OutputCommitter when using saveAsParquetFile
[ https://issues.apache.org/jira/browse/SPARK-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-6352: -- Assignee: Pei-Lun Lee Supporting non-default OutputCommitter when using saveAsParquetFile --- Key: SPARK-6352 URL: https://issues.apache.org/jira/browse/SPARK-6352 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1, 1.2.1, 1.3.0 Reporter: Pei-Lun Lee Assignee: Pei-Lun Lee Fix For: 1.4.0 SPARK-3595 only handles custom OutputCommitter for saveAsHadoopFile; it would be nice to have similar behavior in saveAsParquetFile. It may be difficult to have a fully customizable OutputCommitter solution, but at least adding something like a DirectParquetOutputCommitter and letting users choose between this and the default should be enough. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
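For comparison, this is roughly what SPARK-3595 enables for saveAsHadoopFile via the standard Hadoop JobConf; the ask here is an equivalent hook for saveAsParquetFile. FileOutputCommitter stands in for a custom committer below, since DirectParquetOutputCommitter is only a proposed name:
{code}
import org.apache.hadoop.mapred.{FileOutputCommitter, JobConf}

// A custom OutputCommitter can be wired into saveAsHadoopFile through the
// JobConf; Parquet output currently has no equivalent extension point.
val jobConf = new JobConf()
jobConf.setOutputCommitter(classOf[FileOutputCommitter])
// rdd.saveAsHadoopFile(path, keyClass, valueClass, outputFormatClass, jobConf)
{code}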
[jira] [Commented] (SPARK-6352) Supporting non-default OutputCommitter when using saveAsParquetFile
[ https://issues.apache.org/jira/browse/SPARK-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492699#comment-14492699 ] Josh Rosen commented on SPARK-6352: --- [~lian cheng], we can only assign tickets to users who have the proper role in Spark's JIRA permissions. I've added [~pllee] to the Contributors role and will assign this ticket to them. Supporting non-default OutputCommitter when using saveAsParquetFile --- Key: SPARK-6352 URL: https://issues.apache.org/jira/browse/SPARK-6352 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1, 1.2.1, 1.3.0 Reporter: Pei-Lun Lee Fix For: 1.4.0 SPARK-3595 only handles custom OutputCommitter for saveAsHadoopFile; it would be nice to have similar behavior in saveAsParquetFile. It may be difficult to have a fully customizable OutputCommitter solution, but at least adding something like a DirectParquetOutputCommitter and letting users choose between this and the default should be enough. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5888) Add OneHotEncoder as a Transformer
[ https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza reassigned SPARK-5888: - Assignee: Sandy Ryza Add OneHotEncoder as a Transformer -- Key: SPARK-5888 URL: https://issues.apache.org/jira/browse/SPARK-5888 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng Assignee: Sandy Ryza `OneHotEncoder` takes a categorical column and outputs a vector column, which stores the category info in binary form.
{code}
val ohe = new OneHotEncoder()
  .setInputCol("countryIndex")
  .setOutputCol("countries")
{code}
It should read the category info from the metadata and assign feature names properly in the output column. We need to discuss the default naming scheme and whether we should let it process multiple categorical columns at the same time. One category (the most frequent one) should be removed from the output to make the output columns linearly independent. Or this could be an option turned on by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
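The core transformation is small: map a category index to a sparse indicator vector, optionally dropping one category to keep the columns linearly independent. A hand-rolled sketch, not the eventual spark.ml implementation; which category to drop (e.g. the most frequent) is the open design question above, and this sketch drops the last index for simplicity:
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// One-hot encode a 0-based category index out of `numCategories`.
// With dropLast = true, the dropped category maps to the all-zeros vector.
def oneHot(idx: Int, numCategories: Int, dropLast: Boolean = true): Vector = {
  val size = if (dropLast) numCategories - 1 else numCategories
  if (idx < size) Vectors.sparse(size, Array(idx), Array(1.0))
  else Vectors.sparse(size, Array.empty[Int], Array.empty[Double])
}
{code}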
[jira] [Resolved] (SPARK-6849) The constructor of GradientDescent should be public
[ https://issues.apache.org/jira/browse/SPARK-6849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6849. -- Resolution: Duplicate Yes, I think this is a subset of opening up optimization APIs The constructor of GradientDescent should be public --- Key: SPARK-6849 URL: https://issues.apache.org/jira/browse/SPARK-6849 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.1 Reporter: Guoqiang Li Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5632) not able to resolve dot('.') in field name
[ https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5632: --- Description: My Cassandra table task_trace has a field sm.result which contains a dot in its name, so Spark SQL tried to look up sm instead of the full name 'sm.result'. Here is my code:
{code}
scala> import org.apache.spark.sql.cassandra.CassandraSQLContext
scala> val cc = new CassandraSQLContext(sc)
scala> val task_trace = cc.jsonFile("/task_trace.json")
scala> task_trace.registerTempTable("task_trace")
scala> cc.setKeyspace("cerberus_data_v4")
scala> val res = cc.sql("SELECT received_datetime, task_body.cerberus_id, task_body.sm.result FROM task_trace WHERE task_id = 'fff7304e-9984-4b45-b10c-0423a96745ce'")
res: org.apache.spark.sql.SchemaRDD = SchemaRDD[57] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
java.lang.RuntimeException: No such struct field sm in cerberus_batch_id, cerberus_id, couponId, coupon_code, created, description, domain, expires, message_id, neverShowAfter, neverShowBefore, offerTitle, screenshots, sm.result, sm.task, startDate, task_id, url, uuid, validationDateTime, validity
{code}
The full schema looks like this:
{code}
scala> task_trace.printSchema()
root
 |-- received_datetime: long (nullable = true)
 |-- task_body: struct (nullable = true)
 |    |-- cerberus_batch_id: string (nullable = true)
 |    |-- cerberus_id: string (nullable = true)
 |    |-- couponId: integer (nullable = true)
 |    |-- coupon_code: string (nullable = true)
 |    |-- created: string (nullable = true)
 |    |-- description: string (nullable = true)
 |    |-- domain: string (nullable = true)
 |    |-- expires: string (nullable = true)
 |    |-- message_id: string (nullable = true)
 |    |-- neverShowAfter: string (nullable = true)
 |    |-- neverShowBefore: string (nullable = true)
 |    |-- offerTitle: string (nullable = true)
 |    |-- screenshots: array (nullable = true)
 |    |    |-- element: string (containsNull = false)
 |    |-- sm.result: struct (nullable = true)
 |    |    |-- cerberus_batch_id: string (nullable = true)
 |    |    |-- cerberus_id: string (nullable = true)
 |    |    |-- code: string (nullable = true)
 |    |    |-- couponId: integer (nullable = true)
 |    |    |-- created: string (nullable = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- domain: string (nullable = true)
 |    |    |-- expires: string (nullable = true)
 |    |    |-- message_id: string (nullable = true)
 |    |    |-- neverShowAfter: string (nullable = true)
 |    |    |-- neverShowBefore: string (nullable = true)
 |    |    |-- offerTitle: string (nullable = true)
 |    |    |-- result: struct (nullable = true)
 |    |    |    |-- post: struct (nullable = true)
 |    |    |    |    |-- alchemy_out_of_stock: struct (nullable = true)
 |    |    |    |    |    |-- ci: double (nullable = true)
 |    |    |    |    |    |-- value: boolean (nullable = true)
 |    |    |    |    |-- meta: struct (nullable = true)
 |    |    |    |    |    |-- None_tx_value: array (nullable = true)
 |    |    |    |    |    |    |-- element: string (containsNull = false)
 |    |    |    |    |    |-- exceptions: array (nullable = true)
 |    |    |    |    |    |    |-- element: string (containsNull = false)
 |    |    |    |    |    |-- no_input_value: array (nullable = true)
 |    |    |    |    |    |    |-- element: string (containsNull = false)
 |    |    |    |    |    |-- not_mapped: array (nullable = true)
 |    |    |    |    |    |    |-- element: string (containsNull = false)
 |    |    |    |    |    |-- not_transformed: array (nullable = true)
 |    |    |    |    |    |    |-- element: array (containsNull = false)
 |    |    |    |    |    |    |    |-- element: string (containsNull = false)
 |    |    |    |    |-- now_price_checkout: struct (nullable = true)
 |    |    |    |    |    |-- ci: double (nullable = true)
 |    |    |    |    |    |-- value: double (nullable = true)
 |    |    |    |    |-- shipping_price: struct (nullable = true)
 |    |    |    |    |    |-- ci: double (nullable = true)
 |    |    |    |    |    |-- value: double (nullable = true)
 |    |    |    |    |-- tax: struct (nullable = true)
 |    |    |    |    |    |-- ci: double (nullable = true)
 |    |    |    |    |    |-- value: double (nullable = true)
 |    |    |    |    |-- total: struct (nullable = true)
 |    |    |    |    |    |-- ci: double (nullable = true)
 |    |    |    |    |    |-- value: double (nullable = true)
 |    |    |    |-- pre: struct (nullable = true)
 |    |    |    |    |-- alchemy_out_of_stock: struct (nullable = true)
{code}
[jira] [Commented] (SPARK-6847) Stack overflow on updateStateByKey which followed by a dstream with checkpoint set
[ https://issues.apache.org/jira/browse/SPARK-6847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492126#comment-14492126 ] Sean Owen commented on SPARK-6847: -- Can you provide (the top part of) the stack overflow stack, so we can see where it's occurring? I suspect something is building a very long object graph, but confirming that is the first step. Stack overflow on updateStateByKey which followed by a dstream with checkpoint set -- Key: SPARK-6847 URL: https://issues.apache.org/jira/browse/SPARK-6847 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Jack Hu Labels: StackOverflowError, Streaming The issue happens with the following sample code: uses {{updateStateByKey}} followed by a {{map}} with checkpoint interval 10 seconds
{code}
val sparkConf = new SparkConf().setAppName("test")
val streamingContext = new StreamingContext(sparkConf, Seconds(10))
streamingContext.checkpoint("checkpoint")
val source = streamingContext.socketTextStream("localhost", )
val updatedResult = source.map((1, _)).updateStateByKey(
  (newlist: Seq[String], oldstate: Option[String]) => newlist.headOption.orElse(oldstate))
updatedResult.map(_._2)
  .checkpoint(Seconds(10))
  .foreachRDD((rdd, t) => {
    println("Deep: " + rdd.toDebugString.split("\n").length)
    println(t.toString() + ": " + rdd.collect.length)
  })
streamingContext.start()
streamingContext.awaitTermination()
{code}
From the output, we can see that the dependency chain keeps growing over time, the {{updateStateByKey}} never gets checkpointed, and finally the stack overflow happens. Note: * The rdd in {{updatedResult.map(_._2)}} gets checkpointed in this case, but not the {{updateStateByKey}} * If we remove the {{checkpoint(Seconds(10))}} from the map result ( {{updatedResult.map(_._2)}} ), the stack overflow will not happen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6303) Remove unnecessary Average in GeneratedAggregate
[ https://issues.apache.org/jira/browse/SPARK-6303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-6303: --- Summary: Remove unnecessary Average in GeneratedAggregate (was: Average should be in canBeCodeGened list) Remove unnecessary Average in GeneratedAggregate Key: SPARK-6303 URL: https://issues.apache.org/jira/browse/SPARK-6303 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor Currently canBeCodeGened only checks Sum, Count, Max, CombineSetsAndCount, CollectHashSet. Average should be in the list too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6303) Average should be in canBeCodeGened list
[ https://issues.apache.org/jira/browse/SPARK-6303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-6303: --- Issue Type: Improvement (was: Bug) Average should be in canBeCodeGened list Key: SPARK-6303 URL: https://issues.apache.org/jira/browse/SPARK-6303 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Currently canBeCodeGened only checks Sum, Count, Max, CombineSetsAndCount, CollectHashSet. Average should be in the list too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6303) Average should be in canBeCodeGened list
[ https://issues.apache.org/jira/browse/SPARK-6303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-6303: --- Priority: Minor (was: Major) Average should be in canBeCodeGened list Key: SPARK-6303 URL: https://issues.apache.org/jira/browse/SPARK-6303 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor Currently canBeCodeGened only checks Sum, Count, Max, CombineSetsAndCount, CollectHashSet. Average should be in the list too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6877) Add code generation support for Min
[ https://issues.apache.org/jira/browse/SPARK-6877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492041#comment-14492041 ] Apache Spark commented on SPARK-6877: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/5487 Add code generation support for Min --- Key: SPARK-6877 URL: https://issues.apache.org/jira/browse/SPARK-6877 Project: Spark Issue Type: New Feature Components: SQL Reporter: Liang-Chi Hsieh -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6877) Add code generation support for Min
[ https://issues.apache.org/jira/browse/SPARK-6877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6877: --- Assignee: (was: Apache Spark) Add code generation support for Min --- Key: SPARK-6877 URL: https://issues.apache.org/jira/browse/SPARK-6877 Project: Spark Issue Type: New Feature Components: SQL Reporter: Liang-Chi Hsieh -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6877) Add code generation support for Min
Liang-Chi Hsieh created SPARK-6877: -- Summary: Add code generation support for Min Key: SPARK-6877 URL: https://issues.apache.org/jira/browse/SPARK-6877 Project: Spark Issue Type: New Feature Components: SQL Reporter: Liang-Chi Hsieh -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6877) Add code generation support for Min
[ https://issues.apache.org/jira/browse/SPARK-6877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6877: --- Assignee: Apache Spark Add code generation support for Min --- Key: SPARK-6877 URL: https://issues.apache.org/jira/browse/SPARK-6877 Project: Spark Issue Type: New Feature Components: SQL Reporter: Liang-Chi Hsieh Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6303) Remove unnecessary Average in GeneratedAggregate
[ https://issues.apache.org/jira/browse/SPARK-6303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-6303: --- Description: Because {{Average}} is a {{PartialAggregate}}, we never get an {{Average}} node when reaching {{HashAggregation}} to prepare {{GeneratedAggregate}}. That is why SQLQuerySuite already has a test for {{avg}} with codegen, and it works. But there is a case in {{GeneratedAggregate}} that deals with {{Average}}. Based on the above, this case is never executed, so we can remove it from {{GeneratedAggregate}}. was:Currently canBeCodeGened only checks Sum, Count, Max, CombineSetsAndCount, CollectHashSet. Average should be in the list too. Remove unnecessary Average in GeneratedAggregate Key: SPARK-6303 URL: https://issues.apache.org/jira/browse/SPARK-6303 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor Because {{Average}} is a {{PartialAggregate}}, we never get an {{Average}} node when reaching {{HashAggregation}} to prepare {{GeneratedAggregate}}. That is why SQLQuerySuite already has a test for {{avg}} with codegen, and it works. But there is a case in {{GeneratedAggregate}} that deals with {{Average}}. Based on the above, this case is never executed, so we can remove it from {{GeneratedAggregate}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark
[ https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491993#comment-14491993 ] Alberto commented on SPARK-4783: Does it mean that you guys are going to create a PR with a fix/change proposal for this? Or are you just asking someone to create that PR? If so, I am willing to create it. System.exit() calls in SparkContext disrupt applications embedding Spark Key: SPARK-4783 URL: https://issues.apache.org/jira/browse/SPARK-4783 Project: Spark Issue Type: Bug Components: Spark Core Reporter: David Semeria A common architectural choice for integrating Spark within a larger application is to employ a gateway to handle Spark jobs. The gateway is a server which contains one or more long-running SparkContexts. A typical server is created with the following pseudo code:
{code}
var continue = true
while (continue) {
  try {
    server.run()
  } catch {
    case e: Exception => continue = log_and_examine_error(e)
  }
}
{code}
The problem is that SparkContext frequently calls System.exit when it encounters a problem, which means the server can only be re-spawned at the process level, which is much messier than the simple code above. Therefore, I believe it makes sense to replace all System.exit calls in SparkContext with the throwing of a fatal error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
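In concrete terms the proposal is a one-line change at each exit point. A sketch (SparkException is a real class; the exact call sites and message are placeholders):
{code}
import org.apache.spark.SparkException

def failInsteadOfExit(cause: Throwable): Nothing = {
  // Instead of System.exit(1), which kills the embedding server's JVM,
  // throw so the gateway's catch-and-respawn loop can handle the failure.
  throw new SparkException("Unrecoverable error in SparkContext", cause)
}
{code}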
[jira] [Assigned] (SPARK-4961) Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted processing time
[ https://issues.apache.org/jira/browse/SPARK-4961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4961: --- Assignee: (was: Apache Spark) Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted processing time --- Key: SPARK-4961 URL: https://issues.apache.org/jira/browse/SPARK-4961 Project: Spark Issue Type: Improvement Components: Scheduler Reporter: YanTang Zhai Priority: Minor HadoopRDD.getPartitions is lazily evaluated, inside DAGScheduler.JobSubmitted processing. If the input directory is large, getPartitions may take a long time; for example, in our cluster it takes from 0.029s to 766.699s. While one JobSubmitted event is being processed, the others must wait. Thus, we want to move HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted processing time, so that other JobSubmitted events don't need to wait as long. The HadoopRDD object could get its partitions when it is instantiated. We can analyse and compare the execution time before and after the optimization. TaskScheduler.start execution time: [time1__] DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or TaskScheduler.start) execution time: [time2_] HadoopRDD.getPartitions execution time: [time3___] Stages execution time: [time4_] (1) The app has only one job (a) The execution time of the job before optimization is [time1__][time2_][time3___][time4_]. The execution time of the job after optimization is [time1__][time3___][time2_][time4_]. In summary, if the app has only one job, the total execution time is the same before and after the optimization. (2) The app has 4 jobs (a) Before optimization, job1 execution time is [time2_][time3___][time4_], job2 execution time is [time2__][time3___][time4_], job3 execution time is [time2][time3___][time4_], job4 execution time is [time2_][time3___][time4_]. After optimization, job1 execution time is [time3___][time2_][time4_], job2 execution time is [time3___][time2__][time4_], job3 execution time is [time3___][time2_][time4_], job4 execution time is [time3___][time2__][time4_]. In summary, if the app has multiple jobs, the average execution time after optimization is less than before. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
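The effect can be illustrated from the caller's side: touching rdd.partitions (real API) right after creating the RDD forces the potentially slow split listing on the caller's thread instead of inside the DAGScheduler's event loop, which is essentially what the proposed change does inside HadoopRDD:
{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def eagerTextFile(sc: SparkContext, inputDir: String): RDD[String] = {
  val rdd = sc.textFile(inputDir)
  // Accessing `partitions` triggers HadoopRDD.getPartitions (input split
  // listing) now, rather than later while DAGScheduler handles JobSubmitted.
  rdd.partitions
  rdd
}
{code}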
[jira] [Assigned] (SPARK-4961) Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted processing time
[ https://issues.apache.org/jira/browse/SPARK-4961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4961: --- Assignee: Apache Spark Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted processing time --- Key: SPARK-4961 URL: https://issues.apache.org/jira/browse/SPARK-4961 Project: Spark Issue Type: Improvement Components: Scheduler Reporter: YanTang Zhai Assignee: Apache Spark Priority: Minor HadoopRDD.getPartitions is lazily evaluated, inside DAGScheduler.JobSubmitted processing. If the input directory is large, getPartitions may take a long time; for example, in our cluster it takes from 0.029s to 766.699s. While one JobSubmitted event is being processed, the others must wait. Thus, we want to move HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted processing time, so that other JobSubmitted events don't need to wait as long. The HadoopRDD object could get its partitions when it is instantiated. We can analyse and compare the execution time before and after the optimization. TaskScheduler.start execution time: [time1__] DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or TaskScheduler.start) execution time: [time2_] HadoopRDD.getPartitions execution time: [time3___] Stages execution time: [time4_] (1) The app has only one job (a) The execution time of the job before optimization is [time1__][time2_][time3___][time4_]. The execution time of the job after optimization is [time1__][time3___][time2_][time4_]. In summary, if the app has only one job, the total execution time is the same before and after the optimization. (2) The app has 4 jobs (a) Before optimization, job1 execution time is [time2_][time3___][time4_], job2 execution time is [time2__][time3___][time4_], job3 execution time is [time2][time3___][time4_], job4 execution time is [time2_][time3___][time4_]. After optimization, job1 execution time is [time3___][time2_][time4_], job2 execution time is [time3___][time2__][time4_], job3 execution time is [time3___][time2_][time4_], job4 execution time is [time3___][time2__][time4_]. In summary, if the app has multiple jobs, the average execution time after optimization is less than before. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6562) DataFrame.na.replace value support in Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-6562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6562: --- Summary: DataFrame.na.replace value support in Scala/Java (was: DataFrame.na.replace value support) DataFrame.na.replace value support in Scala/Java Key: SPARK-6562 URL: https://issues.apache.org/jira/browse/SPARK-6562 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.4.0 Support replacing a set of values with another set of values (i.e. map join), similar to Pandas' replace. http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
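What the Scala side could look like, mirroring the Pandas call. This is a usage sketch of the proposed API, with a hypothetical column name and value map; exact signatures may differ:
{code}
import org.apache.spark.sql.DataFrame

// Replace a set of sentinel values in column "height" with NaN,
// analogous to pandas.DataFrame.replace.
def cleanHeights(df: DataFrame): DataFrame =
  df.na.replace("height", Map(0.0 -> Double.NaN, -1.0 -> Double.NaN))
{code}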
[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491965#comment-14491965 ] Yu Ishikawa commented on SPARK-6682: [~josephkb] sorry, one more question. Are we allowed to add test suites in spark.examples? We don't have any test suites in spark.examples. However, I think we should have them to guarantee their behavior. And this is a good time to add them as part of this issue. Deprecate static train and use builder instead for Scala/Java - Key: SPARK-6682 URL: https://issues.apache.org/jira/browse/SPARK-6682 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley In MLlib, we have for some time been unofficially moving away from the old static train() methods and moving towards builder patterns. This JIRA is to discuss this move and (hopefully) make it official. Old static train() API:
{code}
val myModel = NaiveBayes.train(myData, ...)
{code}
New builder pattern API:
{code}
val nb = new NaiveBayes().setLambda(0.1)
val myModel = nb.train(myData)
{code}
Pros of the builder pattern: * Much less code when algorithms have many parameters. Since Java does not support default arguments, we required *many* duplicated static train() methods (for each prefix set of arguments). * Helps to enforce default parameters. Users should ideally not have to even think about setting parameters if they just want to try an algorithm quickly. * Matches spark.ml API Cons of the builder pattern: * In Python APIs, static train methods are more Pythonic. Proposal: * Scala/Java: We should start deprecating the old static train() methods. We must keep them for API stability, but deprecating will help with API consistency, making it clear that everyone should use the builder pattern. As we deprecate them, we should make sure that the builder pattern supports all parameters. * Python: Keep static train methods. CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6868) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY
[ https://issues.apache.org/jira/browse/SPARK-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6868. -- Resolution: Fixed Fix Version/s: 1.4.0 1.3.2 Assignee: Dean Chen Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY --- Key: SPARK-6868 URL: https://issues.apache.org/jira/browse/SPARK-6868 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.3.0 Reporter: Dean Chen Assignee: Dean Chen Fix For: 1.3.2, 1.4.0 Attachments: Screen Shot 2015-04-11 at 11.49.21 PM.png The stdout and stderr log links on the executor page use the http:// prefix even when the node manager does not support http and serves only https (yarn.http.policy=HTTPS_ONLY). Unfortunately, in that case the unencrypted http link does not return a 404 but a binary file full of random bytes. This causes a lot of confusion for the end user, since it looks as if the log file exists and is simply filled with garbage (see attached screenshot). The fix is to prefix container log links with https:// instead of http:// when yarn.http.policy=HTTPS_ONLY. YARN's job page has exactly this logic, as seen here: https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/JobBlock.java#L108 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
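A sketch of the scheme selection the fix describes; the configuration key comes from the report, while surrounding names such as {{nodeHttpAddress}} are illustrative:

{code}
import org.apache.hadoop.conf.Configuration

def logUrlPrefix(conf: Configuration): String =
  if (conf.get("yarn.http.policy", "HTTP_ONLY") == "HTTPS_ONLY") "https://"
  else "http://"

// usage: logUrlPrefix(yarnConf) + nodeHttpAddress + "/node/containerlogs/" + containerId
{code}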
[jira] [Updated] (SPARK-6868) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY
[ https://issues.apache.org/jira/browse/SPARK-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6868: - Priority: Minor (was: Major) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY --- Key: SPARK-6868 URL: https://issues.apache.org/jira/browse/SPARK-6868 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.3.0 Reporter: Dean Chen Assignee: Dean Chen Priority: Minor Fix For: 1.3.2, 1.4.0 Attachments: Screen Shot 2015-04-11 at 11.49.21 PM.png The stdout and stderr log links on the executor page use the http:// prefix even when the node manager does not support http and serves only https (yarn.http.policy=HTTPS_ONLY). Unfortunately, in that case the unencrypted http link does not return a 404 but a binary file full of random bytes. This causes a lot of confusion for the end user, since it looks as if the log file exists and is simply filled with garbage (see attached screenshot). The fix is to prefix container log links with https:// instead of http:// when yarn.http.policy=HTTPS_ONLY. YARN's job page has exactly this logic, as seen here: https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/JobBlock.java#L108 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6860) Fix the possible inconsistency of StreamingPage
[ https://issues.apache.org/jira/browse/SPARK-6860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6860: - Priority: Minor (was: Major) Assignee: Shixiong Zhu Fix the possible inconsistency of StreamingPage --- Key: SPARK-6860 URL: https://issues.apache.org/jira/browse/SPARK-6860 Project: Spark Issue Type: Bug Components: Streaming, Web UI Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Minor Fix For: 1.4.0 Because StreamingPage.render doesn't hold the listener lock while generating the content, different parts of the page may show inconsistent values if the listener updates its state at the same time, which confuses people. We should add listener.synchronized to make sure we have a consistent view of StreamingJobProgressListener when creating the content. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
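The shape of the fix, reduced to a self-contained sketch: every value the page needs is read under the same lock the updater takes, so updates cannot interleave with rendering.

{code}
class ProgressListener {
  private var received = 0L
  private var processed = 0L
  def onBatchCompleted(n: Long): Unit = synchronized { received += n; processed += n }
  // Snapshot both counters atomically; render code never sees one counter
  // updated while the other is stale:
  def snapshot(): (Long, Long) = synchronized { (received, processed) }
}
{code}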
[jira] [Updated] (SPARK-6870) Catch InterruptedException when yarn application state monitor thread been interrupted
[ https://issues.apache.org/jira/browse/SPARK-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6870: - Priority: Trivial (was: Minor) Assignee: Weizhong Catch InterruptedException when yarn application state monitor thread been interrupted -- Key: SPARK-6870 URL: https://issues.apache.org/jira/browse/SPARK-6870 Project: Spark Issue Type: Improvement Components: YARN Reporter: Weizhong Assignee: Weizhong Priority: Trivial Fix For: 1.4.0 In PR #5305 we interrupt the monitor thread but forget to catch the resulting InterruptedException, so its stack trace gets printed to the log; we need to catch it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
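A sketch of the pattern; the real monitor lives in the YARN client and these names are illustrative:

{code}
val monitorThread = new Thread(new Runnable {
  override def run(): Unit = {
    try {
      while (!Thread.currentThread().isInterrupted) {
        // ... poll the YARN application report ...
        Thread.sleep(1000)
      }
    } catch {
      // Expected when the driver shuts the monitor down: exit quietly
      // instead of letting the stack trace reach the log.
      case _: InterruptedException =>
    }
  }
})
monitorThread.setDaemon(true)
monitorThread.start()
{code}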
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492215#comment-14492215 ] Sean Owen commented on SPARK-1529: -- (Sorry if this double-posts.) Is there a good way to see the whole diff at the moment? I know there's a branch with individual commits; maybe I am missing something basic. This puts a new abstraction on top of a Hadoop FileSystem, which itself sits on top of the underlying file system abstraction. That's getting heavy. If it's only abstracting access to an InputStream / OutputStream, why is it needed? That's already directly available from, say, Hadoop's FileSystem. What would be the performance gain if this is the bit being swapped out? This is my original question -- you shuffle to HDFS, then read it back to send it again via the existing shuffle? It made more sense when the idea was to swap out the whole shuffle to replace its transport. Support setting spark.local.dirs to a hadoop FileSystem Key: SPARK-1529 URL: https://issues.apache.org/jira/browse/SPARK-1529 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Kannan Rajah Attachments: Spark Shuffle using HDFS.pdf In some environments, like with MapR, local volumes are accessed through the Hadoop filesystem interface. We should allow setting spark.local.dir to a Hadoop filesystem location. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
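For reference, the direct stream access referred to above is just the standard Hadoop {{FileSystem}} API (paths illustrative):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val out = fs.create(new Path("/tmp/spill-0.data")) // an OutputStream
out.write(Array[Byte](1, 2, 3))
out.close()
val in = fs.open(new Path("/tmp/spill-0.data"))    // an InputStream
in.close()
{code}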
[jira] [Created] (SPARK-6878) Sum on empty RDD fails with exception
Erik van Oosten created SPARK-6878: -- Summary: Sum on empty RDD fails with exception Key: SPARK-6878 URL: https://issues.apache.org/jira/browse/SPARK-6878 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Erik van Oosten Priority: Minor {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}. A simple fix is to replace {noformat} class DoubleRDDFunctions { def sum(): Double = self.reduce(_ + _) {noformat} with: {noformat} class DoubleRDDFunctions { def sum(): Double = self.aggregate(0.0)(_ + _, _ + _) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6762) Fix potential resource leaks in CheckPoint CheckpointWriter and CheckpointReader
[ https://issues.apache.org/jira/browse/SPARK-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6762. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5407 [https://github.com/apache/spark/pull/5407] Fix potential resource leaks in CheckPoint CheckpointWriter and CheckpointReader Key: SPARK-6762 URL: https://issues.apache.org/jira/browse/SPARK-6762 Project: Spark Issue Type: Bug Components: Streaming Reporter: zhichao-li Priority: Minor Fix For: 1.4.0 The close action should be placed within a finally block to avoid potential resource leaks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
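The standard shape of that fix, as a sketch:

{code}
import java.io.InputStream

def withStream[A](in: InputStream)(body: InputStream => A): A =
  try {
    body(in)
  } finally {
    in.close() // runs even if body throws, so the handle never leaks
  }
{code}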
[jira] [Commented] (SPARK-6878) Sum on empty RDD fails with exception
[ https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492302#comment-14492302 ] Erik van Oosten commented on SPARK-6878: Ah, yes. I now see that fold also first reduces per partition. Sum on empty RDD fails with exception - Key: SPARK-6878 URL: https://issues.apache.org/jira/browse/SPARK-6878 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Erik van Oosten Priority: Minor {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}. A simple fix is to replace {noformat} class DoubleRDDFunctions { def sum(): Double = self.reduce(_ + _) {noformat} with: {noformat} class DoubleRDDFunctions { def sum(): Double = self.aggregate(0.0)(_ + _, _ + _) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6440) ipv6 URI for HttpServer
[ https://issues.apache.org/jira/browse/SPARK-6440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6440: - Assignee: Arsenii Krasikov ipv6 URI for HttpServer --- Key: SPARK-6440 URL: https://issues.apache.org/jira/browse/SPARK-6440 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Environment: java 7 hotspot, spark 1.3.0, ipv6 only cluster Reporter: Arsenii Krasikov Assignee: Arsenii Krasikov Priority: Minor Fix For: 1.4.0 In {{org.apache.spark.HttpServer}} the uri is generated as {code:java}"spark://" + localHostname + ":" + masterPort{code}, where {{localHostname}} is {code:java}org.apache.spark.util.Utils.localHostName() = customHostname.getOrElse(localIpAddressHostname){code}. If the host has an ipv6 address then it would be interpolated into an invalid URI: {{spark://fe80:0:0:0:200:f8ff:fe21:67cf:42}} instead of {{spark://[fe80:0:0:0:200:f8ff:fe21:67cf]:42}}. The solution is to separate the uri and hostname entities. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6440) ipv6 URI for HttpServer
[ https://issues.apache.org/jira/browse/SPARK-6440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6440. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5424 [https://github.com/apache/spark/pull/5424] ipv6 URI for HttpServer --- Key: SPARK-6440 URL: https://issues.apache.org/jira/browse/SPARK-6440 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Environment: java 7 hotspot, spark 1.3.0, ipv6 only cluster Reporter: Arsenii Krasikov Priority: Minor Fix For: 1.4.0 In {{org.apache.spark.HttpServer}} the uri is generated as {code:java}"spark://" + localHostname + ":" + masterPort{code}, where {{localHostname}} is {code:java}org.apache.spark.util.Utils.localHostName() = customHostname.getOrElse(localIpAddressHostname){code}. If the host has an ipv6 address then it would be interpolated into an invalid URI: {{spark://fe80:0:0:0:200:f8ff:fe21:67cf:42}} instead of {{spark://[fe80:0:0:0:200:f8ff:fe21:67cf]:42}}. The solution is to separate the uri and hostname entities. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
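A sketch of the bracketing rule (RFC 2732 style); note that {{java.net.URI}}'s multi-argument constructor applies the same rule automatically:

{code}
def sparkUri(host: String, port: Int): String = {
  // An IPv6 literal contains ':' and must be wrapped in brackets inside a URI.
  val h = if (host.contains(":") && !host.startsWith("[")) s"[$host]" else host
  s"spark://$h:$port"
}

sparkUri("fe80:0:0:0:200:f8ff:fe21:67cf", 42)
// => spark://[fe80:0:0:0:200:f8ff:fe21:67cf]:42
{code}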
[jira] [Resolved] (SPARK-6870) Catch InterruptedException when yarn application state monitor thread been interrupted
[ https://issues.apache.org/jira/browse/SPARK-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6870. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5479 [https://github.com/apache/spark/pull/5479] Catch InterruptedException when yarn application state monitor thread been interrupted -- Key: SPARK-6870 URL: https://issues.apache.org/jira/browse/SPARK-6870 Project: Spark Issue Type: Improvement Components: YARN Reporter: Weizhong Priority: Minor Fix For: 1.4.0 In PR #5305 we interrupt the monitor thread but forget to catch the resulting InterruptedException, so its stack trace gets printed to the log; we need to catch it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6738) EstimateSize is difference with spill file size
[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6738. -- Resolution: Not A Problem We can reopen if there is more detail, but the problem report focuses on the size of one spill file when there are lots of them. The in-memory size is also not necessarily the on-disk size. I haven't seen a report of a concrete problem here either, like something that then fails. EstimateSize is difference with spill file size Key: SPARK-6738 URL: https://issues.apache.org/jira/browse/SPARK-6738 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Hong Shen ExternalAppendOnlyMap spills 2.2 GB of data to disk: {code} 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling in-memory map of 2.2 GB to disk (61 times so far) 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} But the file size is only 2.2M: {code} ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/ total 2.2M -rw-r----- 1 spark users 2.2M Apr 7 20:27 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} The GC log shows that the JVM memory usage is less than 1 GB: {code} 2015-04-07T20:27:08.023+0800: [GC 981981K->55363K(3961344K), 0.0341720 secs] 2015-04-07T20:27:14.483+0800: [GC 987523K->53737K(3961344K), 0.0252660 secs] 2015-04-07T20:27:20.793+0800: [GC 985897K->56370K(3961344K), 0.0606460 secs] 2015-04-07T20:27:27.553+0800: [GC 988530K->59089K(3961344K), 0.0651840 secs] 2015-04-07T20:27:34.067+0800: [GC 991249K->62153K(3961344K), 0.0288460 secs] 2015-04-07T20:27:40.180+0800: [GC 994313K->61344K(3961344K), 0.0388970 secs] 2015-04-07T20:27:46.490+0800: [GC 993504K->59915K(3961344K), 0.0235150 secs] {code} The estimated size is hugely different from the spill file size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
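One reason for the gap, shown in miniature: the spill path serializes and compresses records, so repetitive data shrinks drastically compared with its live JVM footprint. Plain Java serialization and GZIP stand in for Spark's serializer and compression codec in this sketch:

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.util.zip.GZIPOutputStream

val buf = new ByteArrayOutputStream()
val out = new ObjectOutputStream(new GZIPOutputStream(buf))
(1 to 100000).foreach(i => out.writeObject(("key" + (i % 10), 1.0)))
out.close()
// buf.size() is a small fraction of what these tuples occupy as live objects
// (headers, pointers, boxing), which is what the in-memory estimate tracks.
println(buf.size())
{code}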
[jira] [Updated] (SPARK-6762) Fix potential resource leaks in CheckPoint CheckpointWriter and CheckpointReader
[ https://issues.apache.org/jira/browse/SPARK-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6762: - Assignee: zhichao-li Fix potential resource leaks in CheckPoint CheckpointWriter and CheckpointReader Key: SPARK-6762 URL: https://issues.apache.org/jira/browse/SPARK-6762 Project: Spark Issue Type: Bug Components: Streaming Reporter: zhichao-li Assignee: zhichao-li Priority: Minor Fix For: 1.4.0 The close action should be placed within a finally block to avoid potential resource leaks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5281) Registering table on RDD is giving MissingRequirementError
[ https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492246#comment-14492246 ] Yajun Dong commented on SPARK-5281: --- I also have this issue with Eclipse Luna and spark 1.3.0, any idea? Registering table on RDD is giving MissingRequirementError -- Key: SPARK-5281 URL: https://issues.apache.org/jira/browse/SPARK-5281 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: sarsol Priority: Critical Application crashes on this line {{rdd.registerTempTable("temp")}} in the 1.2 version when using sbt or the Eclipse Scala IDE Stacktrace: {code} Exception in thread "main" scala.reflect.internal.MissingRequirementError: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot classpath [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program Files\Java\jre7\lib\resources.jar;C:\Program Files\Java\jre7\lib\rt.jar;C:\Program Files\Java\jre7\lib\sunrsasign.jar;C:\Program Files\Java\jre7\lib\jsse.jar;C:\Program Files\Java\jre7\lib\jce.jar;C:\Program Files\Java\jre7\lib\charsets.jar;C:\Program Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found. at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335) at scala.reflect.api.Universe.typeOf(Universe.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94) at org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33) at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111) at com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43) at scala.Function0$class.apply$mcV$sp(Function0.scala:40) at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12) at
scala.App$$anonfun$main$1.apply(App.scala:71) at scala.App$$anonfun$main$1.apply(App.scala:71) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32) at scala.App$class.main(App.scala:71) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.
[ https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6800: --- Assignee: Apache Spark Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results. -- Key: SPARK-6800 URL: https://issues.apache.org/jira/browse/SPARK-6800 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: Windows 8.1, Apache Derby DB, Spark 1.3.0 CDH5.4.0, Scala 2.10 Reporter: Micael Capitão Assignee: Apache Spark Having a Derby table with people info (id, name, age) defined like this: {code} val jdbcUrl = "jdbc:derby:memory:PeopleDB;create=true" val conn = DriverManager.getConnection(jdbcUrl) val stmt = conn.createStatement() stmt.execute("CREATE TABLE Person (person_id INT NOT NULL GENERATED ALWAYS AS IDENTITY CONSTRAINT person_pk PRIMARY KEY, name VARCHAR(50), age INT)") stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Carvalho', 50)") stmt.execute("INSERT INTO Person(name, age) VALUES('Lurdes Pereira', 23)") stmt.execute("INSERT INTO Person(name, age) VALUES('Ana Rita Costa', 12)") stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Pereira', 32)") stmt.execute("INSERT INTO Person(name, age) VALUES('Miguel Costa', 15)") stmt.execute("INSERT INTO Person(name, age) VALUES('Anabela Sintra', 13)") {code} If I try to read that table from Spark SQL with lower/upper bounds, like this: {code} val people = sqlContext.jdbc(url = jdbcUrl, table = "Person", columnName = "age", lowerBound = 0, upperBound = 40, numPartitions = 10) people.show() {code} I get this result: {noformat} PERSON_ID NAME AGE 3 Ana Rita Costa 12 5 Miguel Costa 15 6 Anabela Sintra 13 2 Lurdes Pereira 23 4 Armando Pereira 32 1 Armando Carvalho 50 {noformat} Which is wrong, considering the defined upper bound has been ignored (I get a person with age 50!). Digging the code, I've found that in {{JDBCRelation.columnPartition}} the WHERE clauses it generates are the following: {code} (0) age < 4,0 (1) age >= 4 AND age < 8,1 (2) age >= 8 AND age < 12,2 (3) age >= 12 AND age < 16,3 (4) age >= 16 AND age < 20,4 (5) age >= 20 AND age < 24,5 (6) age >= 24 AND age < 28,6 (7) age >= 28 AND age < 32,7 (8) age >= 32 AND age < 36,8 (9) age >= 36,9 {code} The last condition ignores the upper bound and the other ones may result in repeated rows being read. Using the JdbcRDD (and converting it to a DataFrame) I would have something like this: {code} val jdbcRdd = new JdbcRDD(sc, () => DriverManager.getConnection(jdbcUrl), "SELECT * FROM Person WHERE age >= ? and age <= ?", 0, 40, 10, rs => (rs.getInt(1), rs.getString(2), rs.getInt(3))) val people = jdbcRdd.toDF("PERSON_ID", "NAME", "AGE") people.show() {code} Resulting in: {noformat} PERSON_ID NAME AGE 3 Ana Rita Costa 12 5 Miguel Costa 15 6 Anabela Sintra 13 2 Lurdes Pereira 23 4 Armando Pereira 32 {noformat} Which is correct! Confirming the WHERE clauses generated by the JdbcRDD in the {{getPartitions}} I've found it generates the following: {code} (0) age >= 0 AND age <= 3 (1) age >= 4 AND age <= 7 (2) age >= 8 AND age <= 11 (3) age >= 12 AND age <= 15 (4) age >= 16 AND age <= 19 (5) age >= 20 AND age <= 23 (6) age >= 24 AND age <= 27 (7) age >= 28 AND age <= 31 (8) age >= 32 AND age <= 35 (9) age >= 36 AND age <= 40 {code} This is the behaviour I was expecting from the Spark SQL version. Is the Spark SQL version buggy or is this some weird expected behaviour?
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
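A sketch of stride-based clause generation that respects both bounds, mirroring what the reporter observed from JdbcRDD (illustrative, not the actual patch):

{code}
def partitionClauses(col: String, lower: Long, upper: Long, n: Int): Seq[String] = {
  val stride = (upper - lower) / n
  (0 until n).map { i =>
    val lo = lower + i * stride
    val hi = if (i == n - 1) upper else lo + stride - 1
    s"$col >= $lo AND $col <= $hi" // inclusive, non-overlapping, bounded
  }
}

partitionClauses("age", 0, 40, 10)
// => age >= 0 AND age <= 3, ..., age >= 36 AND age <= 40
{code}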
[jira] [Commented] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.
[ https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492244#comment-14492244 ] Apache Spark commented on SPARK-6800: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/5488 Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results. -- Key: SPARK-6800 URL: https://issues.apache.org/jira/browse/SPARK-6800 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: Windows 8.1, Apache Derby DB, Spark 1.3.0 CDH5.4.0, Scala 2.10 Reporter: Micael Capitão Having a Derby table with people info (id, name, age) defined like this: {code} val jdbcUrl = "jdbc:derby:memory:PeopleDB;create=true" val conn = DriverManager.getConnection(jdbcUrl) val stmt = conn.createStatement() stmt.execute("CREATE TABLE Person (person_id INT NOT NULL GENERATED ALWAYS AS IDENTITY CONSTRAINT person_pk PRIMARY KEY, name VARCHAR(50), age INT)") stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Carvalho', 50)") stmt.execute("INSERT INTO Person(name, age) VALUES('Lurdes Pereira', 23)") stmt.execute("INSERT INTO Person(name, age) VALUES('Ana Rita Costa', 12)") stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Pereira', 32)") stmt.execute("INSERT INTO Person(name, age) VALUES('Miguel Costa', 15)") stmt.execute("INSERT INTO Person(name, age) VALUES('Anabela Sintra', 13)") {code} If I try to read that table from Spark SQL with lower/upper bounds, like this: {code} val people = sqlContext.jdbc(url = jdbcUrl, table = "Person", columnName = "age", lowerBound = 0, upperBound = 40, numPartitions = 10) people.show() {code} I get this result: {noformat} PERSON_ID NAME AGE 3 Ana Rita Costa 12 5 Miguel Costa 15 6 Anabela Sintra 13 2 Lurdes Pereira 23 4 Armando Pereira 32 1 Armando Carvalho 50 {noformat} Which is wrong, considering the defined upper bound has been ignored (I get a person with age 50!). Digging the code, I've found that in {{JDBCRelation.columnPartition}} the WHERE clauses it generates are the following: {code} (0) age < 4,0 (1) age >= 4 AND age < 8,1 (2) age >= 8 AND age < 12,2 (3) age >= 12 AND age < 16,3 (4) age >= 16 AND age < 20,4 (5) age >= 20 AND age < 24,5 (6) age >= 24 AND age < 28,6 (7) age >= 28 AND age < 32,7 (8) age >= 32 AND age < 36,8 (9) age >= 36,9 {code} The last condition ignores the upper bound and the other ones may result in repeated rows being read. Using the JdbcRDD (and converting it to a DataFrame) I would have something like this: {code} val jdbcRdd = new JdbcRDD(sc, () => DriverManager.getConnection(jdbcUrl), "SELECT * FROM Person WHERE age >= ? and age <= ?", 0, 40, 10, rs => (rs.getInt(1), rs.getString(2), rs.getInt(3))) val people = jdbcRdd.toDF("PERSON_ID", "NAME", "AGE") people.show() {code} Resulting in: {noformat} PERSON_ID NAME AGE 3 Ana Rita Costa 12 5 Miguel Costa 15 6 Anabela Sintra 13 2 Lurdes Pereira 23 4 Armando Pereira 32 {noformat} Which is correct! Confirming the WHERE clauses generated by the JdbcRDD in the {{getPartitions}} I've found it generates the following: {code} (0) age >= 0 AND age <= 3 (1) age >= 4 AND age <= 7 (2) age >= 8 AND age <= 11 (3) age >= 12 AND age <= 15 (4) age >= 16 AND age <= 19 (5) age >= 20 AND age <= 23 (6) age >= 24 AND age <= 27 (7) age >= 28 AND age <= 31 (8) age >= 32 AND age <= 35 (9) age >= 36 AND age <= 40 {code} This is the behaviour I was expecting from the Spark SQL version. Is the Spark SQL version buggy or is this some weird expected behaviour?
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6860) Fix the possible inconsistency of StreamingPage
[ https://issues.apache.org/jira/browse/SPARK-6860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6860. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5470 [https://github.com/apache/spark/pull/5470] Fix the possible inconsistency of StreamingPage --- Key: SPARK-6860 URL: https://issues.apache.org/jira/browse/SPARK-6860 Project: Spark Issue Type: Bug Components: Streaming, Web UI Reporter: Shixiong Zhu Fix For: 1.4.0 Because StreamingPage.render doesn't hold the listener lock while generating the content, different parts of the page may show inconsistent values if the listener updates its state at the same time, which confuses people. We should add listener.synchronized to make sure we have a consistent view of StreamingJobProgressListener when creating the content. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6878) Sum on empty RDD fails with exception
[ https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492284#comment-14492284 ] Sean Owen commented on SPARK-6878: -- Yes, and I think it could even be a little simpler by calling {{fold(0.0)(_ + _)}}? Sum on empty RDD fails with exception - Key: SPARK-6878 URL: https://issues.apache.org/jira/browse/SPARK-6878 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Erik van Oosten Priority: Minor {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}. A simple fix is to replace {noformat} class DoubleRDDFunctions { def sum(): Double = self.reduce(_ + _) {noformat} with: {noformat} class DoubleRDDFunctions { def sum(): Double = self.aggregate(0.0)(_ + _, _ + _) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
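A quick spark-shell check of the suggestion:

{code}
val empty = sc.parallelize(Seq.empty[Double])
// empty.reduce(_ + _)             // throws: empty collection
empty.fold(0.0)(_ + _)             // 0.0 -- the zero value seeds every partition
empty.aggregate(0.0)(_ + _, _ + _) // 0.0 as well
{code}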
[jira] [Assigned] (SPARK-6878) Sum on empty RDD fails with exception
[ https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6878: --- Assignee: Apache Spark Sum on empty RDD fails with exception - Key: SPARK-6878 URL: https://issues.apache.org/jira/browse/SPARK-6878 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Erik van Oosten Assignee: Apache Spark Priority: Minor {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}. A simple fix is to replace {noformat} class DoubleRDDFunctions { def sum(): Double = self.reduce(_ + _) {noformat} with: {noformat} class DoubleRDDFunctions { def sum(): Double = self.aggregate(0.0)(_ + _, _ + _) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6878) Sum on empty RDD fails with exception
[ https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492336#comment-14492336 ] Erik van Oosten commented on SPARK-6878: Pull request: https://github.com/apache/spark/pull/5489 Sum on empty RDD fails with exception - Key: SPARK-6878 URL: https://issues.apache.org/jira/browse/SPARK-6878 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Erik van Oosten Priority: Minor {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}. A simple fix is to replace {noformat} class DoubleRDDFunctions { def sum(): Double = self.reduce(_ + _) {noformat} with: {noformat} class DoubleRDDFunctions { def sum(): Double = self.aggregate(0.0)(_ + _, _ + _) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6878) Sum on empty RDD fails with exception
[ https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492271#comment-14492271 ] Sean Owen commented on SPARK-6878: -- Interesting question -- what's the expected sum of nothing at all? Although I can see the argument both ways, 0 is probably the better result, since {{Array[Double]().sum}} is 0. So {{sc.parallelize(Array[Double]()).sum}} should be as well. Want to make a PR? Sum on empty RDD fails with exception - Key: SPARK-6878 URL: https://issues.apache.org/jira/browse/SPARK-6878 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Erik van Oosten Priority: Minor {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}. A simple fix is to replace {noformat} class DoubleRDDFunctions { def sum(): Double = self.reduce(_ + _) {noformat} with: {noformat} class DoubleRDDFunctions { def sum(): Double = self.aggregate(0.0)(_ + _, _ + _) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6878) Sum on empty RDD fails with exception
[ https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492282#comment-14492282 ] Erik van Oosten commented on SPARK-6878: The answer is only defined because the RDD is an {{RDD[Double]}} :) Sure, I'll make a PR. Sum on empty RDD fails with exception - Key: SPARK-6878 URL: https://issues.apache.org/jira/browse/SPARK-6878 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Erik van Oosten Priority: Minor {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}. A simple fix is to replace {noformat} class DoubleRDDFunctions { def sum(): Double = self.reduce(_ + _) {noformat} with: {noformat} class DoubleRDDFunctions { def sum(): Double = self.aggregate(0.0)(_ + _, _ + _) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.
[ https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6800: --- Assignee: (was: Apache Spark) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results. -- Key: SPARK-6800 URL: https://issues.apache.org/jira/browse/SPARK-6800 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: Windows 8.1, Apache Derby DB, Spark 1.3.0 CDH5.4.0, Scala 2.10 Reporter: Micael Capitão Having a Derby table with people info (id, name, age) defined like this: {code} val jdbcUrl = "jdbc:derby:memory:PeopleDB;create=true" val conn = DriverManager.getConnection(jdbcUrl) val stmt = conn.createStatement() stmt.execute("CREATE TABLE Person (person_id INT NOT NULL GENERATED ALWAYS AS IDENTITY CONSTRAINT person_pk PRIMARY KEY, name VARCHAR(50), age INT)") stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Carvalho', 50)") stmt.execute("INSERT INTO Person(name, age) VALUES('Lurdes Pereira', 23)") stmt.execute("INSERT INTO Person(name, age) VALUES('Ana Rita Costa', 12)") stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Pereira', 32)") stmt.execute("INSERT INTO Person(name, age) VALUES('Miguel Costa', 15)") stmt.execute("INSERT INTO Person(name, age) VALUES('Anabela Sintra', 13)") {code} If I try to read that table from Spark SQL with lower/upper bounds, like this: {code} val people = sqlContext.jdbc(url = jdbcUrl, table = "Person", columnName = "age", lowerBound = 0, upperBound = 40, numPartitions = 10) people.show() {code} I get this result: {noformat} PERSON_ID NAME AGE 3 Ana Rita Costa 12 5 Miguel Costa 15 6 Anabela Sintra 13 2 Lurdes Pereira 23 4 Armando Pereira 32 1 Armando Carvalho 50 {noformat} Which is wrong, considering the defined upper bound has been ignored (I get a person with age 50!). Digging the code, I've found that in {{JDBCRelation.columnPartition}} the WHERE clauses it generates are the following: {code} (0) age < 4,0 (1) age >= 4 AND age < 8,1 (2) age >= 8 AND age < 12,2 (3) age >= 12 AND age < 16,3 (4) age >= 16 AND age < 20,4 (5) age >= 20 AND age < 24,5 (6) age >= 24 AND age < 28,6 (7) age >= 28 AND age < 32,7 (8) age >= 32 AND age < 36,8 (9) age >= 36,9 {code} The last condition ignores the upper bound and the other ones may result in repeated rows being read. Using the JdbcRDD (and converting it to a DataFrame) I would have something like this: {code} val jdbcRdd = new JdbcRDD(sc, () => DriverManager.getConnection(jdbcUrl), "SELECT * FROM Person WHERE age >= ? and age <= ?", 0, 40, 10, rs => (rs.getInt(1), rs.getString(2), rs.getInt(3))) val people = jdbcRdd.toDF("PERSON_ID", "NAME", "AGE") people.show() {code} Resulting in: {noformat} PERSON_ID NAME AGE 3 Ana Rita Costa 12 5 Miguel Costa 15 6 Anabela Sintra 13 2 Lurdes Pereira 23 4 Armando Pereira 32 {noformat} Which is correct! Confirming the WHERE clauses generated by the JdbcRDD in the {{getPartitions}} I've found it generates the following: {code} (0) age >= 0 AND age <= 3 (1) age >= 4 AND age <= 7 (2) age >= 8 AND age <= 11 (3) age >= 12 AND age <= 15 (4) age >= 16 AND age <= 19 (5) age >= 20 AND age <= 23 (6) age >= 24 AND age <= 27 (7) age >= 28 AND age <= 31 (8) age >= 32 AND age <= 35 (9) age >= 36 AND age <= 40 {code} This is the behaviour I was expecting from the Spark SQL version. Is the Spark SQL version buggy or is this some weird expected behaviour?
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492303#comment-14492303 ] Steve Loughran commented on SPARK-1537: --- HADOOP-11826 patches the hadoop compatibility document to add timeline server to the list of stable APIs. Add integration with Yarn's Application Timeline Server --- Key: SPARK-1537 URL: https://issues.apache.org/jira/browse/SPARK-1537 Project: Spark Issue Type: New Feature Components: YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Attachments: SPARK-1537.txt, spark-1573.patch It would be nice to have Spark integrate with Yarn's Application Timeline Server (see YARN-321, YARN-1530). This would allow users running Spark on Yarn to have a single place to go for all their history needs, and avoid having to manage a separate service (Spark's built-in server). At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, although there is still some ongoing work. But the basics are there, and I wouldn't expect them to change (much) at this point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6671) Add status command for spark daemons
[ https://issues.apache.org/jira/browse/SPARK-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6671. -- Resolution: Fixed Issue resolved by pull request 5327 [https://github.com/apache/spark/pull/5327] Add status command for spark daemons Key: SPARK-6671 URL: https://issues.apache.org/jira/browse/SPARK-6671 Project: Spark Issue Type: Improvement Components: Deploy Reporter: PRADEEP CHANUMOLU Labels: easyfix Fix For: 1.4.0 Original Estimate: 24h Remaining Estimate: 24h Currently, using the spark-daemon.sh script we can start and stop the Spark daemons, but we cannot get the status of the daemons. It would be nice to include a status command in the spark-daemon.sh script, through which we can know whether a Spark daemon is alive or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
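The change itself is to the spark-daemon.sh shell script; the pid-file check it performs looks roughly like this, transcribed into Scala for illustration (the Linux-specific /proc lookup stands in for the script's {{kill -0}} check):

{code}
import java.nio.file.{Files, Paths}

def status(pidFile: String): String = {
  val p = Paths.get(pidFile)
  if (!Files.exists(p)) "daemon not running (no pid file)"
  else {
    val pid = new String(Files.readAllBytes(p)).trim
    if (Files.exists(Paths.get(s"/proc/$pid"))) s"daemon running (pid $pid)"
    else s"daemon dead, but pid file exists (pid $pid)"
  }
}
{code}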
[jira] [Updated] (SPARK-6671) Add status command for spark daemons
[ https://issues.apache.org/jira/browse/SPARK-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6671: - Priority: Minor (was: Major) Assignee: PRADEEP CHANUMOLU Add status command for spark daemons Key: SPARK-6671 URL: https://issues.apache.org/jira/browse/SPARK-6671 Project: Spark Issue Type: Improvement Components: Deploy Reporter: PRADEEP CHANUMOLU Assignee: PRADEEP CHANUMOLU Priority: Minor Labels: easyfix Fix For: 1.4.0 Original Estimate: 24h Remaining Estimate: 24h Currently, using the spark-daemon.sh script we can start and stop the Spark daemons, but we cannot get the status of the daemons. It would be nice to include a status command in the spark-daemon.sh script, through which we can know whether a Spark daemon is alive or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6879) Check if the app is completed before clean it up
Tao Wang created SPARK-6879: --- Summary: Check if the app is completed before clean it up Key: SPARK-6879 URL: https://issues.apache.org/jira/browse/SPARK-6879 Project: Spark Issue Type: Bug Components: Deploy Reporter: Tao Wang Now the history server deletes directories that have expired according to their modification time. This is not good for long-running applications, as they might be deleted before they finish. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
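A sketch of the guard being proposed; the field names are hypothetical:

{code}
case class AppLog(path: String, lastUpdatedMs: Long, completed: Boolean)

// Only logs that are BOTH expired and marked completed are eligible for
// cleanup, so a long-running app's directory is never deleted mid-flight.
def cleanupCandidates(logs: Seq[AppLog], nowMs: Long, maxAgeMs: Long): Seq[AppLog] =
  logs.filter(l => l.completed && nowMs - l.lastUpdatedMs > maxAgeMs)
{code}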
[jira] [Updated] (SPARK-6878) Sum on empty RDD fails with exception
[ https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik van Oosten updated SPARK-6878: --- Flags: Patch Sum on empty RDD fails with exception - Key: SPARK-6878 URL: https://issues.apache.org/jira/browse/SPARK-6878 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Erik van Oosten Priority: Minor {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}. A simple fix is to replace {noformat} class DoubleRDDFunctions { def sum(): Double = self.reduce(_ + _) {noformat} with: {noformat} class DoubleRDDFunctions { def sum(): Double = self.aggregate(0.0)(_ + _, _ + _) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6878) Sum on empty RDD fails with exception
[ https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6878: --- Assignee: (was: Apache Spark) Sum on empty RDD fails with exception - Key: SPARK-6878 URL: https://issues.apache.org/jira/browse/SPARK-6878 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Erik van Oosten Priority: Minor {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}. A simple fix is to replace {noformat} class DoubleRDDFunctions { def sum(): Double = self.reduce(_ + _) {noformat} with: {noformat} class DoubleRDDFunctions { def sum(): Double = self.aggregate(0.0)(_ + _, _ + _) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6878) Sum on empty RDD fails with exception
[ https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492335#comment-14492335 ] Apache Spark commented on SPARK-6878: - User 'erikvanoosten' has created a pull request for this issue: https://github.com/apache/spark/pull/5489 Sum on empty RDD fails with exception - Key: SPARK-6878 URL: https://issues.apache.org/jira/browse/SPARK-6878 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Erik van Oosten Priority: Minor {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}. A simple fix is to replace {noformat} class DoubleRDDFunctions { def sum(): Double = self.reduce(_ + _) {noformat} with: {noformat} class DoubleRDDFunctions { def sum(): Double = self.aggregate(0.0)(_ + _, _ + _) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6875) Add support for Joda-time types
[ https://issues.apache.org/jira/browse/SPARK-6875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Grandjean updated SPARK-6875: - Description: The need comes from the following use case: val objs: RDD[MyClass] = [...] val sqlC = new org.apache.spark.sql.SQLContext(sc) import sqlC._ objs.saveAsParquetFile(parquet) MyClass contains joda-time fields. When saving to parquet file, an exception is thrown (matchError in ScalaReflection.scala). Spark SQL supports java SQL date/time types. This request is to add support for Joda-time types. It is possible to define UDT's using the @SQLUserDefinedType annotation. However, in addition to annotations, it would be nice to be able to programmatically/dynamically add UDTs. was: The need comes from the following use case: val objs: RDD[MyClass] = [...] val sqlC = new org.apache.spark.sql.SQLContext(sc) import sqlC._ objs.saveAsParquetFile(parquet) MyClass contains joda-time fields. When saving to parquet file, an exception is thrown (matchError in ScalaReflection.scala). Spark SQL supports java SQL date/time types. This request is to add support for Joda-time types. Another alternative would be, in addition to annotations, to be able to programmatically and dynamically add UDTs. Add support for Joda-time types --- Key: SPARK-6875 URL: https://issues.apache.org/jira/browse/SPARK-6875 Project: Spark Issue Type: Improvement Components: SQL Reporter: Patrick Grandjean The need comes from the following use case: val objs: RDD[MyClass] = [...] val sqlC = new org.apache.spark.sql.SQLContext(sc) import sqlC._ objs.saveAsParquetFile(parquet) MyClass contains joda-time fields. When saving to parquet file, an exception is thrown (matchError in ScalaReflection.scala). Spark SQL supports java SQL date/time types. This request is to add support for Joda-time types. It is possible to define UDT's using the @SQLUserDefinedType annotation. However, in addition to annotations, it would be nice to be able to programmatically/dynamically add UDTs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
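For reference, the annotation route mentioned in the description looks roughly like this; the UDT API is a developer API whose signatures have shifted between versions, so treat this as a sketch:

{code}
import org.apache.spark.sql.types._
import org.joda.time.DateTime

class JodaTimestampUDT extends UserDefinedType[JodaTimestamp] {
  override def sqlType: DataType = LongType
  override def serialize(obj: Any): Any = obj.asInstanceOf[JodaTimestamp].dt.getMillis
  override def deserialize(datum: Any): JodaTimestamp =
    JodaTimestamp(new DateTime(datum.asInstanceOf[Long]))
  override def userClass: Class[JodaTimestamp] = classOf[JodaTimestamp]
}

// The annotation must sit on a class you own -- which is exactly why the
// reporter asks for programmatic registration for third-party types.
@SQLUserDefinedType(udt = classOf[JodaTimestampUDT])
case class JodaTimestamp(dt: DateTime)
{code}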
[jira] [Resolved] (SPARK-6352) Supporting non-default OutputCommitter when using saveAsParquetFile
[ https://issues.apache.org/jira/browse/SPARK-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6352. --- Resolution: Fixed Fix Version/s: 1.4.0 Target Version/s: 1.4.0 Resolved by https://github.com/apache/spark/pull/5042 [~pwendell] Tried to assign this ticket to [~pllee], but couldn't put his name in the Assignee field. Do we need to set some privilege stuff? Supporting non-default OutputCommitter when using saveAsParquetFile --- Key: SPARK-6352 URL: https://issues.apache.org/jira/browse/SPARK-6352 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1, 1.2.1, 1.3.0 Reporter: Pei-Lun Lee Fix For: 1.4.0 SPARK-3595 only handles a custom OutputCommitter for saveAsHadoopFile; it would be nice to have similar behavior in saveAsParquetFile. It may be difficult to have a fully customizable OutputCommitter solution, but at least adding something like a DirectParquetOutputCommitter and letting users choose between it and the default should be enough. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
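Once merged, choosing the committer is a configuration switch; the key and class names below follow the linked PR but are assumptions to verify against the 1.4 release:

{code}
// Key and class name are assumptions, not confirmed against a release:
sqlContext.setConf("spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
df.saveAsParquetFile("/tmp/out.parquet")
{code}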
[jira] [Updated] (SPARK-6875) Add support for Joda-time types
[ https://issues.apache.org/jira/browse/SPARK-6875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Grandjean updated SPARK-6875: - Description: The need comes from the following use case: val objs: RDD[MyClass] = [...] val sqlC = new org.apache.spark.sql.SQLContext(sc) import sqlC._ objs.saveAsParquetFile(parquet) MyClass contains joda-time fields. When saving to parquet file, an exception is thrown (matchError in ScalaReflection.scala). Spark SQL supports java SQL date/time types. This request is to add support for Joda-time types. Another alternative would be, in addition to annotations, to be able to programmatically and dynamically add UDTs. was: The need comes from the following use case: val objs: RDD[MyClass] = [...] val sqlC = new org.apache.spark.sql.SQLContext(sc) import sqlC._ objs.saveAsParquetFile(parquet) MyClass contains joda-time fields. When saving to parquet file, an exception is thrown (matchError in ScalaReflection.scala). Spark SQL supports java SQL date/time types. This request is to add support for Joda-time types. Add support for Joda-time types --- Key: SPARK-6875 URL: https://issues.apache.org/jira/browse/SPARK-6875 Project: Spark Issue Type: Improvement Components: SQL Reporter: Patrick Grandjean The need comes from the following use case: val objs: RDD[MyClass] = [...] val sqlC = new org.apache.spark.sql.SQLContext(sc) import sqlC._ objs.saveAsParquetFile(parquet) MyClass contains joda-time fields. When saving to parquet file, an exception is thrown (matchError in ScalaReflection.scala). Spark SQL supports java SQL date/time types. This request is to add support for Joda-time types. Another alternative would be, in addition to annotations, to be able to programmatically and dynamically add UDTs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6207) YARN secure cluster mode doesn't obtain a hive-metastore token
[ https://issues.apache.org/jira/browse/SPARK-6207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-6207. -- Resolution: Fixed Fix Version/s: 1.4.0 YARN secure cluster mode doesn't obtain a hive-metastore token --- Key: SPARK-6207 URL: https://issues.apache.org/jira/browse/SPARK-6207 Project: Spark Issue Type: Bug Components: Spark Submit, SQL, YARN Affects Versions: 1.2.0, 1.2.1, 1.3.0 Environment: YARN Reporter: Doug Balog Fix For: 1.4.0 When running a spark job, on YARN in secure mode, with --deploy-mode cluster, org.apache.spark.deploy.yarn.Client() does not obtain a delegation token to the hive-metastore. Therefore any attempts to talk to the hive-metastore fail with a GSSException: No valid credentials provided... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5689) Document what can be run in different YARN modes
[ https://issues.apache.org/jira/browse/SPARK-5689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5689: --- Assignee: (was: Apache Spark) Document what can be run in different YARN modes Key: SPARK-5689 URL: https://issues.apache.org/jira/browse/SPARK-5689 Project: Spark Issue Type: Documentation Components: YARN Affects Versions: 1.1.0 Reporter: Thomas Graves We should document what can be run in the different YARN modes. For instance, the interactive shell only works in yarn-client mode; recently, with https://github.com/apache/spark/pull/3976, users can run Python scripts in cluster mode, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6879) Check if the app is completed before clean it up
[ https://issues.apache.org/jira/browse/SPARK-6879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492348#comment-14492348 ] Apache Spark commented on SPARK-6879: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/5491 Check if the app is completed before clean it up Key: SPARK-6879 URL: https://issues.apache.org/jira/browse/SPARK-6879 Project: Spark Issue Type: Bug Components: Deploy Reporter: Tao Wang Now the history server deletes directories that have expired according to their modification time. This is not good for long-running applications, as they might be deleted before they finish. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6879) Check if the app is completed before clean it up
[ https://issues.apache.org/jira/browse/SPARK-6879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6879: --- Assignee: Apache Spark Check if the app is completed before clean it up Key: SPARK-6879 URL: https://issues.apache.org/jira/browse/SPARK-6879 Project: Spark Issue Type: Bug Components: Deploy Reporter: Tao Wang Assignee: Apache Spark Now the history server deletes directories that have expired according to their modification time. This is not good for long-running applications, as they might be deleted before they finish. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6879) Check if the app is completed before clean it up
[ https://issues.apache.org/jira/browse/SPARK-6879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6879: --- Assignee: (was: Apache Spark) Check if the app is completed before clean it up Key: SPARK-6879 URL: https://issues.apache.org/jira/browse/SPARK-6879 Project: Spark Issue Type: Bug Components: Deploy Reporter: Tao Wang Now the history server deletes directories that have expired according to their modification time. This is not good for long-running applications, as they might be deleted before they finish. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5689) Document what can be run in different YARN modes
[ https://issues.apache.org/jira/browse/SPARK-5689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492349#comment-14492349 ] Apache Spark commented on SPARK-5689: - User 'Sephiroth-Lin' has created a pull request for this issue: https://github.com/apache/spark/pull/5490 Document what can be run in different YARN modes Key: SPARK-5689 URL: https://issues.apache.org/jira/browse/SPARK-5689 Project: Spark Issue Type: Documentation Components: YARN Affects Versions: 1.1.0 Reporter: Thomas Graves We should document what can be run in the different YARN modes. For instance, the interactive shell only works in yarn-client mode; recently, with https://github.com/apache/spark/pull/3976, users can run Python scripts in cluster mode, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5689) Document what can be run in different YARN modes
[ https://issues.apache.org/jira/browse/SPARK-5689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5689: --- Assignee: Apache Spark Document what can be run in different YARN modes Key: SPARK-5689 URL: https://issues.apache.org/jira/browse/SPARK-5689 Project: Spark Issue Type: Documentation Components: YARN Affects Versions: 1.1.0 Reporter: Thomas Graves Assignee: Apache Spark We should document what can be run in the different YARN modes. For instance, the interactive shell only works in yarn-client mode; recently, with https://github.com/apache/spark/pull/3976, users can run Python scripts in cluster mode, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.
[ https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492364#comment-14492364 ] Micael Capitão commented on SPARK-6800: --- The above pull request seems to fix only the upper and lower bounds issue. There is still the intermediate-queries issue, which may result in repeated rows being fetched from the DB. Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results. -- Key: SPARK-6800 URL: https://issues.apache.org/jira/browse/SPARK-6800 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: Windows 8.1, Apache Derby DB, Spark 1.3.0 CDH5.4.0, Scala 2.10 Reporter: Micael Capitão Having a Derby table with people info (id, name, age) defined like this:
{code}
val jdbcUrl = "jdbc:derby:memory:PeopleDB;create=true"
val conn = DriverManager.getConnection(jdbcUrl)
val stmt = conn.createStatement()
stmt.execute("CREATE TABLE Person (person_id INT NOT NULL GENERATED ALWAYS AS IDENTITY CONSTRAINT person_pk PRIMARY KEY, name VARCHAR(50), age INT)")
stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Carvalho', 50)")
stmt.execute("INSERT INTO Person(name, age) VALUES('Lurdes Pereira', 23)")
stmt.execute("INSERT INTO Person(name, age) VALUES('Ana Rita Costa', 12)")
stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Pereira', 32)")
stmt.execute("INSERT INTO Person(name, age) VALUES('Miguel Costa', 15)")
stmt.execute("INSERT INTO Person(name, age) VALUES('Anabela Sintra', 13)")
{code}
If I try to read that table from Spark SQL with lower/upper bounds, like this:
{code}
val people = sqlContext.jdbc(url = jdbcUrl, table = "Person",
  columnName = "age", lowerBound = 0, upperBound = 40, numPartitions = 10)
people.show()
{code}
I get this result:
{noformat}
PERSON_ID NAME             AGE
3         Ana Rita Costa   12
5         Miguel Costa     15
6         Anabela Sintra   13
2         Lurdes Pereira   23
4         Armando Pereira  32
1         Armando Carvalho 50
{noformat}
Which is wrong, considering the defined upper bound has been ignored (I get a person with age 50!). Digging into the code, I've found that in {{JDBCRelation.columnPartition}} the WHERE clauses it generates are the following:
{code}
(0) age < 4, 0
(1) age >= 4 AND age < 8, 1
(2) age >= 8 AND age < 12, 2
(3) age >= 12 AND age < 16, 3
(4) age >= 16 AND age < 20, 4
(5) age >= 20 AND age < 24, 5
(6) age >= 24 AND age < 28, 6
(7) age >= 28 AND age < 32, 7
(8) age >= 32 AND age < 36, 8
(9) age >= 36, 9
{code}
The last condition ignores the upper bound, and the other ones may result in repeated rows being read. Using the JdbcRDD (and converting it to a DataFrame) I would have something like this:
{code}
val jdbcRdd = new JdbcRDD(sc,
  () => DriverManager.getConnection(jdbcUrl),
  "SELECT * FROM Person WHERE age >= ? AND age <= ?",
  0, 40, 10,
  rs => (rs.getInt(1), rs.getString(2), rs.getInt(3)))
val people = jdbcRdd.toDF("PERSON_ID", "NAME", "AGE")
people.show()
{code}
Resulting in:
{noformat}
PERSON_ID NAME            AGE
3         Ana Rita Costa  12
5         Miguel Costa    15
6         Anabela Sintra  13
2         Lurdes Pereira  23
4         Armando Pereira 32
{noformat}
Which is correct! Confirming the WHERE clauses generated by the JdbcRDD in {{getPartitions}}, I've found it generates the following:
{code}
(0) age >= 0 AND age <= 3
(1) age >= 4 AND age <= 7
(2) age >= 8 AND age <= 11
(3) age >= 12 AND age <= 15
(4) age >= 16 AND age <= 19
(5) age >= 20 AND age <= 23
(6) age >= 24 AND age <= 27
(7) age >= 28 AND age <= 31
(8) age >= 32 AND age <= 35
(9) age >= 36 AND age <= 40
{code}
This is the behaviour I was expecting from the Spark SQL version.
Is the Spark SQL version buggy or is this some weird expected behaviour? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
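For reference, the inclusive-bounds arithmetic that produces the correct clauses above can be sketched as follows. This is a minimal illustration consistent with the JdbcRDD output shown in the report, not the actual Spark source; the function name is made up.
{code}
// Sketch of inclusive range partitioning over [lower, upper] (names assumed).
def partitionClauses(lower: Long, upper: Long, numPartitions: Int, col: String): Seq[String] = {
  val length = 1 + upper - lower
  (0 until numPartitions).map { i =>
    val start = lower + (i * length) / numPartitions
    val end   = lower + ((i + 1) * length) / numPartitions - 1
    s"$col >= $start AND $col <= $end"
  }
}

// partitionClauses(0, 40, 10, "age") yields "age >= 0 AND age <= 3" through
// "age >= 36 AND age <= 40": the whole [0, 40] range, no gaps, no overlaps.
{code}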
[jira] [Commented] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark
[ https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492346#comment-14492346 ] Sean Owen commented on SPARK-4783: -- I have a PR ready, but am testing it. I am seeing test failures but am not sure if they're related. You are also welcome to go ahead with a PR if you think you have a handle on it, and I can chime in with what I know. System.exit() calls in SparkContext disrupt applications embedding Spark Key: SPARK-4783 URL: https://issues.apache.org/jira/browse/SPARK-4783 Project: Spark Issue Type: Bug Components: Spark Core Reporter: David Semeria A common architectural choice for integrating Spark within a larger application is to employ a gateway to handle Spark jobs. The gateway is a server which contains one or more long-running SparkContexts. A typical server is created with the following pseudo code:
var continue = true
while (continue) {
  try {
    server.run()
  } catch (e) {
    continue = log_and_examine_error(e)
  }
}
The problem is that SparkContext frequently calls System.exit when it encounters a problem, which means the server can only be re-spawned at the process level; that is much messier than the simple loop above. Therefore, I believe it makes sense to replace all System.exit calls in SparkContext with the throwing of a fatal error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
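Rendered as runnable Scala, the gateway loop from the description might look like the sketch below. The server and error-handling bodies are placeholder stubs invented for this example, not part of any real API:
{code}
object Gateway {
  // Placeholder for the long-running server that submits Spark jobs.
  def run(): Unit = { /* serve requests against a long-running SparkContext */ }

  // Decide whether the failure is recoverable; true means keep serving.
  def logAndExamineError(e: Throwable): Boolean = {
    e.printStackTrace()
    !e.isInstanceOf[Error] // give up on fatal errors, retry otherwise
  }

  def main(args: Array[String]): Unit = {
    var keepRunning = true
    while (keepRunning) {
      try run()
      catch { case e: Throwable => keepRunning = logAndExamineError(e) }
    }
  }
}
{code}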
[jira] [Assigned] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark
[ https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4783: --- Assignee: Apache Spark System.exit() calls in SparkContext disrupt applications embedding Spark Key: SPARK-4783 URL: https://issues.apache.org/jira/browse/SPARK-4783 Project: Spark Issue Type: Bug Components: Spark Core Reporter: David Semeria Assignee: Apache Spark A common architectural choice for integrating Spark within a larger application is to employ a gateway to handle Spark jobs. The gateway is a server which contains one or more long-running SparkContexts. A typical server is created with the following pseudo code:
var continue = true
while (continue) {
  try {
    server.run()
  } catch (e) {
    continue = log_and_examine_error(e)
  }
}
The problem is that SparkContext frequently calls System.exit when it encounters a problem, which means the server can only be re-spawned at the process level; that is much messier than the simple loop above. Therefore, I believe it makes sense to replace all System.exit calls in SparkContext with the throwing of a fatal error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark
[ https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4783: --- Assignee: (was: Apache Spark) System.exit() calls in SparkContext disrupt applications embedding Spark Key: SPARK-4783 URL: https://issues.apache.org/jira/browse/SPARK-4783 Project: Spark Issue Type: Bug Components: Spark Core Reporter: David Semeria A common architectural choice for integrating Spark within a larger application is to employ a gateway to handle Spark jobs. The gateway is a server which contains one or more long-running SparkContexts. A typical server is created with the following pseudo code:
var continue = true
while (continue) {
  try {
    server.run()
  } catch (e) {
    continue = log_and_examine_error(e)
  }
}
The problem is that SparkContext frequently calls System.exit when it encounters a problem, which means the server can only be re-spawned at the process level; that is much messier than the simple loop above. Therefore, I believe it makes sense to replace all System.exit calls in SparkContext with the throwing of a fatal error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark
[ https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492393#comment-14492393 ] Apache Spark commented on SPARK-4783: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/5492 System.exit() calls in SparkContext disrupt applications embedding Spark Key: SPARK-4783 URL: https://issues.apache.org/jira/browse/SPARK-4783 Project: Spark Issue Type: Bug Components: Spark Core Reporter: David Semeria A common architectural choice for integrating Spark within a larger application is to employ a gateway to handle Spark jobs. The gateway is a server which contains one or more long-running SparkContexts. A typical server is created with the following pseudo code:
var continue = true
while (continue) {
  try {
    server.run()
  } catch (e) {
    continue = log_and_examine_error(e)
  }
}
The problem is that SparkContext frequently calls System.exit when it encounters a problem, which means the server can only be re-spawned at the process level; that is much messier than the simple loop above. Therefore, I believe it makes sense to replace all System.exit calls in SparkContext with the throwing of a fatal error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492618#comment-14492618 ] Yin Huai commented on SPARK-5791: - [~jameszhouyi] Thank you for the update :) For Hive, it also used Parquet in your last run, right? [Spark SQL] show poor performance when multiple table do join operation --- Key: SPARK-5791 URL: https://issues.apache.org/jira/browse/SPARK-5791 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yi Zhou Attachments: Physcial_Plan_Hive.txt, Physcial_Plan_SparkSQL_Updated.txt, Physical_Plan.txt Spark SQL shows poor performance when multiple tables do join operation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6880) Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD
[ https://issues.apache.org/jira/browse/SPARK-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492622#comment-14492622 ] Apache Spark commented on SPARK-6880: - User 'pankajarora12' has created a pull request for this issue: https://github.com/apache/spark/pull/5494 Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD -- Key: SPARK-6880 URL: https://issues.apache.org/jira/browse/SPARK-6880 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: CentOs6.0, java7 Reporter: pankaj arora Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDDs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6882) Spark ThriftServer2 Kerberos failed encountering java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]
Andrew Lee created SPARK-6882: - Summary: Spark ThriftServer2 Kerberos failed encountering java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth] Key: SPARK-6882 URL: https://issues.apache.org/jira/browse/SPARK-6882 Project: Spark Issue Type: Bug Affects Versions: 1.3.0, 1.2.1 Environment: * Apache Hadoop 2.4.1 with Kerberos Enabled * Apache Hive 0.13.1 * Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97 * Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851 Reporter: Andrew Lee When Kerberos is enabled, I get the following exception.
{code}
2015-03-13 18:26:05,363 ERROR org.apache.hive.service.cli.thrift.ThriftCLIService (ThriftBinaryCLIService.java:run(93)) - Error: java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]
{code}
I tried it in * Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97 * Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851 with * Apache Hive 0.13.1 * Apache Hadoop 2.4.1. Build command:
{code}
mvn -U -X -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Phive-thriftserver -Dhadoop.version=2.4.1 -Dyarn.version=2.4.1 -Dhive.version=0.13.1 -DskipTests install
{code}
When starting the Spark ThriftServer in {{yarn-client}} mode, the command looks like this:
{code}
./start-thriftserver.sh --hiveconf hive.server2.thrift.port=2 --hiveconf hive.server2.thrift.bind.host=$(hostname) --master yarn-client
{code}
{{hostname}} points to the current hostname of the machine I'm using. Error message in {{spark.log}} from Spark 1.2.1 (1.2 rc1):
{code}
2015-03-13 18:26:05,363 ERROR org.apache.hive.service.cli.thrift.ThriftCLIService (ThriftBinaryCLIService.java:run(93)) - Error: java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]
	at org.apache.hive.service.auth.SaslQOP.fromString(SaslQOP.java:56)
	at org.apache.hive.service.auth.HiveAuthFactory.getSaslProperties(HiveAuthFactory.java:118)
	at org.apache.hive.service.auth.HiveAuthFactory.getAuthTransFactory(HiveAuthFactory.java:133)
	at org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:43)
	at java.lang.Thread.run(Thread.java:744)
{code}
I'm wondering if this is the same problem described in HIVE-8154 and HIVE-7620, due to an older code base for the Spark ThriftServer? Any insights are appreciated. Currently, I can't get Spark ThriftServer2 to run against a Kerberos cluster (Apache 2.4.1). My hive-site.xml in spark/conf looks like the following. The Kerberos keytab and TGT are configured correctly; I'm able to connect to the metastore, but the subsequent steps fail with the exception above.
{code}
<property>
  <name>hive.semantic.analyzer.factory.impl</name>
  <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
</property>
<property>
  <name>hive.metastore.execute.setugi</name>
  <value>true</value>
</property>
<property>
  <name>hive.stats.autogather</name>
  <value>false</value>
</property>
<property>
  <name>hive.session.history.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.querylog.location</name>
  <value>/tmp/home/hive/log/${user.name}</value>
</property>
<property>
  <name>hive.exec.local.scratchdir</name>
  <value>/tmp/hive/scratch/${user.name}</value>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://somehostname:9083</value>
</property>
<!-- HIVE SERVER 2 -->
<property>
  <name>hive.server2.authentication</name>
  <value>KERBEROS</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.principal</name>
  <value>***</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.keytab</name>
  <value>***</value>
</property>
<property>
  <name>hive.server2.thrift.sasl.qop</name>
  <value>auth</value>
  <description>Sasl QOP value; one of 'auth', 'auth-int' and 'auth-conf'</description>
</property>
<property>
  <name>hive.server2.enable.impersonation</name>
  <description>Enable user impersonation for HiveServer2</description>
  <value>true</value>
</property>
<!-- HIVE METASTORE -->
<property>
  <name>hive.metastore.sasl.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.metastore.kerberos.keytab.file</name>
  <value>***</value>
</property>
<property>
  <name>hive.metastore.kerberos.principal</name>
  <value>***</value>
</property>
<property>
  <name>hive.metastore.cache.pinobjtypes</name>
  <value>Table,Database,Type,FieldSchema,Order</value>
</property>
<property>
  <name>hdfs_sentinel_file</name>
  <value>***</value>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/hive</value>
</property>
<property>
  <name>hive.metastore.client.socket.timeout</name>
  <value>600</value>
</property>
<property>
  <name>hive.warehouse.subdir.inherit.perms</name>
  <value>true</value>
</property>
{code}
Here, I'm attaching more detailed logs from Spark 1.3 rc1.
{code}
2015-04-13 16:37:20,688 INFO
[jira] [Assigned] (SPARK-6880) Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD
[ https://issues.apache.org/jira/browse/SPARK-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6880: --- Assignee: Apache Spark Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD -- Key: SPARK-6880 URL: https://issues.apache.org/jira/browse/SPARK-6880 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: CentOs6.0, java7 Reporter: pankaj arora Assignee: Apache Spark Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDDs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6880) Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD
[ https://issues.apache.org/jira/browse/SPARK-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6880: --- Assignee: (was: Apache Spark) Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD -- Key: SPARK-6880 URL: https://issues.apache.org/jira/browse/SPARK-6880 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: CentOs6.0, java7 Reporter: pankaj arora Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDDs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6880) Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD
[ https://issues.apache.org/jira/browse/SPARK-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pankaj arora updated SPARK-6880: Description: Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDDs Below is the stack trace:
15/03/27 11:12:43 ERROR DAGSchedulerActorSupervisor: eventProcesserActor failed; shutting down SparkContext
java.util.NoSuchElementException: key not found: 28
	at scala.collection.MapLike$class.default(MapLike.scala:228)
	at scala.collection.AbstractMap.default(Map.scala:58)
	at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:808)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
	at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:762)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1389)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
was: Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDDs
Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD -- Key: SPARK-6880 URL: https://issues.apache.org/jira/browse/SPARK-6880 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: CentOs6.0, java7 Reporter: pankaj arora Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDDs Below is the stack trace:
15/03/27 11:12:43 ERROR DAGSchedulerActorSupervisor: eventProcesserActor failed; shutting down SparkContext
java.util.NoSuchElementException: key not found: 28
	at scala.collection.MapLike$class.default(MapLike.scala:228)
	at scala.collection.AbstractMap.default(Map.scala:58)
	at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:808)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
	at
[jira] [Commented] (SPARK-6880) Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD
[ https://issues.apache.org/jira/browse/SPARK-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492642#comment-14492642 ] pankaj arora commented on SPARK-6880: - Sean, sorry for the missing stack trace. I've added it to the description. Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD -- Key: SPARK-6880 URL: https://issues.apache.org/jira/browse/SPARK-6880 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: CentOs6.0, java7 Reporter: pankaj arora Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDDs Below is the stack trace:
15/03/27 11:12:43 ERROR DAGSchedulerActorSupervisor: eventProcesserActor failed; shutting down SparkContext
java.util.NoSuchElementException: key not found: 28
	at scala.collection.MapLike$class.default(MapLike.scala:228)
	at scala.collection.AbstractMap.default(Map.scala:58)
	at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:808)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
	at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:762)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1389)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6823) Add a model.matrix like capability to DataFrames (modelDataFrame)
[ https://issues.apache.org/jira/browse/SPARK-6823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492661#comment-14492661 ] Shivaram Venkataraman commented on SPARK-6823: -- I think the goal of the original JIRA on SparkR was to have a high-level API that will allow users to express this. We could have this higher-level API on DataFrames, or just provide a wrapper around OneHotEncoder + VectorAssembler in the SparkR ML integration work. The second option sounds better to me, but [~cafreeman] and Dan Putler have been looking at this and might be able to add more. Add a model.matrix like capability to DataFrames (modelDataFrame) - Key: SPARK-6823 URL: https://issues.apache.org/jira/browse/SPARK-6823 Project: Spark Issue Type: New Feature Components: ML, SparkR Reporter: Shivaram Venkataraman Currently MLlib modeling tools work only with double data. However, data tables in practice often have a set of categorical fields (factors in R) that need to be converted to a set of 0/1 indicator variables, making the data actually used in a modeling algorithm completely numeric. In R, this is handled in modeling functions using the model.matrix function. Similar functionality needs to be available within Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
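For context, a wrapper of the kind mentioned in the comment could be composed from the spark.ml feature transformers roughly as follows. This is a hedged sketch against the 1.x-era spark.ml API; the column names and the input DataFrame {{df}} are assumptions, not part of the proposal itself.
{code}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// df is assumed: a DataFrame with a numeric "age" column and a categorical "state" column.
val indexer = new StringIndexer().setInputCol("state").setOutputCol("stateIndex")
val encoder = new OneHotEncoder().setInputCol("stateIndex").setOutputCol("stateVec")
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "stateVec"))
  .setOutputCol("features")

// Index the categorical column, expand it into 0/1 indicators, then assemble
// everything into a single numeric feature vector (model.matrix-style).
val indexed = indexer.fit(df).transform(df)
val encoded = encoder.transform(indexed) // OneHotEncoder was a plain Transformer in early spark.ml
val modelMatrix = assembler.transform(encoded)
{code}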
[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows
[ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492667#comment-14492667 ] Cheng Lian commented on SPARK-6859: --- [~rdblue] pointed out a fact that I missed in PARQUET-251: we need to work out a way to ignore (binary) min/max stats for all existing data. So on the Spark SQL side, we have to disable filter push-down for binary columns. Parquet File Binary column statistics error when reuse byte[] among rows Key: SPARK-6859 URL: https://issues.apache.org/jira/browse/SPARK-6859 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0 Reporter: Yijie Shen Priority: Minor Suppose I create a dataRDD which extends RDD[Row], and each row is GenericMutableRow(Array(Int, Array[Byte])). The same Array[Byte] object is reused across rows but holds different content each time. When I convert it to a DataFrame and save it as a Parquet file, the file's row-group statistics (max/min) for the Binary column are wrong.
Here is the reason: in Parquet, BinaryStatistics keeps max/min as parquet.io.api.Binary references, and Spark SQL generates a new Binary backed by the same Array[Byte] passed from the row:
max: Binary --(reference)--> ByteArrayBackedBinary --(backed by)--> Array[Byte]
Therefore, each time Parquet updates the row group's statistics, max and min still refer to the same Array[Byte], whose content has changed in the meantime. When Parquet writes them to the file, the last row's content is saved as both max and min.
It seems to be a Parquet bug, because it is Parquet's responsibility to update statistics correctly, but I'm not quite sure. Should I report it as a bug in the Parquet JIRA? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
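The aliasing problem described above can be shown with a self-contained sketch. The names are made up; this is not the Parquet or Spark code, just the same reference-versus-copy mistake in miniature:
{code}
// A toy "statistics" object that keeps a reference to the candidate buffer
// instead of copying it, mirroring the bug described in the report.
object StatsAliasingDemo {
  var max: Array[Byte] = _

  def update(candidate: Array[Byte]): Unit = {
    if (max == null || candidate(0) > max(0)) max = candidate // keeps the reference!
  }

  def main(args: Array[String]): Unit = {
    val buf = Array[Byte](9)   // row 1 content: 9
    update(buf)
    buf(0) = 1                 // the same buffer is reused for row 2 content: 1
    update(buf)
    println(max(0))            // prints 1, not 9: the recorded max was silently rewritten
  }
}
{code}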
[jira] [Updated] (SPARK-6882) Spark ThriftServer2 Kerberos failed encountering java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]
[ https://issues.apache.org/jira/browse/SPARK-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6882: - Component/s: SQL Spark ThriftServer2 Kerberos failed encountering java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth] Key: SPARK-6882 URL: https://issues.apache.org/jira/browse/SPARK-6882 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1, 1.3.0 Environment: * Apache Hadoop 2.4.1 with Kerberos Enabled * Apache Hive 0.13.1 * Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97 * Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851 Reporter: Andrew Lee When Kerberos is enabled, I get the following exception.
{code}
2015-03-13 18:26:05,363 ERROR org.apache.hive.service.cli.thrift.ThriftCLIService (ThriftBinaryCLIService.java:run(93)) - Error: java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]
{code}
I tried it in * Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97 * Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851 with * Apache Hive 0.13.1 * Apache Hadoop 2.4.1. Build command:
{code}
mvn -U -X -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Phive-thriftserver -Dhadoop.version=2.4.1 -Dyarn.version=2.4.1 -Dhive.version=0.13.1 -DskipTests install
{code}
When starting the Spark ThriftServer in {{yarn-client}} mode, the command looks like this:
{code}
./start-thriftserver.sh --hiveconf hive.server2.thrift.port=2 --hiveconf hive.server2.thrift.bind.host=$(hostname) --master yarn-client
{code}
{{hostname}} points to the current hostname of the machine I'm using. Error message in {{spark.log}} from Spark 1.2.1 (1.2 rc1):
{code}
2015-03-13 18:26:05,363 ERROR org.apache.hive.service.cli.thrift.ThriftCLIService (ThriftBinaryCLIService.java:run(93)) - Error: java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]
	at org.apache.hive.service.auth.SaslQOP.fromString(SaslQOP.java:56)
	at org.apache.hive.service.auth.HiveAuthFactory.getSaslProperties(HiveAuthFactory.java:118)
	at org.apache.hive.service.auth.HiveAuthFactory.getAuthTransFactory(HiveAuthFactory.java:133)
	at org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:43)
	at java.lang.Thread.run(Thread.java:744)
{code}
I'm wondering if this is the same problem described in HIVE-8154 and HIVE-7620, due to an older code base for the Spark ThriftServer? Any insights are appreciated. Currently, I can't get Spark ThriftServer2 to run against a Kerberos cluster (Apache 2.4.1). My hive-site.xml in spark/conf looks like the following. The Kerberos keytab and TGT are configured correctly; I'm able to connect to the metastore, but the subsequent steps fail with the exception above.
{code}
<property>
  <name>hive.semantic.analyzer.factory.impl</name>
  <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
</property>
<property>
  <name>hive.metastore.execute.setugi</name>
  <value>true</value>
</property>
<property>
  <name>hive.stats.autogather</name>
  <value>false</value>
</property>
<property>
  <name>hive.session.history.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.querylog.location</name>
  <value>/tmp/home/hive/log/${user.name}</value>
</property>
<property>
  <name>hive.exec.local.scratchdir</name>
  <value>/tmp/hive/scratch/${user.name}</value>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://somehostname:9083</value>
</property>
<!-- HIVE SERVER 2 -->
<property>
  <name>hive.server2.authentication</name>
  <value>KERBEROS</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.principal</name>
  <value>***</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.keytab</name>
  <value>***</value>
</property>
<property>
  <name>hive.server2.thrift.sasl.qop</name>
  <value>auth</value>
  <description>Sasl QOP value; one of 'auth', 'auth-int' and 'auth-conf'</description>
</property>
<property>
  <name>hive.server2.enable.impersonation</name>
  <description>Enable user impersonation for HiveServer2</description>
  <value>true</value>
</property>
<!-- HIVE METASTORE -->
<property>
  <name>hive.metastore.sasl.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.metastore.kerberos.keytab.file</name>
  <value>***</value>
</property>
<property>
  <name>hive.metastore.kerberos.principal</name>
  <value>***</value>
</property>
<property>
  <name>hive.metastore.cache.pinobjtypes</name>
  <value>Table,Database,Type,FieldSchema,Order</value>
[jira] [Resolved] (SPARK-6765) Turn scalastyle on for test code
[ https://issues.apache.org/jira/browse/SPARK-6765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-6765. Resolution: Fixed Fix Version/s: 1.4.0 Turn scalastyle on for test code Key: SPARK-6765 URL: https://issues.apache.org/jira/browse/SPARK-6765 Project: Spark Issue Type: Improvement Components: Project Infra, Tests Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.4.0 We should turn scalastyle on for test code. Test code should be as important as main code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6880) Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD
pankaj arora created SPARK-6880: --- Summary: Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD Key: SPARK-6880 URL: https://issues.apache.org/jira/browse/SPARK-6880 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: CentOs6.0, java7 Reporter: pankaj arora Fix For: 1.3.2 Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDDs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6881) Change the checkpoint directory name from checkpoints to checkpoint
Hao created SPARK-6881: -- Summary: Change the checkpoint directory name from checkpoints to checkpoint Key: SPARK-6881 URL: https://issues.apache.org/jira/browse/SPARK-6881 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Hao Priority: Trivial The directory should be named {{checkpoint}} instead of {{checkpoints}}, since {{checkpoint}} is the name included in .gitignore -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6881) Change the checkpoint directory name from checkpoints to checkpoint
[ https://issues.apache.org/jira/browse/SPARK-6881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6881: --- Assignee: Apache Spark Change the checkpoint directory name from checkpoints to checkpoint --- Key: SPARK-6881 URL: https://issues.apache.org/jira/browse/SPARK-6881 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Hao Assignee: Apache Spark Priority: Trivial The directory should be named {{checkpoint}} instead of {{checkpoints}}, since {{checkpoint}} is the name included in .gitignore -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6881) Change the checkpoint directory name from checkpoints to checkpoint
[ https://issues.apache.org/jira/browse/SPARK-6881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492548#comment-14492548 ] Apache Spark commented on SPARK-6881: - User 'hlin09' has created a pull request for this issue: https://github.com/apache/spark/pull/5493 Change the checkpoint directory name from checkpoints to checkpoint --- Key: SPARK-6881 URL: https://issues.apache.org/jira/browse/SPARK-6881 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Hao Priority: Trivial The directory should be named {{checkpoint}} instead of {{checkpoints}}, since {{checkpoint}} is the name included in .gitignore -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6881) Change the checkpoint directory name from checkpoints to checkpoint
[ https://issues.apache.org/jira/browse/SPARK-6881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6881: --- Assignee: (was: Apache Spark) Change the checkpoint directory name from checkpoints to checkpoint --- Key: SPARK-6881 URL: https://issues.apache.org/jira/browse/SPARK-6881 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Hao Priority: Trivial The directory should be named {{checkpoint}} instead of {{checkpoints}}, since {{checkpoint}} is the name included in .gitignore -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6880) Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD
[ https://issues.apache.org/jira/browse/SPARK-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6880: - Target Version/s: (was: 1.3.2) Fix Version/s: (was: 1.3.2) (Don't assign Target / Fix Version) This is not a valid JIRA, as there is no detail. If you intend to add detail later, OK, but please next time wait until you have all of that information ready before opening a JIRA. Otherwise I'm going to close this. Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD -- Key: SPARK-6880 URL: https://issues.apache.org/jira/browse/SPARK-6880 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: CentOs6.0, java7 Reporter: pankaj arora Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDDs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org