[jira] [Commented] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields
[ https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731838#comment-14731838 ] Joseph K. Bradley commented on SPARK-9961: -- By "evaluator," I mean the Evaluator types in spark.ml.evaluation, which can be used for model selection. > ML prediction abstractions should have defaultEvaluator fields > -- > > Key: SPARK-9961 > URL: https://issues.apache.org/jira/browse/SPARK-9961 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > Predictor and PredictionModel should have abstract defaultEvaluator methods > which return Evaluators. Subclasses like Regressor, Classifier, etc. should > all provide natural evaluators, set to use the correct input columns and > metrics. Concrete classes may later be modified to > The initial implementation should be marked as DeveloperApi since we may need > to change the defaults later on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
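The proposed API shape can be sketched in miniature in plain Python (this is illustrative only, not Spark's actual Scala API; the class and method names `RMSEEvaluator` and `default_evaluator` are made up for the sketch):

```python
from abc import ABC, abstractmethod

class Evaluator(ABC):
    """Scores a model's predictions; analogue of spark.ml.evaluation.Evaluator."""
    @abstractmethod
    def evaluate(self, predictions):
        """Score a list of (label, prediction) pairs."""

class RMSEEvaluator(Evaluator):
    """Root-mean-squared error, a natural default metric for regression."""
    def evaluate(self, predictions):
        n = len(predictions)
        return (sum((label - pred) ** 2 for label, pred in predictions) / n) ** 0.5

class Regressor(ABC):
    def default_evaluator(self):
        # subclasses inherit a natural, preconfigured evaluator
        return RMSEEvaluator()

class LinearRegressor(Regressor):
    pass

# model selection code can now ask any regressor for its default metric
evaluator = LinearRegressor().default_evaluator()
```

The point of the abstraction is the last line: tuning code can obtain a sensible evaluator without knowing which concrete model it holds.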
[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save
[ https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731834#comment-14731834 ] Joseph K. Bradley commented on SPARK-10199: --- I agree that the work required for these changes is large compared to the small gains for most use cases. I could imagine allocating time to get this merged at some point in the future, but I don't think it can be prioritized right now. I'd recommend keeping your code branch for the future, but closing the PR and marking this JIRA to be addressed later. > Avoid using reflections for parquet model save > -- > > Key: SPARK-10199 > URL: https://issues.apache.org/jira/browse/SPARK-10199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Feynman Liang >Priority: Minor > > These items are not high priority since the overhead of writing to Parquet is > much greater than that of runtime reflection. > Multiple model save/load implementations in MLlib use case classes to infer a schema for the > data frame saved to Parquet. However, inferring a schema from case classes or > tuples uses [runtime > reflection|https://github.com/apache/spark/blob/d7b4c095271c36fcc7f9ded267ecf5ec66fac803/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L361], > which is unnecessary since the types are already known at the time `save` is > called. > It would be better to specify the schema for the data frame directly > using {{sqlContext.createDataFrame(dataRDD, schema)}}
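The trade-off the issue describes, inferring types by inspecting values at runtime versus declaring a schema that is already known, can be illustrated outside Spark with a pure-Python sketch (the field names here are invented for the example):

```python
def infer_schema(rows):
    # runtime "reflection": discover column types by inspecting
    # the values of the first row
    return {name: type(value) for name, value in rows[0].items()}

# The model writer already knows its column types at the time save()
# is called, so it can declare the schema directly and skip inspection.
KNOWN_SCHEMA = {"clusterIndex": int, "clusterCenter": list}

rows = [{"clusterIndex": 0, "clusterCenter": [0.1, 0.2]}]
assert infer_schema(rows) == KNOWN_SCHEMA
```

Both paths produce the same schema; declaring it up front simply avoids the inspection work, which is the gain (small relative to Parquet I/O) weighed in the comment above.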
[jira] [Commented] (SPARK-5337) respect spark.task.cpus when launch executors
[ https://issues.apache.org/jira/browse/SPARK-5337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731826#comment-14731826 ] Apache Spark commented on SPARK-5337: - User 'CodingCat' has created a pull request for this issue: https://github.com/apache/spark/pull/8610 > respect spark.task.cpus when launch executors > - > > Key: SPARK-5337 > URL: https://issues.apache.org/jira/browse/SPARK-5337 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.0.0 >Reporter: Tao Wang > > In standalone mode, we did not respect spark.task.cpus when launching executors. > Some executors would not have enough cores to launch a single task.
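The bug reduces to simple arithmetic: an executor can run floor(executor cores / spark.task.cpus) tasks concurrently, so an executor granted fewer cores than spark.task.cpus can never launch a task. A minimal sketch of that calculation (illustrative, not the scheduler's code):

```python
def concurrent_task_slots(executor_cores, task_cpus):
    # number of tasks an executor can actually run at once when each
    # task requires task_cpus cores (spark.task.cpus)
    return executor_cores // task_cpus

# An executor granted 1 core with spark.task.cpus=2 is useless: 0 slots.
# One granted 3 cores runs only 1 task and wastes a core.
stranded = concurrent_task_slots(1, 2)   # 0 -- the bug described above
partial = concurrent_task_slots(3, 2)    # 1, with one idle core
```

Respecting spark.task.cpus at launch time means only handing out core allocations that are usable multiples of the per-task requirement.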
[jira] [Comment Edited] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields
[ https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731810#comment-14731810 ] George Dittmar edited comment on SPARK-9961 at 9/5/15 5:23 AM: --- Can you expand on what you mean by Evaluator? Just looking for something to eval how good predictions are? was (Author: georgedittmar): Can you expand on what you mean by Evaluator? Just looking for something to eval how good predictions are? > ML prediction abstractions should have defaultEvaluator fields > -- > > Key: SPARK-9961 > URL: https://issues.apache.org/jira/browse/SPARK-9961 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > Predictor and PredictionModel should have abstract defaultEvaluator methods > which return Evaluators. Subclasses like Regressor, Classifier, etc. should > all provide natural evaluators, set to use the correct input columns and > metrics. Concrete classes may later be modified to > The initial implementation should be marked as DeveloperApi since we may need > to change the defaults later on.
[jira] [Commented] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields
[ https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731810#comment-14731810 ] George Dittmar commented on SPARK-9961: --- Can you expand on what you mean by Evaluator? Just looking for something to eval how good predictions are? > ML prediction abstractions should have defaultEvaluator fields > -- > > Key: SPARK-9961 > URL: https://issues.apache.org/jira/browse/SPARK-9961 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > Predictor and PredictionModel should have abstract defaultEvaluator methods > which return Evaluators. Subclasses like Regressor, Classifier, etc. should > all provide natural evaluators, set to use the correct input columns and > metrics. Concrete classes may later be modified to > The initial implementation should be marked as DeveloperApi since we may need > to change the defaults later on.
[jira] [Assigned] (SPARK-8632) Poor Python UDF performance because of RDD caching
[ https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-8632: - Assignee: Davies Liu > Poor Python UDF performance because of RDD caching > -- > > Key: SPARK-8632 > URL: https://issues.apache.org/jira/browse/SPARK-8632 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.0 >Reporter: Justin Uang >Assignee: Davies Liu > > {quote} > We have been running into performance problems using Python UDFs with > DataFrames at large scale. > From the implementation of BatchPythonEvaluation, it looks like the goal was > to reuse the PythonRDD code. It caches the entire child RDD so that it can do > two passes over the data. One to give to the PythonRDD, then one to join the > python lambda results with the original row (which may have java objects that > should be passed through). > In addition, it caches all the columns, even the ones that don't need to be > processed by the Python UDF. In the cases I was working with, I had a 500 > column table, and I wanted to use a python UDF for one column, and it ended > up caching all 500 columns. > {quote} > http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html
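The column-pruning idea implicit in the complaint (a 500-column table cached to evaluate a UDF on one column) can be sketched in plain Python, not Spark's implementation: extract only the column the UDF needs, evaluate it, then zip the results back onto the original rows.

```python
def apply_udf_pruned(rows, udf, col, out_col):
    """Evaluate udf over just rows[*][col]; never materialize a copy
    of the other columns for the UDF evaluation pass."""
    # only the needed column crosses the (hypothetical) JVM/Python boundary
    results = [udf(row[col]) for row in rows]
    # join results back with the untouched original rows
    return [dict(row, **{out_col: r}) for row, r in zip(rows, results)]

rows = [{"a": 1, "b": 2}, {"a": 10, "b": 20}]
out = apply_udf_pruned(rows, lambda x: x + 1, "a", "a_plus_one")
```

The original rows (including columns the UDF never touches) are passed through unchanged, which is the behavior BatchPythonEvaluation's full-RDD caching paid for much more expensively.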
[jira] [Created] (SPARK-10459) PythonUDF could process UnsafeRow
Davies Liu created SPARK-10459: -- Summary: PythonUDF could process UnsafeRow Key: SPARK-10459 URL: https://issues.apache.org/jira/browse/SPARK-10459 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Currently, a ConvertToSafe step is inserted for PythonUDF, which is not actually needed.
[jira] [Commented] (SPARK-10414) DenseMatrix gives different hashcode even though equals returns true
[ https://issues.apache.org/jira/browse/SPARK-10414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731757#comment-14731757 ] Vinod KC commented on SPARK-10414: -- Thanks, got the JIRA id: https://issues.apache.org/jira/browse/SPARK-9919 > DenseMatrix gives different hashcode even though equals returns true > > > Key: SPARK-10414 > URL: https://issues.apache.org/jira/browse/SPARK-10414 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Vinod KC >Priority: Minor > > The hashCode implementation in DenseMatrix gives different results for the same input: > val dm = Matrices.dense(2, 2, Array(0.0, 1.0, 2.0, 3.0)) > val dm1 = Matrices.dense(2, 2, Array(0.0, 1.0, 2.0, 3.0)) > assert(dm1 === dm) // passed > assert(dm1.hashCode === dm.hashCode) // Failed > This violates the hashCode/equals contract.
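The contract being violated is the standard one: objects that compare equal must have equal hash codes. A toy Python analogue (not Spark's DenseMatrix) shows the usual way to keep the two consistent, by deriving the hash from exactly the fields the equality check compares:

```python
class DenseMatrix:
    """Toy stand-in for illustrating the hashCode/equals contract."""
    def __init__(self, num_rows, num_cols, values):
        self.num_rows, self.num_cols = num_rows, num_cols
        self.values = tuple(values)  # immutable, hence hashable

    def __eq__(self, other):
        return (isinstance(other, DenseMatrix)
                and (self.num_rows, self.num_cols, self.values)
                == (other.num_rows, other.num_cols, other.values))

    def __hash__(self):
        # hash exactly the fields compared in __eq__, nothing else
        return hash((self.num_rows, self.num_cols, self.values))

dm = DenseMatrix(2, 2, [0.0, 1.0, 2.0, 3.0])
dm1 = DenseMatrix(2, 2, [0.0, 1.0, 2.0, 3.0])
```

With this pairing, equal matrices necessarily hash alike, so they behave correctly as dictionary keys and set members.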
[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save
[ https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731753#comment-14731753 ] Vinod KC commented on SPARK-10199: -- [~mengxr] Thanks for the suggestion. Shall I close the PR? > Avoid using reflections for parquet model save > -- > > Key: SPARK-10199 > URL: https://issues.apache.org/jira/browse/SPARK-10199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Feynman Liang >Priority: Minor > > These items are not high priority since the overhead of writing to Parquet is > much greater than that of runtime reflection. > Multiple model save/load implementations in MLlib use case classes to infer a schema for the > data frame saved to Parquet. However, inferring a schema from case classes or > tuples uses [runtime > reflection|https://github.com/apache/spark/blob/d7b4c095271c36fcc7f9ded267ecf5ec66fac803/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L361], > which is unnecessary since the types are already known at the time `save` is > called. > It would be better to specify the schema for the data frame directly > using {{sqlContext.createDataFrame(dataRDD, schema)}}
[jira] [Commented] (SPARK-10414) DenseMatrix gives different hashcode even though equals returns true
[ https://issues.apache.org/jira/browse/SPARK-10414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731745#comment-14731745 ] Vinod KC commented on SPARK-10414: -- [~josephkb] Could you please share that existing JIRA id so I can review the PR? Thanks > DenseMatrix gives different hashcode even though equals returns true > > > Key: SPARK-10414 > URL: https://issues.apache.org/jira/browse/SPARK-10414 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Vinod KC >Priority: Minor > > The hashCode implementation in DenseMatrix gives different results for the same input: > val dm = Matrices.dense(2, 2, Array(0.0, 1.0, 2.0, 3.0)) > val dm1 = Matrices.dense(2, 2, Array(0.0, 1.0, 2.0, 3.0)) > assert(dm1 === dm) // passed > assert(dm1.hashCode === dm.hashCode) // Failed > This violates the hashCode/equals contract.
[jira] [Commented] (SPARK-7257) Find nearest neighbor satisfying predicate
[ https://issues.apache.org/jira/browse/SPARK-7257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731744#comment-14731744 ] Luvsandondov Lkhamsuren commented on SPARK-7257: This sounds very interesting! If I understood correctly, there may be multiple vertices satisfying the predicate (let's call that set P, a subset of V), and we want to find the vertices in P that are closest. Is it guaranteed that |P| << |V|? What is the use case you had in mind, [~josephkb]? > Find nearest neighbor satisfying predicate > -- > > Key: SPARK-7257 > URL: https://issues.apache.org/jira/browse/SPARK-7257 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: Joseph K. Bradley >Priority: Minor > > It would be useful to be able to find nearest neighbors satisfying > predicates. E.g.: > * Given one or more starting vertices, plus a predicate. > * Find the closest vertex or vertices satisfying the predicate. > This is different from ShortestPaths in that ShortestPaths searches for a > fixed (small) set of vertices, rather than all vertices satisfying a > predicate (which could be a large set). > It could be implemented using BFS from the initial vertex/vertices, though > faster implementations might also search from vertices satisfying the > predicate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
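The BFS approach mentioned in the description can be sketched for a small in-memory graph (plain Python, not a GraphX/Pregel implementation): expand level by level from the starting vertices and return the first vertex that satisfies the predicate, which by BFS's level order is also a closest one.

```python
from collections import deque

def nearest_satisfying(adj, starts, predicate):
    """BFS from `starts` over adjacency dict `adj`; return (vertex, distance)
    for a closest vertex satisfying `predicate`, or None if unreachable."""
    seen = set(starts)
    queue = deque((v, 0) for v in starts)
    while queue:
        v, dist = queue.popleft()
        if predicate(v):
            return v, dist  # BFS order guarantees this is a nearest match
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append((w, dist + 1))
    return None

adj = {1: [2, 3], 2: [4], 3: [], 4: []}
nearest_even = nearest_satisfying(adj, [1], lambda v: v % 2 == 0)
```

The comment's question about |P| versus |V| matters for the alternative strategy the issue mentions, searching backward from the predicate-satisfying set, which only pays off when that set is small.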
[jira] [Resolved] (SPARK-9925) Set SQLConf.SHUFFLE_PARTITIONS.key correctly for tests
[ https://issues.apache.org/jira/browse/SPARK-9925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-9925. -- Resolution: Fixed Fix Version/s: 1.6.0 > Set SQLConf.SHUFFLE_PARTITIONS.key correctly for tests > -- > > Key: SPARK-9925 > URL: https://issues.apache.org/jira/browse/SPARK-9925 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 1.6.0 > > > Right now, in our TestSQLContext/TestHiveContext, we use {{override def > numShufflePartitions: Int = this.getConf(SQLConf.SHUFFLE_PARTITIONS, 5)}} to > set {{SHUFFLE_PARTITIONS}}. However, we never put it to SQLConf. So, after we > use {{withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "number")}}, the number > of shuffle partitions will be set back to 200.
[jira] [Assigned] (SPARK-9963) ML RandomForest cleanup: replace predictNodeIndex with predictImpl
[ https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9963: --- Assignee: (was: Apache Spark) > ML RandomForest cleanup: replace predictNodeIndex with predictImpl > -- > > Key: SPARK-9963 > URL: https://issues.apache.org/jira/browse/SPARK-9963 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Trivial > Labels: starter > > Replace ml.tree.impl.RandomForest.predictNodeIndex with Node.predictImpl. > This should be straightforward, but please ping me if anything is unclear.
[jira] [Commented] (SPARK-9963) ML RandomForest cleanup: replace predictNodeIndex with predictImpl
[ https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731702#comment-14731702 ] Apache Spark commented on SPARK-9963: - User 'lkhamsurenl' has created a pull request for this issue: https://github.com/apache/spark/pull/8609 > ML RandomForest cleanup: replace predictNodeIndex with predictImpl > -- > > Key: SPARK-9963 > URL: https://issues.apache.org/jira/browse/SPARK-9963 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Trivial > Labels: starter > > Replace ml.tree.impl.RandomForest.predictNodeIndex with Node.predictImpl. > This should be straightforward, but please ping me if anything is unclear.
[jira] [Assigned] (SPARK-9963) ML RandomForest cleanup: replace predictNodeIndex with predictImpl
[ https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9963: --- Assignee: Apache Spark > ML RandomForest cleanup: replace predictNodeIndex with predictImpl > -- > > Key: SPARK-9963 > URL: https://issues.apache.org/jira/browse/SPARK-9963 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Trivial > Labels: starter > > Replace ml.tree.impl.RandomForest.predictNodeIndex with Node.predictImpl. > This should be straightforward, but please ping me if anything is unclear.
[jira] [Comment Edited] (SPARK-8630) Prevent from checkpointing QueueInputDStream
[ https://issues.apache.org/jira/browse/SPARK-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731671#comment-14731671 ] Tathagata Das edited comment on SPARK-8630 at 9/5/15 12:50 AM: --- @ all in this thread: Are you having this problem with 1.4.1 or 1.5.0 (in RC3)? was (Author: tdas): @ all in this thread: Are you having this problem with 1.4.1 or 1.5.0? > Prevent from checkpointing QueueInputDStream > > > Key: SPARK-8630 > URL: https://issues.apache.org/jira/browse/SPARK-8630 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.4.1, 1.5.0 > > > It's better to prevent from checkpointing QueueInputDStream rather than > failing the application when recovering `QueueInputDStream`, so that people > can find the issue as soon as possible. See SPARK-8553 for example.
[jira] [Commented] (SPARK-8630) Prevent from checkpointing QueueInputDStream
[ https://issues.apache.org/jira/browse/SPARK-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731671#comment-14731671 ] Tathagata Das commented on SPARK-8630: -- @ all in this thread: Are you having this problem with 1.4.1 or 1.5.0? > Prevent from checkpointing QueueInputDStream > > > Key: SPARK-8630 > URL: https://issues.apache.org/jira/browse/SPARK-8630 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.4.1, 1.5.0 > > > It's better to prevent from checkpointing QueueInputDStream rather than > failing the application when recovering `QueueInputDStream`, so that people > can find the issue as soon as possible. See SPARK-8553 for example.
[jira] [Comment Edited] (SPARK-9235) PYSPARK_DRIVER_PYTHON env variable is not set on the YARN Node manager acting as driver in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-9235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731662#comment-14731662 ] bijaya edited comment on SPARK-9235 at 9/5/15 12:41 AM: SPARK_YARN_USER_ENV="PYSPARK_PYTHON= for example. spark-submit --master yarn --deploy-mode cluster --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON= --num-executors 2 --driver-memory 2g --executor-memory 1g --executor-cores 1 .py was (Author: bijaya): SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/srv/software/anaconda/bin/python" works if we are submitting job to yarn_client mode for yarn_cluster mode we have to insert env variable via spark.yarn.appMasterEnv. for example. spark-submit --master yarn --deploy-mode cluster --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON= --num-executors 2 --driver-memory 2g --executor-memory 1g --executor-cores 1 .py > PYSPARK_DRIVER_PYTHON env variable is not set on the YARN Node manager acting > as driver in yarn-cluster mode > > > Key: SPARK-9235 > URL: https://issues.apache.org/jira/browse/SPARK-9235 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.1, 1.5.0 > Environment: CentOS 6.6, python 2.7, Spark 1.4.1 tagged version, YARN > Cluster Manager, CDH 5.4.1 (Hadoop 2.6.0++), Java 1.7 >Reporter: Aaron Glahe >Priority: Minor > > Relates to SPARK-9229 > Env: Spark on YARN, Java 1.7, Centos 6.6, CDH 5.4.1 (Hadoop 2.6.0++), > Anaconda Python 2.7.10 "installed" in /srv/software directory > On a client/submitting machine, we set the PYSPARK_DRIVER_PYTHON env var in > spark-env.sh that pointed the anaconda python executable, which was on every > YARN node: > export PYSPARK_DRIVER_PYTHON='/srv/software/anaconda/bin/python' > side note, export PYSPARK_PYTHON='/srv/software/anaconda/bin/python' was set > as well in the spark-env.sh. 
> run the command: > spark-submit test.py --master yarn --deploy-mode cluster > It appears as though the Node Manager with the DRIVER does not use the > PYSPARK_DRIVER_PYTHON env python, but instead uses the CentOS system default > (which in this case is python 2.6). > Workaround appears to be setting the python path in SPARK_YARN_USER_ENV
[jira] [Commented] (SPARK-9235) PYSPARK_DRIVER_PYTHON env variable is not set on the YARN Node manager acting as driver in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-9235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731662#comment-14731662 ] bijaya commented on SPARK-9235: --- SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/srv/software/anaconda/bin/python" works if we are submitting job to yarn_client mode for yarn_cluster mode we have to insert env variable via spark.yarn.appMasterEnv. for example. spark-submit --master yarn --deploy-mode cluster --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON= --num-executors 2 --driver-memory 2g --executor-memory 1g --executor-cores 1 .py > PYSPARK_DRIVER_PYTHON env variable is not set on the YARN Node manager acting > as driver in yarn-cluster mode > > > Key: SPARK-9235 > URL: https://issues.apache.org/jira/browse/SPARK-9235 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.1, 1.5.0 > Environment: CentOS 6.6, python 2.7, Spark 1.4.1 tagged version, YARN > Cluster Manager, CDH 5.4.1 (Hadoop 2.6.0++), Java 1.7 >Reporter: Aaron Glahe >Priority: Minor > > Relates to SPARK-9229 > Env: Spark on YARN, Java 1.7, Centos 6.6, CDH 5.4.1 (Hadoop 2.6.0++), > Anaconda Python 2.7.10 "installed" in /srv/software directory > On a client/submitting machine, we set the PYSPARK_DRIVER_PYTHON env var in > spark-env.sh that pointed the anaconda python executable, which was on every > YARN node: > export PYSPARK_DRIVER_PYTHON='/srv/software/anaconda/bin/python' > side note, export PYSPARK_PYTHON='/srv/software/anaconda/bin/python' was set > as well in the spark-env.sh. > run the command: > spark-submit test.py --master yarn --deploy-mode cluster > It appears as though the Node Manager with the DRIVER does not use the > PYSPARK_DRIVER_PYTHON env python, but instead uses the CentOS system default > (which in this case is python 2.6). 
> Workaround appears to be setting the python path in SPARK_YARN_USER_ENV
[jira] [Resolved] (SPARK-10402) Add scaladoc for default values of params in ML
[ https://issues.apache.org/jira/browse/SPARK-10402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-10402. --- Resolution: Fixed Fix Version/s: 1.5.1 1.6.0 Issue resolved by pull request 8591 [https://github.com/apache/spark/pull/8591] > Add scaladoc for default values of params in ML > --- > > Key: SPARK-10402 > URL: https://issues.apache.org/jira/browse/SPARK-10402 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: holdenk >Assignee: holdenk >Priority: Minor > Fix For: 1.6.0, 1.5.1 > > > We should make sure the scaladoc for params includes their default values > through the models in ml/
[jira] [Created] (SPARK-10458) Would like to know if a given Spark Context is stopped or currently stopping
Matt Cheah created SPARK-10458: -- Summary: Would like to know if a given Spark Context is stopped or currently stopping Key: SPARK-10458 URL: https://issues.apache.org/jira/browse/SPARK-10458 Project: Spark Issue Type: Improvement Reporter: Matt Cheah Priority: Minor I ran into a case where a thread stopped a Spark Context, specifically when I hit the "kill" link from the Spark standalone UI. There was no real way for another thread to know that the context had stopped and thus should have handled that accordingly. Checking that the SparkEnv is null is one way, but that doesn't handle the case where the context is in the midst of stopping, and stopping the context may actually not be instantaneous - in my case for some reason the DAGScheduler was taking a non-trivial amount of time to stop. Implementation-wise, I'm more or less requesting that the boolean value returned from SparkContext.stopped.get() be made visible in some way. As long as we return the value and not the AtomicBoolean itself (we wouldn't want anyone to be setting this, after all!) it would help client applications check the context's liveness.
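The requested API shape can be sketched in plain Python (illustrative only, not Spark's code, which uses a Scala AtomicBoolean): expose the flag's current value read-only, never the mutable flag itself.

```python
import threading

class StoppableContext:
    """Toy context exposing a read-only 'stopped' view of an internal flag."""
    def __init__(self):
        self._stopped = threading.Event()  # internal, mutable

    def stop(self):
        # possibly slow teardown would happen here before/while flipping the flag
        self._stopped.set()

    @property
    def is_stopped(self):
        # return the value, not the underlying Event, so callers can
        # observe the state but never set it themselves
        return self._stopped.is_set()
```

Other threads can then poll `ctx.is_stopped` instead of inferring liveness from side effects like a null environment.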
[jira] [Assigned] (SPARK-10397) Make Python's SparkContext self-descriptive on "print sc"
[ https://issues.apache.org/jira/browse/SPARK-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10397: Assignee: Apache Spark > Make Python's SparkContext self-descriptive on "print sc" > - > > Key: SPARK-10397 > URL: https://issues.apache.org/jira/browse/SPARK-10397 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.4.0 >Reporter: Sergey Tryuber >Assignee: Apache Spark >Priority: Trivial > > When I execute in Python shell: > {code} > print sc > {code} > I receive something like: > {noformat} > > {noformat} > But this is very inconvenient, especially if a user wants to create a > good-looking and self-descriptive IPython Notebook. He would like to see some > information about his Spark cluster. > In contrast, H2O context does have this feature and it is very helpful.
[jira] [Assigned] (SPARK-10397) Make Python's SparkContext self-descriptive on "print sc"
[ https://issues.apache.org/jira/browse/SPARK-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10397: Assignee: (was: Apache Spark) > Make Python's SparkContext self-descriptive on "print sc" > - > > Key: SPARK-10397 > URL: https://issues.apache.org/jira/browse/SPARK-10397 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.4.0 >Reporter: Sergey Tryuber >Priority: Trivial > > When I execute in Python shell: > {code} > print sc > {code} > I receive something like: > {noformat} > > {noformat} > But this is very inconvenient, especially if a user wants to create a > good-looking and self-descriptive IPython Notebook. He would like to see some > information about his Spark cluster. > In contrast, H2O context does have this feature and it is very helpful.
[jira] [Commented] (SPARK-10397) Make Python's SparkContext self-descriptive on "print sc"
[ https://issues.apache.org/jira/browse/SPARK-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731618#comment-14731618 ] Apache Spark commented on SPARK-10397: -- User 'alexrovner' has created a pull request for this issue: https://github.com/apache/spark/pull/8608 > Make Python's SparkContext self-descriptive on "print sc" > - > > Key: SPARK-10397 > URL: https://issues.apache.org/jira/browse/SPARK-10397 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.4.0 >Reporter: Sergey Tryuber >Priority: Trivial > > When I execute in Python shell: > {code} > print sc > {code} > I receive something like: > {noformat} > > {noformat} > But this is very inconvenient, especially if a user wants to create a > good-looking and self-descriptive IPython Notebook. He would like to see some > information about his Spark cluster. > In contrast, H2O context does have this feature and it is very helpful.
[jira] [Commented] (SPARK-10397) Make Python's SparkContext self-descriptive on "print sc"
[ https://issues.apache.org/jira/browse/SPARK-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731619#comment-14731619 ] Alex Rovner commented on SPARK-10397: - Pull: https://github.com/apache/spark/pull/8608 {noformat} >>> sc {'_accumulatorServer': , '_batchSize': 0, '_callsite': CallSite(function='', file='/Users/alex.rovner/git/spark/python/pyspark/shell.py', linenum=43), '_conf': {'_jconf': JavaObject id=o0}, '_javaAccumulator': JavaObject id=o11, '_jsc': JavaObject id=o8, '_pickled_broadcast_vars': set([]), '_python_includes': [], '_temp_dir': u'/private/var/folders/hj/v4zb0_f159q8mt4w3j8m2_mrgp/T/spark-a9cc47a9-db90-49a3-a82e-263f0b56268c/pyspark-773c7490-2b2d-4418-a030-256a5b9c1fe1', '_unbatched_serializer': PickleSerializer(), 'appName': u'PySparkShell', 'environment': {}, 'master': u'local[*]', 'profiler_collector': None, 'pythonExec': 'python2.7', 'pythonVer': '2.7', 'serializer': AutoBatchedSerializer(PickleSerializer()), 'sparkHome': None} >>> print sc {'_accumulatorServer': , '_batchSize': 0, '_callsite': CallSite(function='', file='/Users/alex.rovner/git/spark/python/pyspark/shell.py', linenum=43), '_conf': {'_jconf': JavaObject id=o0}, '_javaAccumulator': JavaObject id=o11, '_jsc': JavaObject id=o8, '_pickled_broadcast_vars': set([]), '_python_includes': [], '_temp_dir': u'/private/var/folders/hj/v4zb0_f159q8mt4w3j8m2_mrgp/T/spark-a9cc47a9-db90-49a3-a82e-263f0b56268c/pyspark-773c7490-2b2d-4418-a030-256a5b9c1fe1', '_unbatched_serializer': PickleSerializer(), 'appName': u'PySparkShell', 'environment': {}, 'master': u'local[*]', 'profiler_collector': None, 'pythonExec': 'python2.7', 'pythonVer': '2.7', 'serializer': AutoBatchedSerializer(PickleSerializer()), 'sparkHome': None} >>> {noformat} > Make Python's SparkContext self-descriptive on "print sc" > - > > Key: SPARK-10397 > URL: https://issues.apache.org/jira/browse/SPARK-10397 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects 
Versions: 1.4.0 >Reporter: Sergey Tryuber >Priority: Trivial > > When I execute in Python shell: > {code} > print sc > {code} > I receive something like: > {noformat} > > {noformat} > But this is very inconvenient, especially if a user wants to create a > good-looking and self-descriptive IPython Notebook. He would like to see some > information about his Spark cluster. > In contrast, H2O context does have this feature and it is very helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
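A fix along the lines of this ticket would give `SparkContext` a `__repr__` that summarizes the context instead of dumping `__dict__`. The sketch below uses a stand-in class (`ExampleContext` is hypothetical, not the real pyspark class) purely to illustrate the idea:

```python
class ExampleContext:
    """Stand-in for pyspark.context.SparkContext, used only to illustrate
    the kind of self-descriptive __repr__ the ticket asks for."""

    def __init__(self, master, app_name, version):
        self.master = master
        self.appName = app_name
        self.version = version

    def __repr__(self):
        # A short, readable summary instead of the default object repr.
        return "<SparkContext master={0} appName={1} version={2}>".format(
            self.master, self.appName, self.version)

sc = ExampleContext("local[*]", "PySparkShell", "1.6.0")
print(sc)  # <SparkContext master=local[*] appName=PySparkShell version=1.6.0>
```

Since `__str__` falls back to `__repr__`, both `sc` at the REPL prompt and `print sc` would show the summary, which is what an IPython Notebook user would see inline.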
[jira] [Commented] (SPARK-10447) Upgrade pyspark to use py4j 0.9
[ https://issues.apache.org/jira/browse/SPARK-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731598#comment-14731598 ] Justin Uang commented on SPARK-10447: - Sounds good > Upgrade pyspark to use py4j 0.9 > --- > > Key: SPARK-10447 > URL: https://issues.apache.org/jira/browse/SPARK-10447 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.4.1 >Reporter: Justin Uang > > This was recently released, and it has many improvements, especially the > following: > {quote} > Python side: IDEs and interactive interpreters such as IPython can now get > help text/autocompletion for Java classes, objects, and members. This makes > Py4J an ideal tool to explore complex Java APIs (e.g., the Eclipse API). > Thanks to @jonahkichwacoders > {quote} > Normally we wrap all the APIs in spark, but for the ones that aren't, this > would make it easier to offroad by using the java proxy objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10447) Upgrade pyspark to use py4j 0.9
[ https://issues.apache.org/jira/browse/SPARK-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731597#comment-14731597 ] holdenk commented on SPARK-10447: - Sure, I'll ping you when I've got the PR ready (probably sometime this long weekend) if that's good for you? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10436) spark-submit overwrites spark.files defaults with the job script filename
[ https://issues.apache.org/jira/browse/SPARK-10436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731593#comment-14731593 ] holdenk commented on SPARK-10436: - I've worked on some of the older versions of the submit scripts and should probably refresh my knowledge, I can tackle this unless someone else is already planning on it :) > spark-submit overwrites spark.files defaults with the job script filename > - > > Key: SPARK-10436 > URL: https://issues.apache.org/jira/browse/SPARK-10436 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.4.0 > Environment: Ubuntu, Spark 1.4.0 Standalone >Reporter: axel dahl >Priority: Minor > Labels: easyfix, feature > > In my spark-defaults.conf I have configured a set of libraries to be > uploaded to my Spark 1.4.0 Standalone cluster. The entry appears as: > spark.files libarary.zip,file1.py,file2.py > When I execute spark-submit -v test.py > I see that spark-submit reads the defaults correctly, but that it overwrites > the "spark.files" default entry and replaces it with the name of the job > script, i.e. "test.py". > This behavior doesn't seem intuitive. test.py should be added to the spark > working folder, but it should not overwrite the "spark.files" defaults. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
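The behaviour the reporter expects, appending the job script to the configured `spark.files` rather than replacing the list, can be sketched as follows. The helper name `merge_spark_files` is hypothetical, not actual spark-submit code:

```python
def merge_spark_files(defaults, job_script):
    # Intended behaviour per the report: keep the configured defaults
    # from spark-defaults.conf and append the job script, rather than
    # overwriting the comma-separated list.
    configured = [f for f in defaults.split(",") if f]
    return ",".join(configured + [job_script])

print(merge_spark_files("library.zip,file1.py,file2.py", "test.py"))
# library.zip,file1.py,file2.py,test.py
```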
[jira] [Commented] (SPARK-10447) Upgrade pyspark to use py4j 0.9
[ https://issues.apache.org/jira/browse/SPARK-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731592#comment-14731592 ] Justin Uang commented on SPARK-10447: - Sure, I wouldn't mind doing the code review. Can you add me? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10447) Upgrade pyspark to use py4j 0.9
[ https://issues.apache.org/jira/browse/SPARK-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731591#comment-14731591 ] holdenk commented on SPARK-10447: - I can give this a shot if no one else is interested in doing this (I've been wrangling some py4j bits with Sparkling Pandas). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10013) Remove Java assert from Java unit tests
[ https://issues.apache.org/jira/browse/SPARK-10013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10013: Assignee: Apache Spark > Remove Java assert from Java unit tests > --- > > Key: SPARK-10013 > URL: https://issues.apache.org/jira/browse/SPARK-10013 > Project: Spark > Issue Type: Test > Components: ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Apache Spark > > We should use assertTrue, etc. instead to make sure the asserts are not > ignored in tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10013) Remove Java assert from Java unit tests
[ https://issues.apache.org/jira/browse/SPARK-10013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10013: Assignee: (was: Apache Spark) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10013) Remove Java assert from Java unit tests
[ https://issues.apache.org/jira/browse/SPARK-10013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731585#comment-14731585 ] Apache Spark commented on SPARK-10013: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/8607 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
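The pitfall the ticket describes is the same one Python has with bare `assert` under `python -O`: Java's `assert` statement is a no-op unless the JVM runs with `-ea`, so test assertions written that way can silently pass. A Python sketch of the JUnit-style fix (hypothetical test case, for illustration only):

```python
import unittest

class VectorSuite(unittest.TestCase):
    # assertEqual/assertTrue always execute, unlike a bare `assert`
    # statement, which can be stripped (python -O here; a JVM started
    # without -ea in the Java case the ticket describes).
    def test_size(self):
        v = [1.0, 2.0, 3.0]
        self.assertEqual(len(v), 3)
        self.assertTrue(all(x > 0 for x in v))

suite = unittest.defaultTestLoader.loadTestsFromTestCase(VectorSuite)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```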
[jira] [Commented] (SPARK-8630) Prevent from checkpointing QueueInputDStream
[ https://issues.apache.org/jira/browse/SPARK-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731556#comment-14731556 ] Tathagata Das commented on SPARK-8630: -- I get your point. I guess it will be good to revert this patch and just generate a WARNING. A more principled solution would be to allow users to separately enable RDD checkpointing and DStream checkpointing, so that they can only enable RDD checkpointing and keep DStream checkpointing disabled (for queueStream to work). > Prevent from checkpointing QueueInputDStream > > > Key: SPARK-8630 > URL: https://issues.apache.org/jira/browse/SPARK-8630 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.4.1, 1.5.0 > > > It's better to prevent from checkpointing QueueInputDStream rather than > failing the application when recovering `QueueInputDStream`, so that people > can find the issue as soon as possible. See SPARK-8553 for example. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
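The reverted behaviour sketched in the comment, warning rather than failing when checkpointing is enabled for a queue-backed stream, could look roughly like the following. These are toy stand-ins, not the real streaming classes:

```python
import warnings

class QueueInputDStream:
    # Toy stand-in: queue-backed streams cannot be recovered from a
    # checkpoint, so they are flagged as not checkpointable.
    checkpointable = False

def validate_checkpointing(dstream, checkpoint_dir):
    # Warn instead of raising, so existing applications keep running
    # while still surfacing the problem early.
    if checkpoint_dir and not getattr(dstream, "checkpointable", True):
        warnings.warn("queueStream does not support checkpointing; "
                      "recovery from this checkpoint will fail")

validate_checkpointing(QueueInputDStream(), "/tmp/ckpt")  # emits a UserWarning
```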
[jira] [Updated] (SPARK-10311) In cluster mode, AppId and AttemptId should be updated when ApplicationMaster is new
[ https://issues.apache.org/jira/browse/SPARK-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10311: -- Assignee: meiyoula > In cluster mode, AppId and AttemptId should be updated when ApplicationMaster > is new > --- > > Key: SPARK-10311 > URL: https://issues.apache.org/jira/browse/SPARK-10311 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.4.1, 1.5.0 >Reporter: meiyoula >Assignee: meiyoula > Fix For: 1.6.0, 1.5.1 > > > When I start a streaming app with checkpoint data in yarn-cluster mode, the > appId and attemptId are old (they belong to the app that first created the > checkpoint data), and the event log writes into the old file name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10311) In cluster mode, AppId and AttemptId should be updated when ApplicationMaster is new
[ https://issues.apache.org/jira/browse/SPARK-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-10311. --- Resolution: Fixed Fix Version/s: 1.5.1 1.6.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10420) Implementing Reactive Streams based Spark Streaming Receiver
[ https://issues.apache.org/jira/browse/SPARK-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10420: -- Target Version/s: 1.6.0 (was: ) > Implementing Reactive Streams based Spark Streaming Receiver > > > Key: SPARK-10420 > URL: https://issues.apache.org/jira/browse/SPARK-10420 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Nilanjan Raychaudhuri >Priority: Minor > > Hello TD, > This is probably the last bit of the back-pressure story, implementing > ReactiveStreams based Spark streaming receivers. After discussing about this > with my Typesafe team we came up with the following design document > https://docs.google.com/document/d/1lGQKXfNznd5SPuQigvCdLsudl-gcvWKuHWr0Bpn3y30/edit?usp=sharing > Could you please take a look at this when you get a chance? > Thanks > Nilanjan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10433) Gradient boosted trees
[ https://issues.apache.org/jira/browse/SPARK-10433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731533#comment-14731533 ] DB Tsai commented on SPARK-10433: - [~sowen] I can confirm that this should be fixed in 1.5 > Gradient boosted trees > -- > > Key: SPARK-10433 > URL: https://issues.apache.org/jira/browse/SPARK-10433 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.4.1, 1.5.0 >Reporter: Sean Owen > > (Sorry to say I don't have any leads on a fix, but this was reported by three > different people and I confirmed it at fairly close range, so think it's > legitimate:) > This is probably best explained in the words from the mailing list thread at > http://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/%3C55E84380.2000408%40gmail.com%3E > . Matt Forbes says: > {quote} > I am training a boosted trees model on a couple million input samples (with > around 300 features) and am noticing that the input size of each stage is > increasing each iteration. For each new tree, the first step seems to be > building the decision tree metadata, which does a .count() on the input data, > so this is the step I've been using to track the input size changing. Here is > what I'm seeing: > {quote} > {code} > count at DecisionTreeMetadata.scala:111 > 1. Input Size / Records: 726.1 MB / 1295620 > 2. Input Size / Records: 106.9 GB / 64780816 > 3. Input Size / Records: 160.3 GB / 97171224 > 4. Input Size / Records: 214.8 GB / 129680959 > 5. Input Size / Records: 268.5 GB / 162533424 > > Input Size / Records: 1912.6 GB / 1382017686 > > {code} > {quote} > This step goes from taking less than 10s up to 5 minutes by the 15th or so > iteration. I'm not quite sure what could be causing this. I am passing a > memory-only cached RDD[LabeledPoint] to GradientBoostedTrees.train > {quote} > Johannes Bauer showed me a very similar problem. 
> Peter Rudenko offers this sketch of a reproduction: > {code} > val boostingStrategy = BoostingStrategy.defaultParams("Classification") > boostingStrategy.setNumIterations(30) > boostingStrategy.setLearningRate(1.0) > boostingStrategy.treeStrategy.setMaxDepth(3) > boostingStrategy.treeStrategy.setMaxBins(128) > boostingStrategy.treeStrategy.setSubsamplingRate(1.0) > boostingStrategy.treeStrategy.setMinInstancesPerNode(1) > boostingStrategy.treeStrategy.setUseNodeIdCache(true) > boostingStrategy.treeStrategy.setCategoricalFeaturesInfo( > > mapAsJavaMap(categoricalFeatures).asInstanceOf[java.util.Map[java.lang.Integer, > java.lang.Integer]]) > val model = GradientBoostedTrees.train(instances, boostingStrategy) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid
[ https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10304: -- Target Version/s: (was: 1.6.0, 1.5.1) > Partition discovery does not throw an exception if the dir structure is > invalid > --- > > Key: SPARK-10304 > URL: https://issues.apache.org/jira/browse/SPARK-10304 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Zhan Zhang >Priority: Critical > > I have a dir structure like {{/path/table1/partition_column=1/}}. When I try > to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if > it is stored as ORC, there will be the following NPE. But, if it is Parquet, > we even can return rows. We should complain to users about the dir struct > because {{table1}} does not meet our format. > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in > stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 > (TID 3504, 10.0.195.227): java.lang.NullPointerException > at > org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466) > at > org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > 
org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316) > at > org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid
[ https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10304: -- Target Version/s: 1.6.0, 1.5.1 (was: 1.5.1,1.6.0) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid
[ https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10304: -- Target Version/s: 1.6.0, 1.5.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
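The complaint the ticket asks for, rejecting a table path whose child directories are not of the `column=value` form, could look roughly like this hypothetical helper:

```python
import re

# A partition directory must look like "column=value", e.g.
# "partition_column=1" in the reporter's layout.
_PARTITION_DIR = re.compile(r"^[^=/]+=[^/]+$")

def validate_partition_dirs(dir_names):
    # Raise on any directory that does not follow the partition format,
    # instead of silently returning bad rows (Parquet) or an NPE (ORC).
    bad = [d for d in dir_names if not _PARTITION_DIR.match(d)]
    if bad:
        raise ValueError("invalid partition directories: " + ", ".join(bad))

validate_partition_dirs(["partition_column=1", "partition_column=2"])  # passes
# validate_partition_dirs(["table1"]) would raise ValueError, matching the
# /path/table1/partition_column=1/ case where load("/path/") should complain.
```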
[jira] [Resolved] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage
[ https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-10454. --- Resolution: Fixed Fix Version/s: 1.5.1 1.6.0 Target Version/s: 1.6.0, 1.5.1 > Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause > multiple concurrent attempts for the same map stage > - > > Key: SPARK-10454 > URL: https://issues.apache.org/jira/browse/SPARK-10454 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.5.1 >Reporter: Pete Robbins >Assignee: Pete Robbins >Priority: Critical > Labels: flaky-test > Fix For: 1.6.0, 1.5.1 > > > test case fails intermittently in Jenkins. > For eg, see the following builds- > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/ > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage
[ https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10454: -- Priority: Critical (was: Minor) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage
[ https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10454: -- Assignee: Pete Robbins -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage
[ https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10454: -- Labels: flaky-test (was: ) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9669) Support PySpark with Mesos Cluster mode
[ https://issues.apache.org/jira/browse/SPARK-9669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-9669. -- Resolution: Fixed Fix Version/s: 1.6.0 > Support PySpark with Mesos Cluster mode > --- > > Key: SPARK-9669 > URL: https://issues.apache.org/jira/browse/SPARK-9669 > Project: Spark > Issue Type: New Feature > Components: Mesos, PySpark >Affects Versions: 1.5.0 >Reporter: Timothy Chen >Assignee: Timothy Chen > Fix For: 1.6.0 > > > PySpark with cluster mode with Mesos is not yet supported. > We need to enable it and make sure it's able to launch Pyspark jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10450) Minor SQL style, format, typo, readability fixes
[ https://issues.apache.org/jira/browse/SPARK-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-10450. --- Resolution: Fixed Fix Version/s: 1.6.0 > Minor SQL style, format, typo, readability fixes > > > Key: SPARK-10450 > URL: https://issues.apache.org/jira/browse/SPARK-10450 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > Fix For: 1.6.0 > > > This JIRA isn't exactly tied to one particular patch. Like SPARK-10003 it's > more of a continuous process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save
[ https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731529#comment-14731529 ] Xiangrui Meng commented on SPARK-10199: --- The improvement numbers also depend on the model size. In unit tests, the model sizes are usually very small, so the overhead of reflection becomes significant. With real models, either the model itself is small, or the model is large and the overhead of reflection becomes insignificant. Keeping the code simple and easy to understand is also quite important. +[~josephkb] > Avoid using reflections for parquet model save > -- > > Key: SPARK-10199 > URL: https://issues.apache.org/jira/browse/SPARK-10199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Feynman Liang >Priority: Minor > > These items are not high priority since the overhead of writing to Parquet is > much greater than that of runtime reflection. > Multiple model save/load implementations in MLlib use case classes to infer a schema for the > data frame saved to Parquet. However, inferring a schema from case classes or > tuples uses [runtime > reflection|https://github.com/apache/spark/blob/d7b4c095271c36fcc7f9ded267ecf5ec66fac803/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L361], > which is unnecessary since the types are already known at the time `save` is > called. > It would be better to specify the schema for the data frame directly > using {{sqlContext.createDataFrame(dataRDD, schema)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
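[Editor's note] The {{sqlContext.createDataFrame(dataRDD, schema)}} approach discussed above can be sketched as follows. This is a hedged illustration, not the actual MLlib save code: the field names (`weights`, `intercept`) and the in-scope values (`sc`, `sqlContext`, `model`, `savePath`) are hypothetical stand-ins for whatever a concrete model's save method would use.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// The schema is known statically at the time save is called, so declare it
// explicitly instead of inferring it from a case class via runtime reflection.
val schema = StructType(Seq(
  StructField("weights", ArrayType(DoubleType, containsNull = false), nullable = false),
  StructField("intercept", DoubleType, nullable = false)))

// Build Rows matching the declared schema from the model's fields.
val dataRDD = sc.parallelize(Seq(Row(model.weights.toArray.toSeq, model.intercept)))

// No reflection-based schema inference happens on this path.
sqlContext.createDataFrame(dataRDD, schema).write.parquet(savePath)
```

The trade-off Xiangrui raises is visible here: the explicit schema saves a small fixed reflection cost but adds boilerplate that must be kept in sync with the model's fields by hand.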
[jira] [Updated] (SPARK-10176) Show partially analyzed plan when checkAnswer df fails to resolve
[ https://issues.apache.org/jira/browse/SPARK-10176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10176: -- Fix Version/s: (was: 1.5.0) 1.6.0 > Show partially analyzed plan when checkAnswer df fails to resolve > - > > Key: SPARK-10176 > URL: https://issues.apache.org/jira/browse/SPARK-10176 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Reporter: Michael Armbrust >Assignee: Michael Armbrust > Fix For: 1.6.0 > > > It would be much easier to debug test failures if we could see the failed > plan instead of just the user friendly error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10176) Show partially analyzed plan when checkAnswer df fails to resolve
[ https://issues.apache.org/jira/browse/SPARK-10176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-10176. --- Resolution: Fixed Fix Version/s: 1.5.0 > Show partially analyzed plan when checkAnswer df fails to resolve > - > > Key: SPARK-10176 > URL: https://issues.apache.org/jira/browse/SPARK-10176 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Reporter: Michael Armbrust >Assignee: Michael Armbrust > Fix For: 1.5.0 > > > It would be much easier to debug test failures if we could see the failed > plan instead of just the user friendly error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10176) Show partially analyzed plan when checkAnswer df fails to resolve
[ https://issues.apache.org/jira/browse/SPARK-10176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10176: -- Target Version/s: 1.6.0 (was: 1.5.0) > Show partially analyzed plan when checkAnswer df fails to resolve > - > > Key: SPARK-10176 > URL: https://issues.apache.org/jira/browse/SPARK-10176 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Reporter: Michael Armbrust >Assignee: Michael Armbrust > Fix For: 1.6.0 > > > It would be much easier to debug test failures if we could see the failed > plan instead of just the user friendly error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10311) In cluster mode, AppId and AttemptId should be update when ApplicationMaster is new
[ https://issues.apache.org/jira/browse/SPARK-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10311: -- Affects Version/s: 1.5.0 1.4.1 > In cluster mode, AppId and AttemptId should be update when ApplicationMaster > is new > --- > > Key: SPARK-10311 > URL: https://issues.apache.org/jira/browse/SPARK-10311 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.4.1, 1.5.0 >Reporter: meiyoula > > When I start a streaming app with checkpoint data in yarn-cluster mode, the > appId and attempId are old(which app first create the checkpoint data), and > the event log writes into the old file name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10311) In cluster mode, AppId and AttemptId should be update when ApplicationMaster is new
[ https://issues.apache.org/jira/browse/SPARK-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10311: -- Target Version/s: 1.6.0, 1.5.1 > In cluster mode, AppId and AttemptId should be update when ApplicationMaster > is new > --- > > Key: SPARK-10311 > URL: https://issues.apache.org/jira/browse/SPARK-10311 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.4.1, 1.5.0 >Reporter: meiyoula > > When I start a streaming app with checkpoint data in yarn-cluster mode, the > appId and attempId are old(which app first create the checkpoint data), and > the event log writes into the old file name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-10457) Unable to connect to MySQL with the DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-10457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mariano Simone closed SPARK-10457. -- Resolution: Fixed Found the solution: spark.executor.extraClassPath needed to be configured. > Unable to connect to MySQL with the DataFrame API > - > > Key: SPARK-10457 > URL: https://issues.apache.org/jira/browse/SPARK-10457 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: Linux singularity 3.13.0-63-generic #103-Ubuntu SMP Fri > Aug 14 21:42:59 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux > Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60) > "org.apache.spark" %% "spark-core"% "1.4.1" % "provided", > "org.apache.spark" % "spark-sql_2.10"% "1.4.1" % "provided", > "org.apache.spark" % "spark-streaming_2.10" % "1.4.1" % "provided", > "org.apache.spark" %% "spark-streaming-kafka" % "1.4.1", > "mysql"% "mysql-connector-java" % "5.1.36" >Reporter: Mariano Simone > > I'm getting this error every time I try to create a dataframe using jdbc: > java.sql.SQLException: No suitable driver found for > jdbc:mysql://localhost:3306/test > What I have so far: > standard sbt project. > Added the dep. 
on mysql-connector to build.sbt like this: > "mysql"% "mysql-connector-java" % "5.1.36" > The code that creates the df: > val url = "jdbc:mysql://localhost:3306/test" > val table = "test_table" > val properties = new Properties > properties.put("user", "123") > properties.put("password", "123") > properties.put("driver", "com.mysql.jdbc.Driver") > val tiers = sqlContext.read.jdbc(url, table, properties) > I also loaded the jar like this: > streamingContext.sparkContext.addJar("mysql-connector-java-5.1.36.jar") > This is the back trace of the exception being thrown: > 15/09/04 18:37:40 ERROR JobScheduler: Error running job streaming job > 144140266 ms.0 > java.sql.SQLException: No suitable driver found for > jdbc:mysql://localhost:3306/test > at java.sql.DriverManager.getConnection(DriverManager.java:689) > at java.sql.DriverManager.getConnection(DriverManager.java:208) > at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:118) > at org.apache.spark.sql.jdbc.JDBCRelation.(JDBCRelation.scala:128) > at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:200) > at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:130) > at com.playtika.etl.Application$.processRDD(Application.scala:69) > at > com.playtika.etl.Application$$anonfun$processStream$1.apply(Application.scala:52) > at > com.playtika.etl.Application$$anonfun$processStream$1.apply(Application.scala:51) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40) > 
at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at scala.util.Try$.apply(Try.scala:161) > at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34) > at > org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193) > at > org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193) > at > org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Let me know if more data is needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
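[Editor's note] The one-line resolution above ("spark.executor.extraClassPath needed configuration") is terse; a hedged sketch of what that configuration typically looks like follows. The jar path below is an example only — it must point at wherever the MySQL connector jar actually lives on each node, and the jar generally needs to be on the classpath of both the driver and the executors, since `addJar` alone does not make the driver class visible to `java.sql.DriverManager`.

```
# spark-defaults.conf -- example paths, adjust to the real jar location;
# the same values can be passed via --conf on spark-submit instead.
spark.driver.extraClassPath    /opt/jars/mysql-connector-java-5.1.36.jar
spark.executor.extraClassPath  /opt/jars/mysql-connector-java-5.1.36.jar
```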
[jira] [Updated] (SPARK-10457) Unable to connect to MySQL with the DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-10457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mariano Simone updated SPARK-10457: --- Description: I'm getting this error everytime I try to create a dataframe using jdbc: java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost:3306/test What I have so far: standart sbt project. Added the dep. on mysql-connector to build.sbt like this: "mysql"% "mysql-connector-java" % "5.1.36" The code that creates the df: val url = "jdbc:mysql://localhost:3306/test" val table = "test_table" val properties = new Properties properties.put("user", "123") properties.put("password", "123") properties.put("driver", "com.mysql.jdbc.Driver") val tiers = sqlContext.read.jdbc(url, table, properties) I also loaded the jar like this: streamingContext.sparkContext.addJar("mysql-connector-java-5.1.36.jar") This is the back trace of the exception being thrown: 15/09/04 18:37:40 ERROR JobScheduler: Error running job streaming job 144140266 ms.0 java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost:3306/test at java.sql.DriverManager.getConnection(DriverManager.java:689) at java.sql.DriverManager.getConnection(DriverManager.java:208) at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:118) at org.apache.spark.sql.jdbc.JDBCRelation.(JDBCRelation.scala:128) at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:200) at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:130) at com.playtika.etl.Application$.processRDD(Application.scala:69) at com.playtika.etl.Application$$anonfun$processStream$1.apply(Application.scala:52) at com.playtika.etl.Application$$anonfun$processStream$1.apply(Application.scala:51) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Let me know if more data is needed. was: I'm getting this error everytime I try to create a dataframe using jdbc: java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost:3306/test What I have so far: standart sbt project. Added the dep. 
on mysql-connector to build.sbt like this: "mysql"% "mysql-connector-java" % "5.1.36" The code that creates the df: val url = "jdbc:mysql://localhost:3306/test" val table = "test_table" val properties = new Properties properties.put("user", "123") properties.put("password", "123") properties.put("driver", "com.mysql.jdbc.Driver") val tiers = sqlContext.read.jdbc(url, table, properties) I also loaded the jar like this: streamingContext.sparkContext.addJar("mysql-connector-java-5.1.36.jar") This is the back trace of the exception being thrown: 15/09/04 18:37:40 ERROR JobScheduler: Error running job streaming job 144140266 ms.0 java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost:3306/test at java.sql.DriverManager.getConnection(DriverManager.java:689) at java.sql.DriverManager.getConnection(DriverManager.java:208) at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:118) at org.apache.spark.sql.jdbc.JDBCRelation.(JDBCRelation.scala:128) at org.apache.spark.
[jira] [Created] (SPARK-10457) Unable to connect to MySQL with the DataFrame API
Mariano Simone created SPARK-10457: -- Summary: Unable to connect to MySQL with the DataFrame API Key: SPARK-10457 URL: https://issues.apache.org/jira/browse/SPARK-10457 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Environment: Linux singularity 3.13.0-63-generic #103-Ubuntu SMP Fri Aug 14 21:42:59 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60) "org.apache.spark" %% "spark-core"% "1.4.1" % "provided", "org.apache.spark" % "spark-sql_2.10"% "1.4.1" % "provided", "org.apache.spark" % "spark-streaming_2.10" % "1.4.1" % "provided", "org.apache.spark" %% "spark-streaming-kafka" % "1.4.1", "mysql"% "mysql-connector-java" % "5.1.36" Reporter: Mariano Simone I'm getting this error every time I try to create a dataframe using jdbc: java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost:3306/test What I have so far: standard sbt project. Added the dep. on mysql-connector to build.sbt like this: "mysql"% "mysql-connector-java" % "5.1.36" The code that creates the df: val url = "jdbc:mysql://localhost:3306/test" val table = "test_table" val properties = new Properties properties.put("user", "123") properties.put("password", "123") properties.put("driver", "com.mysql.jdbc.Driver") val tiers = sqlContext.read.jdbc(url, table, properties) I also loaded the jar like this: streamingContext.sparkContext.addJar("mysql-connector-java-5.1.36.jar") This is the backtrace of the exception being thrown: 15/09/04 18:37:40 ERROR JobScheduler: Error running job streaming job 144140266 ms.0 java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost:3306/test at java.sql.DriverManager.getConnection(DriverManager.java:689) at java.sql.DriverManager.getConnection(DriverManager.java:208) at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:118) at org.apache.spark.sql.jdbc.JDBCRelation.(JDBCRelation.scala:128) at 
org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:200) at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:130) at com.playtika.etl.Application$.processRDD(Application.scala:69) at com.playtika.etl.Application$$anonfun$processStream$1.apply(Application.scala:52) at com.playtika.etl.Application$$anonfun$processStream$1.apply(Application.scala:51) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To 
unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10456) upgrade java 7 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731436#comment-14731436 ] shane knapp edited comment on SPARK-10456 at 9/4/15 9:46 PM: - looks like we'll be installing 7u79 (we're at 7u71 currently). was (Author: shaneknapp): looks like we'll be installing 7u79 (we're at 7u51 currently). > upgrade java 7 on amplab jenkins workers > > > Key: SPARK-10456 > URL: https://issues.apache.org/jira/browse/SPARK-10456 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp >Assignee: shane knapp > Labels: build > > our java 7 installation is really old (from last september). update this to > the latest&greatest java 7 jdk. > please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-10455) install java 8 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp closed SPARK-10455. --- FIN! > install java 8 on amplab jenkins workers > > > Key: SPARK-10455 > URL: https://issues.apache.org/jira/browse/SPARK-10455 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp >Assignee: shane knapp > > install java 8 on all jenkins workers. > and just for clarification: we want the 64-bit version, yes? > please assign this to me, thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10455) install java 8 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp resolved SPARK-10455. - Resolution: Done > install java 8 on amplab jenkins workers > > > Key: SPARK-10455 > URL: https://issues.apache.org/jira/browse/SPARK-10455 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp >Assignee: shane knapp > > install java 8 on all jenkins workers. > and just for clarification: we want the 64-bit version, yes? > please assign this to me, thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10452) Pyspark worker security issue
[ https://issues.apache.org/jira/browse/SPARK-10452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731332#comment-14731332 ] Marcelo Vanzin edited comment on SPARK-10452 at 9/4/15 9:43 PM: If you need your workers to run as your user, you need to configure YARN to use Kerberos. was (Author: vanzin): If you need your workers to run as you user, you need to configure YARN to use Kerberos. > Pyspark worker security issue > - > > Key: SPARK-10452 > URL: https://issues.apache.org/jira/browse/SPARK-10452 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.0 > Environment: Spark 1.4.0 running on hadoop 2.5.2. >Reporter: Michael Procopio >Priority: Critical > > The python worker launched by the executor is given the credentials used to > launch yarn. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10456) upgrade java 7 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731480#comment-14731480 ] shane knapp commented on SPARK-10456: - ok, 79 is installed but i will wait until downtime to switch the symlinks over. here's the command i will be running when that time comes: pssh -h jenkins_workers.txt "cd /usr/java; rm -f latest; rm -f default; ln -s jdk1.7.0_79 latest; ln -s latest default" > upgrade java 7 on amplab jenkins workers > > > Key: SPARK-10456 > URL: https://issues.apache.org/jira/browse/SPARK-10456 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp >Assignee: shane knapp > Labels: build > > our java 7 installation is really old (from last september). update this to > the latest&greatest java 7 jdk. > please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10402) Add scaladoc for default values of params in ML
[ https://issues.apache.org/jira/browse/SPARK-10402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10402: -- Shepherd: Joseph K. Bradley Assignee: holdenk Target Version/s: 1.6.0, 1.5.1 > Add scaladoc for default values of params in ML > --- > > Key: SPARK-10402 > URL: https://issues.apache.org/jira/browse/SPARK-10402 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: holdenk >Assignee: holdenk >Priority: Minor > > We should make sure the scaladoc for params includes their default values > through the models in ml/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10455) install java 8 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731478#comment-14731478 ] shane knapp commented on SPARK-10455: - it's installed in: /usr/java/jdk1.8.0_60 i'll email the dev@ list and let everyone know. > install java 8 on amplab jenkins workers > > > Key: SPARK-10455 > URL: https://issues.apache.org/jira/browse/SPARK-10455 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp >Assignee: shane knapp > > install java 8 on all jenkins workers. > and just for clarification: we want the 64-bit version, yes? > please assign this to me, thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9963) ML RandomForest cleanup: replace predictNodeIndex with predictImpl
[ https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731468#comment-14731468 ] Joseph K. Bradley commented on SPARK-9963: -- Yep, that first case in the if-else is for the right-most bin with range [maxSplitValue, +inf] > ML RandomForest cleanup: replace predictNodeIndex with predictImpl > -- > > Key: SPARK-9963 > URL: https://issues.apache.org/jira/browse/SPARK-9963 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Trivial > Labels: starter > > Replace ml.tree.impl.RandomForest.predictNodeIndex with Node.predictImpl. > This should be straightforward, but please ping me if anything is unclear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9963) ML RandomForest cleanup: replace predictNodeIndex with predictImpl
[ https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731462#comment-14731462 ] Joseph K. Bradley commented on SPARK-9963: -- Sorry for the slow response! (I've been traveling.) Option 2 sounds best. It can resemble the current predictImpl, but can use the version of shouldGoLeft taking binned feature values. > ML RandomForest cleanup: replace predictNodeIndex with predictImpl > -- > > Key: SPARK-9963 > URL: https://issues.apache.org/jira/browse/SPARK-9963 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Trivial > Labels: starter > > Replace ml.tree.impl.RandomForest.predictNodeIndex with Node.predictImpl. > This should be straightforward, but please ping me if anything is unclear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10456) upgrade java 7 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-10456: --- Assignee: shane knapp > upgrade java 7 on amplab jenkins workers > > > Key: SPARK-10456 > URL: https://issues.apache.org/jira/browse/SPARK-10456 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp >Assignee: shane knapp > Labels: build > > our java 7 installation is really old (from last september). update this to > the latest&greatest java 7 jdk. > please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10455) install java 8 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731440#comment-14731440 ] Josh Rosen commented on SPARK-10455: Yep, I think we want the 64-bit version. > install java 8 on amplab jenkins workers > > > Key: SPARK-10455 > URL: https://issues.apache.org/jira/browse/SPARK-10455 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp >Assignee: shane knapp > > install java 8 on all jenkins workers. > and just for clarification: we want the 64-bit version, yes? > please assign this to me, thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10455) install java 8 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-10455: --- Assignee: shane knapp > install java 8 on amplab jenkins workers > > > Key: SPARK-10455 > URL: https://issues.apache.org/jira/browse/SPARK-10455 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp >Assignee: shane knapp > > install java 8 on all jenkins workers. > and just for clarification: we want the 64-bit version, yes? > please assign this to me, thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10456) upgrade java 7 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731436#comment-14731436 ] shane knapp commented on SPARK-10456: - looks like we'll be installing 7u79 (we're at 7u51 currently). > upgrade java 7 on amplab jenkins workers > > > Key: SPARK-10456 > URL: https://issues.apache.org/jira/browse/SPARK-10456 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp > Labels: build > > our java 7 installation is really old (from last september). update this to > the latest&greatest java 7 jdk. > please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10455) install java 8 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731428#comment-14731428 ] shane knapp commented on SPARK-10455: - looks like i'll be installing java 8u60. > install java 8 on amplab jenkins workers > > > Key: SPARK-10455 > URL: https://issues.apache.org/jira/browse/SPARK-10455 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp > > install java 8 on all jenkins workers. > and just for clarification: we want the 64-bit version, yes? > please assign this to me, thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10456) upgrade java 7 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp updated SPARK-10456: Description: our java 7 installation is really old (from last september). update this to the latest&greatest java 7 jdk. please assign this to me. was:our java 7 installation is really old (from last september). update this to the latest&greatest java 7 jdk > upgrade java 7 on amplab jenkins workers > > > Key: SPARK-10456 > URL: https://issues.apache.org/jira/browse/SPARK-10456 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp > Labels: build > > our java 7 installation is really old (from last september). update this to > the latest&greatest java 7 jdk. > please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10456) upgrade java 7 on amplab jenkins workers
shane knapp created SPARK-10456: --- Summary: upgrade java 7 on amplab jenkins workers Key: SPARK-10456 URL: https://issues.apache.org/jira/browse/SPARK-10456 Project: Spark Issue Type: Task Components: Build Reporter: shane knapp our java 7 installation is really old (from last september). update this to the latest&greatest java 7 jdk -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10455) install java 8 on amplab jenkins workers
shane knapp created SPARK-10455: --- Summary: install java 8 on amplab jenkins workers Key: SPARK-10455 URL: https://issues.apache.org/jira/browse/SPARK-10455 Project: Spark Issue Type: Task Components: Build Reporter: shane knapp install java 8 on all jenkins workers. and just for clarification: we want the 64-bit version, yes? please assign this to me, thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731415#comment-14731415 ] Imran Rashid commented on SPARK-4105: - [~mvherweg] Do you know if the error occurred after there was already a stage retry? If so, then this might just be a symptom of SPARK-8029. You would know if earlier in the logs, you see a FetchFailedException which is *not* related to snappy exceptions. I think that is the first report of this bug since SPARK-7660, which we were really hoping fixed this issue, so it would be great to capture more information about it. [~mmitsuto] Can you do the same check, and also tell us which version of Spark you are using? > FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based > shuffle > - > > Key: SPARK-4105 > URL: https://issues.apache.org/jira/browse/SPARK-4105 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.1 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > Attachments: JavaObjectToSerialize.java, > SparkFailedToUncompressGenerator.scala > > > We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during > shuffle read. 
Here's a sample stacktrace from an executor: > {code} > 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID > 33053) > java.io.IOException: FAILED_TO_UNCOMPRESS(5) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) > at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) > at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) > at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) > at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) > at > org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > at > 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at >
[jira] [Commented] (SPARK-10433) Gradient boosted trees
[ https://issues.apache.org/jira/browse/SPARK-10433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731398#comment-14731398 ] Joseph K. Bradley commented on SPARK-10433: --- Has this been reported on 1.5? I've seen reports for 1.4, but was told by [~dbtsai] that 1.5 seems to have fixed this issue. I believe that the caching (and optional checkpointing) added in 1.5 fix this issue, but it would be great to get confirmation. > Gradient boosted trees > -- > > Key: SPARK-10433 > URL: https://issues.apache.org/jira/browse/SPARK-10433 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.4.1, 1.5.0 >Reporter: Sean Owen > > (Sorry to say I don't have any leads on a fix, but this was reported by three > different people and I confirmed it at fairly close range, so think it's > legitimate:) > This is probably best explained in the words from the mailing list thread at > http://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/%3C55E84380.2000408%40gmail.com%3E > . Matt Forbes says: > {quote} > I am training a boosted trees model on a couple million input samples (with > around 300 features) and am noticing that the input size of each stage is > increasing each iteration. For each new tree, the first step seems to be > building the decision tree metadata, which does a .count() on the input data, > so this is the step I've been using to track the input size changing. Here is > what I'm seeing: > {quote} > {code} > count at DecisionTreeMetadata.scala:111 > 1. Input Size / Records: 726.1 MB / 1295620 > 2. Input Size / Records: 106.9 GB / 64780816 > 3. Input Size / Records: 160.3 GB / 97171224 > 4. Input Size / Records: 214.8 GB / 129680959 > 5. Input Size / Records: 268.5 GB / 162533424 > > Input Size / Records: 1912.6 GB / 1382017686 > > {code} > {quote} > This step goes from taking less than 10s up to 5 minutes by the 15th or so > iteration. I'm not quite sure what could be causing this. 
I am passing a > memory-only cached RDD[LabeledPoint] to GradientBoostedTrees.train > {quote} > Johannes Bauer showed me a very similar problem. > Peter Rudenko offers this sketch of a reproduction: > {code} > val boostingStrategy = BoostingStrategy.defaultParams("Classification") > boostingStrategy.setNumIterations(30) > boostingStrategy.setLearningRate(1.0) > boostingStrategy.treeStrategy.setMaxDepth(3) > boostingStrategy.treeStrategy.setMaxBins(128) > boostingStrategy.treeStrategy.setSubsamplingRate(1.0) > boostingStrategy.treeStrategy.setMinInstancesPerNode(1) > boostingStrategy.treeStrategy.setUseNodeIdCache(true) > boostingStrategy.treeStrategy.setCategoricalFeaturesInfo( > > mapAsJavaMap(categoricalFeatures).asInstanceOf[java.util.Map[java.lang.Integer, > java.lang.Integer]]) > val model = GradientBoostedTrees.train(instances, boostingStrategy) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
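As context for the comment above: the caching and optional checkpointing added in 1.5 are meant to stop the per-iteration lineage (and thus input size) from growing with each new tree. A hedged sketch of enabling both on the spark.mllib API, under the assumption that this is the intended 1.5 usage; the checkpoint directory and the tiny dataset are illustrative only:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

val sc = new SparkContext(
  new SparkConf().setAppName("gbt-checkpoint-sketch").setMaster("local[2]"))
sc.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical path

// Tiny stand-in dataset; a real job would use the memory-cached
// RDD[LabeledPoint] described in the report.
val instances = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0)))).cache()

val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.setNumIterations(30)
// Node-id caching plus periodic checkpointing truncates the RDD lineage
// that otherwise accumulates one stage per boosting iteration.
boostingStrategy.treeStrategy.setUseNodeIdCache(true)
boostingStrategy.treeStrategy.setCheckpointInterval(10)

val model = GradientBoostedTrees.train(instances, boostingStrategy)
```

If confirmation on 1.5 is wanted, re-running the reporter's job with these two settings and watching the `count at DecisionTreeMetadata.scala` input size per iteration would be the direct check.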
[jira] [Assigned] (SPARK-10439) Catalyst should check for overflow / underflow of date and timestamp values
[ https://issues.apache.org/jira/browse/SPARK-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10439: Assignee: Apache Spark > Catalyst should check for overflow / underflow of date and timestamp values > --- > > Key: SPARK-10439 > URL: https://issues.apache.org/jira/browse/SPARK-10439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > > While testing some code, I noticed that a few methods in {{DateTimeUtils}} > are prone to overflow and underflow. > For example, {{millisToDays}} can overflow the return type ({{Int}}) if a > large enough input value is provided. > Similarly, {{fromJavaTimestamp}} converts milliseconds to microseconds, which > can overflow if the input is {{> Long.MAX_VALUE / 1000}} (or underflow in the > negative case). > There might be others but these were the ones that caught my eye. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10439) Catalyst should check for overflow / underflow of date and timestamp values
[ https://issues.apache.org/jira/browse/SPARK-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731392#comment-14731392 ] Apache Spark commented on SPARK-10439: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/8606 > Catalyst should check for overflow / underflow of date and timestamp values > --- > > Key: SPARK-10439 > URL: https://issues.apache.org/jira/browse/SPARK-10439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Marcelo Vanzin >Priority: Minor > > While testing some code, I noticed that a few methods in {{DateTimeUtils}} > are prone to overflow and underflow. > For example, {{millisToDays}} can overflow the return type ({{Int}}) if a > large enough input value is provided. > Similarly, {{fromJavaTimestamp}} converts milliseconds to microseconds, which > can overflow if the input is {{> Long.MAX_VALUE / 1000}} (or underflow in the > negative case). > There might be others but these were the ones that caught my eye. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10439) Catalyst should check for overflow / underflow of date and timestamp values
[ https://issues.apache.org/jira/browse/SPARK-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10439: Assignee: (was: Apache Spark) > Catalyst should check for overflow / underflow of date and timestamp values > --- > > Key: SPARK-10439 > URL: https://issues.apache.org/jira/browse/SPARK-10439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Marcelo Vanzin >Priority: Minor > > While testing some code, I noticed that a few methods in {{DateTimeUtils}} > are prone to overflow and underflow. > For example, {{millisToDays}} can overflow the return type ({{Int}}) if a > large enough input value is provided. > Similarly, {{fromJavaTimestamp}} converts milliseconds to microseconds, which > can overflow if the input is {{> Long.MAX_VALUE / 1000}} (or underflow in the > negative case). > There might be others but these were the ones that caught my eye. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
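The two overflows described in the issue can be reproduced without Spark at all; a standalone sketch of the arithmetic (the day-length and microsecond factors mirror what `millisToDays` and `fromJavaTimestamp` are described as doing, but this is plain Scala, not the Catalyst code):

```scala
// millisToDays-style conversion: the day count is returned as an Int,
// so a large-enough millisecond value silently wraps on the cast.
val hugeMillis = Long.MaxValue
val days = (hugeMillis / 86400000L).toInt // wraps to a negative number

// fromJavaTimestamp-style conversion: milliseconds -> microseconds
// overflows once millis > Long.MaxValue / 1000 (and underflows in the
// symmetric negative case).
val millis = Long.MaxValue / 1000 + 1
val micros = millis * 1000 // wraps past Long.MaxValue to a negative value

assert(days < 0)
assert(micros < 0)
```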
[jira] [Closed] (SPARK-10414) DenseMatrix gives different hashcode even though equals returns true
[ https://issues.apache.org/jira/browse/SPARK-10414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-10414. - Resolution: Duplicate This looks like a duplicate of an existing JIRA and PR. [~vinodkc], could you please close this and help review the existing PR? Thanks! > DenseMatrix gives different hashcode even though equals returns true > > > Key: SPARK-10414 > URL: https://issues.apache.org/jira/browse/SPARK-10414 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Vinod KC >Priority: Minor > > hashcode implementation in DenseMatrix gives different result for same input > val dm = Matrices.dense(2, 2, Array(0.0, 1.0, 2.0, 3.0)) > val dm1 = Matrices.dense(2, 2, Array(0.0, 1.0, 2.0, 3.0)) > assert(dm1 === dm) // passed > assert(dm1.hashCode === dm.hashCode) // Failed > This violates the hashCode/equals contract. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
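The contract violation above, and the standard fix being pursued in the existing PR, can be sketched with a minimal stand-in class (this is not the real `DenseMatrix`, just an illustration of deriving `hashCode` from the same fields `equals` compares):

```scala
// Minimal stand-in for a dense matrix: dimensions plus a values array.
class Mat(val numRows: Int, val numCols: Int, val values: Array[Double]) {
  override def equals(other: Any): Boolean = other match {
    case m: Mat =>
      numRows == m.numRows && numCols == m.numCols &&
        java.util.Arrays.equals(values, m.values)
    case _ => false
  }
  // Hash the same fields equals compares (Arrays.hashCode is content-based,
  // unlike the default Array hashCode), so equal matrices share a hash.
  override def hashCode: Int =
    31 * (31 * numRows + numCols) + java.util.Arrays.hashCode(values)
}

val dm = new Mat(2, 2, Array(0.0, 1.0, 2.0, 3.0))
val dm1 = new Mat(2, 2, Array(0.0, 1.0, 2.0, 3.0))
assert(dm1 == dm)                     // passes
assert(dm1.hashCode == dm.hashCode)   // now also passes
```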
[jira] [Assigned] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage
[ https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10454: Assignee: (was: Apache Spark) > Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause > multiple concurrent attempts for the same map stage > - > > Key: SPARK-10454 > URL: https://issues.apache.org/jira/browse/SPARK-10454 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.5.1 >Reporter: Pete Robbins >Priority: Minor > > test case fails intermittently in Jenkins. > For eg, see the following builds- > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/ > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage
[ https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731380#comment-14731380 ] Apache Spark commented on SPARK-10454: -- User 'robbinspg' has created a pull request for this issue: https://github.com/apache/spark/pull/8605 > Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause > multiple concurrent attempts for the same map stage > - > > Key: SPARK-10454 > URL: https://issues.apache.org/jira/browse/SPARK-10454 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.5.1 >Reporter: Pete Robbins >Priority: Minor > > test case fails intermittently in Jenkins. > For eg, see the following builds- > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/ > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage
[ https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10454: Assignee: Apache Spark > Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause > multiple concurrent attempts for the same map stage > - > > Key: SPARK-10454 > URL: https://issues.apache.org/jira/browse/SPARK-10454 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.5.1 >Reporter: Pete Robbins >Assignee: Apache Spark >Priority: Minor > > test case fails intermittently in Jenkins. > For eg, see the following builds- > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/ > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage
[ https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731377#comment-14731377 ] Pete Robbins commented on SPARK-10454: -- This is another case of not waiting for events to drain from the listenerBus. > Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause > multiple concurrent attempts for the same map stage > - > > Key: SPARK-10454 > URL: https://issues.apache.org/jira/browse/SPARK-10454 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.5.1 >Reporter: Pete Robbins >Priority: Minor > > test case fails intermittently in Jenkins. > For eg, see the following builds- > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/ > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage
Pete Robbins created SPARK-10454: Summary: Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage Key: SPARK-10454 URL: https://issues.apache.org/jira/browse/SPARK-10454 Project: Spark Issue Type: Bug Components: Scheduler, Spark Core Affects Versions: 1.5.1 Reporter: Pete Robbins Priority: Minor test case fails intermittently in Jenkins. For eg, see the following builds- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
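The diagnosis above (assertions racing the listener bus) has a conventional remedy inside Spark's own test suites: block until the bus has drained before asserting. A hedged sketch, assuming test-source access to `sc.listenerBus` as `DAGSchedulerSuite` has; `waitUntilEmpty` is a `private[spark]` test utility, so this pattern does not work from user code:

```scala
// Sketch only: drain pending SparkListener events before asserting,
// so the test does not observe partially delivered state.
val WAIT_TIMEOUT_MILLIS = 10000L
sc.listenerBus.waitUntilEmpty(WAIT_TIMEOUT_MILLIS)
// ...assertions against listener-collected state are now safe...
```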
[jira] [Resolved] (SPARK-10452) Pyspark worker security issue
[ https://issues.apache.org/jira/browse/SPARK-10452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-10452. Resolution: Not A Problem If you need your workers to run as your user, you need to configure YARN to use Kerberos. > Pyspark worker security issue > - > > Key: SPARK-10452 > URL: https://issues.apache.org/jira/browse/SPARK-10452 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.0 > Environment: Spark 1.4.0 running on hadoop 2.5.2. >Reporter: Michael Procopio >Priority: Critical > > The python worker launched by the executor is given the credentials used to > launch yarn. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10453) There's no way to use spark.dynamicAllocation.enabled with pyspark
[ https://issues.apache.org/jira/browse/SPARK-10453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-10453. Resolution: Not A Problem From http://spark.apache.org/docs/latest/running-on-yarn.html: {noformat} spark.yarn.executor.memoryOverhead executorMemory * 0.10, with minimum of 384 The amount of off heap memory (in megabytes) to be allocated per executor. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size (typically 6-10%). {noformat} That also encompasses the python workers. > There's no way to use spark.dynamicAllocation.enabled with pyspark > -- > > Key: SPARK-10453 > URL: https://issues.apache.org/jira/browse/SPARK-10453 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.0 > Environment: When using spark.dynamicAllocation.enabled, the > assumption is that memory/core resources will be mediated by the yarn > resource manager. Unfortunately, whatever value is used for > spark.executor.memory is consumed as JVM heap space by the executor. There's > no way to account for the memory requirements of the pyspark worker. > Executor JVM heap space should be decoupled from spark.executor.memory. >Reporter: Michael Procopio > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
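When the default 10% headroom quoted above is not enough for the Python workers, the overhead can be raised explicitly at submit time. A hedged example; the memory values and script name are purely illustrative:

```shell
# Give the executor JVM 1536m of heap and an explicit 1024 MB of
# off-heap headroom on YARN; the pyspark workers live in that overhead.
spark-submit \
  --master yarn \
  --conf spark.executor.memory=1536m \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  my_job.py
```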
[jira] [Created] (SPARK-10453) There's no way to use spark.dynamicAllocation.enabled with pyspark
Michael Procopio created SPARK-10453: Summary: There's no way to use spark.dynamicAllocation.enabled with pyspark Key: SPARK-10453 URL: https://issues.apache.org/jira/browse/SPARK-10453 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Environment: When using spark.dynamicAllocation.enabled, the assumption is that memory/core resources will be mediated by the yarn resource manager. Unfortunately, whatever value is used for spark.executor.memory is consumed as JVM heap space by the executor. There's no way to account for the memory requirements of the pyspark worker. Executor JVM heap space should be decoupled from spark.executor.memory. Reporter: Michael Procopio -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9666) ML 1.5 QA: model save/load audit
[ https://issues.apache.org/jira/browse/SPARK-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731299#comment-14731299 ] Joseph K. Bradley commented on SPARK-9666: -- Thanks for checking. Shall I mark this complete? > ML 1.5 QA: model save/load audit > > > Key: SPARK-9666 > URL: https://issues.apache.org/jira/browse/SPARK-9666 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: yuhao yang > > We should check to make sure no changes broke model import/export in > spark.mllib. > * If a model's name, data members, or constructors have changed _at all_, > then we likely need to support a new save/load format version. Different > versions must be tested in unit tests to ensure backwards compatibility > (i.e., verify we can load old model formats). > * Examples in the programming guide should include save/load when available. > It's important to try running each example in the guide whenever it is > modified (since there are no automated tests). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
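For context on what the audit above covers, the spark.mllib persistence API pairs an instance `save` with a companion `load`, and the QA item is that every released save format must stay loadable. A minimal hedged sketch using a directly constructed model (the path is hypothetical and a local-mode context is created just for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors

val sc = new SparkContext(
  new SparkConf().setAppName("save-load-sketch").setMaster("local[2]"))

// Build a trivial model directly rather than training one.
val model = new LogisticRegressionModel(Vectors.dense(0.5, -0.5), 0.1)

// Round-trip: save writes a versioned format; load must accept both the
// current version and formats written by older releases.
model.save(sc, "/tmp/lr-model") // hypothetical path
val reloaded = LogisticRegressionModel.load(sc, "/tmp/lr-model")
assert(reloaded.weights == model.weights)
```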
[jira] [Created] (SPARK-10452) Pyspark worker security issue
Michael Procopio created SPARK-10452: Summary: Pyspark worker security issue Key: SPARK-10452 URL: https://issues.apache.org/jira/browse/SPARK-10452 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Environment: Spark 1.4.0 running on hadoop 2.5.2. Reporter: Michael Procopio Priority: Critical The python worker launched by the executor is given the credentials used to launch yarn. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10451) Prevent unnecessary serializations in InMemoryColumnarTableScan
[ https://issues.apache.org/jira/browse/SPARK-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10451: Assignee: (was: Apache Spark) > Prevent unnecessary serializations in InMemoryColumnarTableScan > --- > > Key: SPARK-10451 > URL: https://issues.apache.org/jira/browse/SPARK-10451 > Project: Spark > Issue Type: Improvement >Reporter: Yash Datta > > In InMemoryColumnarTableScan, serialization of certain fields like > buildFilter, InMemoryRelation etc. can be avoided during task execution by > carefully managing the closure of mapPartitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10451) Prevent unnecessary serializations in InMemoryColumnarTableScan
[ https://issues.apache.org/jira/browse/SPARK-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731283#comment-14731283 ] Apache Spark commented on SPARK-10451: -- User 'saucam' has created a pull request for this issue: https://github.com/apache/spark/pull/8604 > Prevent unnecessary serializations in InMemoryColumnarTableScan > --- > > Key: SPARK-10451 > URL: https://issues.apache.org/jira/browse/SPARK-10451 > Project: Spark > Issue Type: Improvement >Reporter: Yash Datta > > In InMemoryColumnarTableScan, serialization of certain fields like > buildFilter, InMemoryRelation etc. can be avoided during task execution by > carefully managing the closure of mapPartitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10451) Prevent unnecessary serializations in InMemoryColumnarTableScan
[ https://issues.apache.org/jira/browse/SPARK-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10451: Assignee: Apache Spark > Prevent unnecessary serializations in InMemoryColumnarTableScan > --- > > Key: SPARK-10451 > URL: https://issues.apache.org/jira/browse/SPARK-10451 > Project: Spark > Issue Type: Improvement >Reporter: Yash Datta >Assignee: Apache Spark > > In InMemoryColumnarTableScan, serialization of certain fields like > buildFilter, InMemoryRelation etc. can be avoided during task execution by > carefully managing the closure of mapPartitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10451) Prevent unnecessary serializations in InMemoryColumnarTableScan
Yash Datta created SPARK-10451: -- Summary: Prevent unnecessary serializations in InMemoryColumnarTableScan Key: SPARK-10451 URL: https://issues.apache.org/jira/browse/SPARK-10451 Project: Spark Issue Type: Improvement Reporter: Yash Datta In InMemoryColumnarTableScan, serialization of certain fields like buildFilter, InMemoryRelation etc. can be avoided during task execution by carefully managing the closure of mapPartitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
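The fix proposed above is a standard Spark closure-hygiene pattern: copy just the fields a task needs into local vals before building the function handed to mapPartitions, so the closure captures those vals rather than the whole enclosing operator. A standalone sketch of the idea (this is an illustration of the pattern, not the actual InMemoryColumnarTableScan code):

```scala
// Stand-in for an operator holding heavyweight state; only `threshold`
// is needed inside the per-partition function.
class Scan(val threshold: Int, val heavyState: Array[Byte]) {
  def makeTask(): Iterator[Int] => Iterator[Int] = {
    // Referencing `threshold` directly inside the lambda would capture
    // `this`, dragging `heavyState` into every serialized task closure.
    val t = threshold // copy to a local val; only `t` is captured
    iter => iter.filter(_ > t)
  }
}

val scan = new Scan(10, new Array[Byte](1 << 20))
val task = scan.makeTask()
assert(task(Iterator(5, 15, 25)).toList == List(15, 25))
```

In real Spark code the same shape appears as local vals assigned immediately before `rdd.mapPartitions { ... }` so the closure cleaner has nothing extra to ship.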
[jira] [Assigned] (SPARK-10450) Minor SQL style, format, typo, readability fixes
[ https://issues.apache.org/jira/browse/SPARK-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10450: Assignee: Andrew Or (was: Apache Spark) > Minor SQL style, format, typo, readability fixes > > > Key: SPARK-10450 > URL: https://issues.apache.org/jira/browse/SPARK-10450 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > > This JIRA isn't exactly tied to one particular patch. Like SPARK-10003 it's > more of a continuous process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10450) Minor SQL style, format, typo, readability fixes
[ https://issues.apache.org/jira/browse/SPARK-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731211#comment-14731211 ] Apache Spark commented on SPARK-10450: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/8603 > Minor SQL style, format, typo, readability fixes > > > Key: SPARK-10450 > URL: https://issues.apache.org/jira/browse/SPARK-10450 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > > This JIRA isn't exactly tied to one particular patch. Like SPARK-10003 it's > more of a continuous process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10450) Minor SQL style, format, typo, readability fixes
[ https://issues.apache.org/jira/browse/SPARK-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10450: Assignee: Apache Spark (was: Andrew Or) > Minor SQL style, format, typo, readability fixes > > > Key: SPARK-10450 > URL: https://issues.apache.org/jira/browse/SPARK-10450 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Andrew Or >Assignee: Apache Spark >Priority: Minor > > This JIRA isn't exactly tied to one particular patch. Like SPARK-10003 it's > more of a continuous process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10450) Minor SQL style, format, typo, readability fixes
Andrew Or created SPARK-10450: - Summary: Minor SQL style, format, typo, readability fixes Key: SPARK-10450 URL: https://issues.apache.org/jira/browse/SPARK-10450 Project: Spark Issue Type: Improvement Components: SQL Reporter: Andrew Or Assignee: Andrew Or Priority: Minor This JIRA isn't exactly tied to one particular patch. Like SPARK-10003 it's more of a continuous process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10449) StructType.merge shouldn't merge DecimalTypes with different precisions and/or scales
Cheng Lian created SPARK-10449: -- Summary: StructType.merge shouldn't merge DecimalTypes with different precisions and/or scales Key: SPARK-10449 URL: https://issues.apache.org/jira/browse/SPARK-10449 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1, 1.3.1, 1.5.0 Reporter: Cheng Lian Schema merging should only handle struct fields. But currently we also reconcile decimal precision and scale information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
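The SPARK-10449 point can be made concrete with a small sketch (pure Python, not Spark's actual StructType/DecimalType API): a structural schema merge should fail fast when two DecimalTypes differ in precision or scale, rather than silently reconciling them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecimalType:
    """Minimal stand-in for a SQL decimal type (illustrative only)."""
    precision: int
    scale: int


def merge_decimal(a: DecimalType, b: DecimalType) -> DecimalType:
    # Desired behavior per the JIRA: merging handles struct fields only,
    # so mismatched decimal parameters are an error, not something to reconcile.
    if (a.precision, a.scale) != (b.precision, b.scale):
        raise ValueError(
            f"Cannot merge decimal({a.precision},{a.scale}) "
            f"with decimal({b.precision},{b.scale})"
        )
    return a
```

Identical decimal types merge to themselves; anything else raises, surfacing the schema conflict to the caller.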
[jira] [Commented] (SPARK-9925) Set SQLConf.SHUFFLE_PARTITIONS.key correctly for tests
[ https://issues.apache.org/jira/browse/SPARK-9925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731170#comment-14731170 ] Apache Spark commented on SPARK-9925: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/8602 > Set SQLConf.SHUFFLE_PARTITIONS.key correctly for tests > -- > > Key: SPARK-9925 > URL: https://issues.apache.org/jira/browse/SPARK-9925 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > > Right now, in our TestSQLContext/TestHiveContext, we use {{override def > numShufflePartitions: Int = this.getConf(SQLConf.SHUFFLE_PARTITIONS, 5)}} to > set {{SHUFFLE_PARTITIONS}}. However, we never put it into SQLConf. So, after we > use {{withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "number")}}, the number > of shuffle partitions will be set back to 200. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
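The SPARK-9925 save/restore bug can be modeled in a few lines of Python (names are illustrative, not Spark's actual API): if the test default of 5 only lives in an overridden getter and is never recorded in the conf, then restoring after a with-block unsets the key and the effective value falls back to the global default of 200. Recording the test default with set() fixes it.

```python
import contextlib

GLOBAL_DEFAULTS = {"spark.sql.shuffle.partitions": 200}


class SQLConf:
    """Toy stand-in for Spark's SQLConf (simplified model)."""

    def __init__(self):
        self.settings = {}

    def get(self, key):
        return self.settings.get(key, GLOBAL_DEFAULTS[key])

    def set(self, key, value):
        self.settings[key] = value

    def unset(self, key):
        self.settings.pop(key, None)


@contextlib.contextmanager
def with_sql_conf(conf, key, value):
    """Mirror of withSQLConf: set a value, then restore the prior state."""
    had_key = key in conf.settings
    old = conf.settings.get(key)
    conf.set(key, value)
    try:
        yield
    finally:
        if had_key:
            conf.set(key, old)  # the key was explicitly set before: put it back
        else:
            conf.unset(key)     # it was not set: fall back to the global default


# The fix: record the test default in the conf itself, so restore works.
conf = SQLConf()
conf.set("spark.sql.shuffle.partitions", 5)
```

With the default recorded, a with-block temporarily sees the overridden value and afterwards the conf reports 5 again rather than 200.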
[jira] [Commented] (SPARK-10447) Upgrade pyspark to use py4j 0.9
[ https://issues.apache.org/jira/browse/SPARK-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731164#comment-14731164 ] Justin Uang commented on SPARK-10447: - Agreed, I'm pretty sure that this will break some APIs and we'll have to fix those as we do the upgrade =). > Upgrade pyspark to use py4j 0.9 > --- > > Key: SPARK-10447 > URL: https://issues.apache.org/jira/browse/SPARK-10447 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.4.1 >Reporter: Justin Uang > > This was recently released, and it has many improvements, especially the > following: > {quote} > Python side: IDEs and interactive interpreters such as IPython can now get > help text/autocompletion for Java classes, objects, and members. This makes > Py4J an ideal tool to explore complex Java APIs (e.g., the Eclipse API). > Thanks to @jonahkichwacoders > {quote} > Normally we wrap all the APIs in Spark, but for the ones that aren't wrapped, this > would make it easier to go off-road by using the Java proxy objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org