[jira] [Commented] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields
[ https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731838#comment-14731838 ] Joseph K. Bradley commented on SPARK-9961: -- By "evaluator," I mean the Evaluator types in spark.ml.evaluation, which can be used for model selection. > ML prediction abstractions should have defaultEvaluator fields > -- > > Key: SPARK-9961 > URL: https://issues.apache.org/jira/browse/SPARK-9961 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > Predictor and PredictionModel should have abstract defaultEvaluator methods > which return Evaluators. Subclasses like Regressor, Classifier, etc. should > all provide natural evaluators, set to use the correct input columns and > metrics. Concrete classes may later be modified to > The initial implementation should be marked as DeveloperApi since we may need > to change the defaults later on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
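The proposed API shape can be sketched in miniature in plain Python (this is illustrative only, not Spark's actual Scala API; the class and method names `RMSEEvaluator` and `default_evaluator` are made up for the sketch):

```python
from abc import ABC, abstractmethod

class Evaluator(ABC):
    """Scores a model's predictions; analogue of spark.ml.evaluation.Evaluator."""
    @abstractmethod
    def evaluate(self, predictions):
        """Score a list of (label, prediction) pairs."""

class RMSEEvaluator(Evaluator):
    """Root-mean-squared error, a natural default metric for regression."""
    def evaluate(self, predictions):
        n = len(predictions)
        return (sum((label - pred) ** 2 for label, pred in predictions) / n) ** 0.5

class Regressor(ABC):
    def default_evaluator(self):
        # subclasses inherit a natural, preconfigured evaluator
        return RMSEEvaluator()

class LinearRegressor(Regressor):
    pass

# model selection code can now ask any regressor for its default metric
evaluator = LinearRegressor().default_evaluator()
```

The point of the abstraction is the last line: tuning code can obtain a sensible evaluator without knowing which concrete model it holds.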
[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save
[ https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731834#comment-14731834 ] Joseph K. Bradley commented on SPARK-10199: --- I agree that the work required for these changes is large compared to the small gains for most use cases. I could imagine allocating time to get this merged at some point in the future, but I don't think it can be prioritized right now. I'd recommend keeping your code branch for the future, but closing the PR and marking this JIRA to be addressed later. > Avoid using reflections for parquet model save > -- > > Key: SPARK-10199 > URL: https://issues.apache.org/jira/browse/SPARK-10199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Feynman Liang >Priority: Minor > > These items are not high priority since the overhead of writing to Parquet is > much greater than that of runtime reflection. > Multiple model save/load implementations in MLlib use case classes to infer a schema for the > data frame saved to Parquet. However, inferring a schema from case classes or > tuples uses [runtime > reflection|https://github.com/apache/spark/blob/d7b4c095271c36fcc7f9ded267ecf5ec66fac803/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L361], > which is unnecessary since the types are already known at the time `save` is > called. > It would be better to specify the schema for the data frame directly > using {{sqlContext.createDataFrame(dataRDD, schema)}}
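The trade-off the issue describes, inferring types by inspecting values at runtime versus declaring a schema that is already known, can be illustrated outside Spark with a pure-Python sketch (the field names here are invented for the example):

```python
def infer_schema(rows):
    # runtime "reflection": discover column types by inspecting
    # the values of the first row
    return {name: type(value) for name, value in rows[0].items()}

# The model writer already knows its column types at the time save()
# is called, so it can declare the schema directly and skip inspection.
KNOWN_SCHEMA = {"clusterIndex": int, "clusterCenter": list}

rows = [{"clusterIndex": 0, "clusterCenter": [0.1, 0.2]}]
assert infer_schema(rows) == KNOWN_SCHEMA
```

Both paths produce the same schema; declaring it up front simply avoids the inspection work, which is the gain (small relative to Parquet I/O) weighed in the comment above.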
[jira] [Commented] (SPARK-5337) respect spark.task.cpus when launch executors
[ https://issues.apache.org/jira/browse/SPARK-5337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731826#comment-14731826 ] Apache Spark commented on SPARK-5337: - User 'CodingCat' has created a pull request for this issue: https://github.com/apache/spark/pull/8610 > respect spark.task.cpus when launch executors > - > > Key: SPARK-5337 > URL: https://issues.apache.org/jira/browse/SPARK-5337 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.0.0 >Reporter: Tao Wang > > In standalone mode, we did not respect spark.task.cpus when launching executors. > Some executors would not have enough cores to launch a single task.
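The bug reduces to simple arithmetic: an executor can run floor(executor cores / spark.task.cpus) tasks concurrently, so an executor granted fewer cores than spark.task.cpus can never launch a task. A minimal sketch of that calculation (illustrative, not the scheduler's code):

```python
def concurrent_task_slots(executor_cores, task_cpus):
    # number of tasks an executor can actually run at once when each
    # task requires task_cpus cores (spark.task.cpus)
    return executor_cores // task_cpus

# An executor granted 1 core with spark.task.cpus=2 is useless: 0 slots.
# One granted 3 cores runs only 1 task and wastes a core.
stranded = concurrent_task_slots(1, 2)   # 0 -- the bug described above
partial = concurrent_task_slots(3, 2)    # 1, with one idle core
```

Respecting spark.task.cpus at launch time means only handing out core allocations that are usable multiples of the per-task requirement.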
[jira] [Comment Edited] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields
[ https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731810#comment-14731810 ] George Dittmar edited comment on SPARK-9961 at 9/5/15 5:23 AM: --- Can you expand on what you mean by Evaluator? Just looking for something to eval how good predictions are? was (Author: georgedittmar): Can you expand on what you mean by Evaluator? Just looking for something to eval how good predictions are? > ML prediction abstractions should have defaultEvaluator fields > -- > > Key: SPARK-9961 > URL: https://issues.apache.org/jira/browse/SPARK-9961 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > Predictor and PredictionModel should have abstract defaultEvaluator methods > which return Evaluators. Subclasses like Regressor, Classifier, etc. should > all provide natural evaluators, set to use the correct input columns and > metrics. Concrete classes may later be modified to > The initial implementation should be marked as DeveloperApi since we may need > to change the defaults later on.
[jira] [Commented] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields
[ https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731810#comment-14731810 ] George Dittmar commented on SPARK-9961: --- Can you expand on what you mean by Evaluator? Just looking for something to eval how good predictions are? > ML prediction abstractions should have defaultEvaluator fields > -- > > Key: SPARK-9961 > URL: https://issues.apache.org/jira/browse/SPARK-9961 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > Predictor and PredictionModel should have abstract defaultEvaluator methods > which return Evaluators. Subclasses like Regressor, Classifier, etc. should > all provide natural evaluators, set to use the correct input columns and > metrics. Concrete classes may later be modified to > The initial implementation should be marked as DeveloperApi since we may need > to change the defaults later on.
[jira] [Assigned] (SPARK-8632) Poor Python UDF performance because of RDD caching
[ https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-8632: - Assignee: Davies Liu > Poor Python UDF performance because of RDD caching > -- > > Key: SPARK-8632 > URL: https://issues.apache.org/jira/browse/SPARK-8632 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.0 >Reporter: Justin Uang >Assignee: Davies Liu > > {quote} > We have been running into performance problems using Python UDFs with > DataFrames at large scale. > From the implementation of BatchPythonEvaluation, it looks like the goal was > to reuse the PythonRDD code. It caches the entire child RDD so that it can do > two passes over the data. One to give to the PythonRDD, then one to join the > python lambda results with the original row (which may have java objects that > should be passed through). > In addition, it caches all the columns, even the ones that don't need to be > processed by the Python UDF. In the cases I was working with, I had a 500 > column table, and I wanted to use a python UDF for one column, and it ended > up caching all 500 columns. > {quote} > http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html
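The column-pruning idea implicit in the complaint (a 500-column table cached to evaluate a UDF on one column) can be sketched in plain Python, not Spark's implementation: extract only the column the UDF needs, evaluate it, then zip the results back onto the original rows.

```python
def apply_udf_pruned(rows, udf, col, out_col):
    """Evaluate udf over just rows[*][col]; never materialize a copy
    of the other columns for the UDF evaluation pass."""
    # only the needed column crosses the (hypothetical) JVM/Python boundary
    results = [udf(row[col]) for row in rows]
    # join results back with the untouched original rows
    return [dict(row, **{out_col: r}) for row, r in zip(rows, results)]

rows = [{"a": 1, "b": 2}, {"a": 10, "b": 20}]
out = apply_udf_pruned(rows, lambda x: x + 1, "a", "a_plus_one")
```

The original rows (including columns the UDF never touches) are passed through unchanged, which is the behavior BatchPythonEvaluation's full-RDD caching paid for much more expensively.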
[jira] [Created] (SPARK-10459) PythonUDF could process UnsafeRow
Davies Liu created SPARK-10459: -- Summary: PythonUDF could process UnsafeRow Key: SPARK-10459 URL: https://issues.apache.org/jira/browse/SPARK-10459 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Currently, a ConvertToSafe step is inserted for PythonUDF, which is not actually needed.
[jira] [Commented] (SPARK-10414) DenseMatrix gives different hashcode even though equals returns true
[ https://issues.apache.org/jira/browse/SPARK-10414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731757#comment-14731757 ] Vinod KC commented on SPARK-10414: -- Thanks, got the JIRA id: https://issues.apache.org/jira/browse/SPARK-9919 > DenseMatrix gives different hashcode even though equals returns true > > > Key: SPARK-10414 > URL: https://issues.apache.org/jira/browse/SPARK-10414 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Vinod KC >Priority: Minor > > The hashCode implementation in DenseMatrix gives different results for the same input: > val dm = Matrices.dense(2, 2, Array(0.0, 1.0, 2.0, 3.0)) > val dm1 = Matrices.dense(2, 2, Array(0.0, 1.0, 2.0, 3.0)) > assert(dm1 === dm) // passed > assert(dm1.hashCode === dm.hashCode) // Failed > This violates the hashCode/equals contract.
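The contract being violated is the standard one: objects that compare equal must have equal hash codes. A toy Python analogue (not Spark's DenseMatrix) shows the usual way to keep the two consistent, by deriving the hash from exactly the fields the equality check compares:

```python
class DenseMatrix:
    """Toy stand-in for illustrating the hashCode/equals contract."""
    def __init__(self, num_rows, num_cols, values):
        self.num_rows, self.num_cols = num_rows, num_cols
        self.values = tuple(values)  # immutable, hence hashable

    def __eq__(self, other):
        return (isinstance(other, DenseMatrix)
                and (self.num_rows, self.num_cols, self.values)
                == (other.num_rows, other.num_cols, other.values))

    def __hash__(self):
        # hash exactly the fields compared in __eq__, nothing else
        return hash((self.num_rows, self.num_cols, self.values))

dm = DenseMatrix(2, 2, [0.0, 1.0, 2.0, 3.0])
dm1 = DenseMatrix(2, 2, [0.0, 1.0, 2.0, 3.0])
```

With this pairing, equal matrices necessarily hash alike, so they behave correctly as dictionary keys and set members.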
[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save
[ https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731753#comment-14731753 ] Vinod KC commented on SPARK-10199: -- [~mengxr] Thanks for the suggestion. Shall I close the PR? > Avoid using reflections for parquet model save > -- > > Key: SPARK-10199 > URL: https://issues.apache.org/jira/browse/SPARK-10199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Feynman Liang >Priority: Minor > > These items are not high priority since the overhead of writing to Parquet is > much greater than that of runtime reflection. > Multiple model save/load implementations in MLlib use case classes to infer a schema for the > data frame saved to Parquet. However, inferring a schema from case classes or > tuples uses [runtime > reflection|https://github.com/apache/spark/blob/d7b4c095271c36fcc7f9ded267ecf5ec66fac803/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L361], > which is unnecessary since the types are already known at the time `save` is > called. > It would be better to specify the schema for the data frame directly > using {{sqlContext.createDataFrame(dataRDD, schema)}}
[jira] [Commented] (SPARK-10414) DenseMatrix gives different hashcode even though equals returns true
[ https://issues.apache.org/jira/browse/SPARK-10414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731745#comment-14731745 ] Vinod KC commented on SPARK-10414: -- [~josephkb] Could you please share that existing JIRA id so I can review the PR? Thanks > DenseMatrix gives different hashcode even though equals returns true > > > Key: SPARK-10414 > URL: https://issues.apache.org/jira/browse/SPARK-10414 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Vinod KC >Priority: Minor > > The hashCode implementation in DenseMatrix gives different results for the same input: > val dm = Matrices.dense(2, 2, Array(0.0, 1.0, 2.0, 3.0)) > val dm1 = Matrices.dense(2, 2, Array(0.0, 1.0, 2.0, 3.0)) > assert(dm1 === dm) // passed > assert(dm1.hashCode === dm.hashCode) // Failed > This violates the hashCode/equals contract.
[jira] [Commented] (SPARK-7257) Find nearest neighbor satisfying predicate
[ https://issues.apache.org/jira/browse/SPARK-7257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731744#comment-14731744 ] Luvsandondov Lkhamsuren commented on SPARK-7257: This sounds very interesting! If I understood correctly, there may be multiple vertices satisfying the predicate (let's call that set P, a subset of V), and we want to find the vertices in P that are closest. Is it guaranteed that |P| << |V|? What is the use case you had in mind, [~josephkb]? > Find nearest neighbor satisfying predicate > -- > > Key: SPARK-7257 > URL: https://issues.apache.org/jira/browse/SPARK-7257 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: Joseph K. Bradley >Priority: Minor > > It would be useful to be able to find nearest neighbors satisfying > predicates. E.g.: > * Given one or more starting vertices, plus a predicate. > * Find the closest vertex or vertices satisfying the predicate. > This is different from ShortestPaths in that ShortestPaths searches for a > fixed (small) set of vertices, rather than all vertices satisfying a > predicate (which could be a large set). > It could be implemented using BFS from the initial vertex/vertices, though > faster implementations might also search from vertices satisfying the > predicate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
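The BFS approach mentioned in the description can be sketched for a small in-memory graph (plain Python, not a GraphX/Pregel implementation): expand level by level from the starting vertices and return the first vertex that satisfies the predicate, which by BFS's level order is also a closest one.

```python
from collections import deque

def nearest_satisfying(adj, starts, predicate):
    """BFS from `starts` over adjacency dict `adj`; return (vertex, distance)
    for a closest vertex satisfying `predicate`, or None if unreachable."""
    seen = set(starts)
    queue = deque((v, 0) for v in starts)
    while queue:
        v, dist = queue.popleft()
        if predicate(v):
            return v, dist  # BFS order guarantees this is a nearest match
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append((w, dist + 1))
    return None

adj = {1: [2, 3], 2: [4], 3: [], 4: []}
nearest_even = nearest_satisfying(adj, [1], lambda v: v % 2 == 0)
```

The comment's question about |P| versus |V| matters for the alternative strategy the issue mentions, searching backward from the predicate-satisfying set, which only pays off when that set is small.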
[jira] [Resolved] (SPARK-9925) Set SQLConf.SHUFFLE_PARTITIONS.key correctly for tests
[ https://issues.apache.org/jira/browse/SPARK-9925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-9925. -- Resolution: Fixed Fix Version/s: 1.6.0 > Set SQLConf.SHUFFLE_PARTITIONS.key correctly for tests > -- > > Key: SPARK-9925 > URL: https://issues.apache.org/jira/browse/SPARK-9925 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 1.6.0 > > > Right now, in our TestSQLContext/TestHiveContext, we use {{override def > numShufflePartitions: Int = this.getConf(SQLConf.SHUFFLE_PARTITIONS, 5)}} to > set {{SHUFFLE_PARTITIONS}}. However, we never put it to SQLConf. So, after we > use {{withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "number")}}, the number > of shuffle partitions will be set back to 200.
[jira] [Assigned] (SPARK-9963) ML RandomForest cleanup: replace predictNodeIndex with predictImpl
[ https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9963: --- Assignee: (was: Apache Spark) > ML RandomForest cleanup: replace predictNodeIndex with predictImpl > -- > > Key: SPARK-9963 > URL: https://issues.apache.org/jira/browse/SPARK-9963 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Trivial > Labels: starter > > Replace ml.tree.impl.RandomForest.predictNodeIndex with Node.predictImpl. > This should be straightforward, but please ping me if anything is unclear.
[jira] [Commented] (SPARK-9963) ML RandomForest cleanup: replace predictNodeIndex with predictImpl
[ https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731702#comment-14731702 ] Apache Spark commented on SPARK-9963: - User 'lkhamsurenl' has created a pull request for this issue: https://github.com/apache/spark/pull/8609 > ML RandomForest cleanup: replace predictNodeIndex with predictImpl > -- > > Key: SPARK-9963 > URL: https://issues.apache.org/jira/browse/SPARK-9963 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Trivial > Labels: starter > > Replace ml.tree.impl.RandomForest.predictNodeIndex with Node.predictImpl. > This should be straightforward, but please ping me if anything is unclear.
[jira] [Assigned] (SPARK-9963) ML RandomForest cleanup: replace predictNodeIndex with predictImpl
[ https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9963: --- Assignee: Apache Spark > ML RandomForest cleanup: replace predictNodeIndex with predictImpl > -- > > Key: SPARK-9963 > URL: https://issues.apache.org/jira/browse/SPARK-9963 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Trivial > Labels: starter > > Replace ml.tree.impl.RandomForest.predictNodeIndex with Node.predictImpl. > This should be straightforward, but please ping me if anything is unclear.
[jira] [Comment Edited] (SPARK-8630) Prevent from checkpointing QueueInputDStream
[ https://issues.apache.org/jira/browse/SPARK-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731671#comment-14731671 ] Tathagata Das edited comment on SPARK-8630 at 9/5/15 12:50 AM: --- @ all in this thread: Are you having this problem with 1.4.1 or 1.5.0 (in RC3)? was (Author: tdas): @ all in this thread: Are you having this problem with 1.4.1 or 1.5.0? > Prevent from checkpointing QueueInputDStream > > > Key: SPARK-8630 > URL: https://issues.apache.org/jira/browse/SPARK-8630 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.4.1, 1.5.0 > > > It's better to prevent from checkpointing QueueInputDStream rather than > failing the application when recovering `QueueInputDStream`, so that people > can find the issue as soon as possible. See SPARK-8553 for example.
[jira] [Commented] (SPARK-8630) Prevent from checkpointing QueueInputDStream
[ https://issues.apache.org/jira/browse/SPARK-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731671#comment-14731671 ] Tathagata Das commented on SPARK-8630: -- @ all in this thread: Are you having this problem with 1.4.1 or 1.5.0? > Prevent from checkpointing QueueInputDStream > > > Key: SPARK-8630 > URL: https://issues.apache.org/jira/browse/SPARK-8630 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.4.1, 1.5.0 > > > It's better to prevent from checkpointing QueueInputDStream rather than > failing the application when recovering `QueueInputDStream`, so that people > can find the issue as soon as possible. See SPARK-8553 for example.
[jira] [Comment Edited] (SPARK-9235) PYSPARK_DRIVER_PYTHON env variable is not set on the YARN Node manager acting as driver in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-9235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731662#comment-14731662 ] bijaya edited comment on SPARK-9235 at 9/5/15 12:41 AM: SPARK_YARN_USER_ENV="PYSPARK_PYTHON= for example. spark-submit --master yarn --deploy-mode cluster --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON= --num-executors 2 --driver-memory 2g --executor-memory 1g --executor-cores 1 .py was (Author: bijaya): SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/srv/software/anaconda/bin/python" works if we are submitting job to yarn_client mode for yarn_cluster mode we have to insert env variable via spark.yarn.appMasterEnv. for example. spark-submit --master yarn --deploy-mode cluster --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON= --num-executors 2 --driver-memory 2g --executor-memory 1g --executor-cores 1 .py > PYSPARK_DRIVER_PYTHON env variable is not set on the YARN Node manager acting > as driver in yarn-cluster mode > > > Key: SPARK-9235 > URL: https://issues.apache.org/jira/browse/SPARK-9235 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.1, 1.5.0 > Environment: CentOS 6.6, python 2.7, Spark 1.4.1 tagged version, YARN > Cluster Manager, CDH 5.4.1 (Hadoop 2.6.0++), Java 1.7 >Reporter: Aaron Glahe >Priority: Minor > > Relates to SPARK-9229 > Env: Spark on YARN, Java 1.7, Centos 6.6, CDH 5.4.1 (Hadoop 2.6.0++), > Anaconda Python 2.7.10 "installed" in /srv/software directory > On a client/submitting machine, we set the PYSPARK_DRIVER_PYTHON env var in > spark-env.sh that pointed the anaconda python executable, which was on every > YARN node: > export PYSPARK_DRIVER_PYTHON='/srv/software/anaconda/bin/python' > side note, export PYSPARK_PYTHON='/srv/software/anaconda/bin/python' was set > as well in the spark-env.sh. 
> run the command: > spark-submit test.py --master yarn --deploy-mode cluster > It appears as though the Node Manager with the DRIVER does not use the > PYSPARK_DRIVER_PYTHON env python, but instead uses the CentOS system default > (which in this case is python 2.6). > Workaround appears to be setting the python path in SPARK_YARN_USER_ENV
[jira] [Commented] (SPARK-9235) PYSPARK_DRIVER_PYTHON env variable is not set on the YARN Node manager acting as driver in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-9235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731662#comment-14731662 ] bijaya commented on SPARK-9235: --- SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/srv/software/anaconda/bin/python" works if we are submitting job to yarn_client mode for yarn_cluster mode we have to insert env variable via spark.yarn.appMasterEnv. for example. spark-submit --master yarn --deploy-mode cluster --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON= --num-executors 2 --driver-memory 2g --executor-memory 1g --executor-cores 1 .py > PYSPARK_DRIVER_PYTHON env variable is not set on the YARN Node manager acting > as driver in yarn-cluster mode > > > Key: SPARK-9235 > URL: https://issues.apache.org/jira/browse/SPARK-9235 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.1, 1.5.0 > Environment: CentOS 6.6, python 2.7, Spark 1.4.1 tagged version, YARN > Cluster Manager, CDH 5.4.1 (Hadoop 2.6.0++), Java 1.7 >Reporter: Aaron Glahe >Priority: Minor > > Relates to SPARK-9229 > Env: Spark on YARN, Java 1.7, Centos 6.6, CDH 5.4.1 (Hadoop 2.6.0++), > Anaconda Python 2.7.10 "installed" in /srv/software directory > On a client/submitting machine, we set the PYSPARK_DRIVER_PYTHON env var in > spark-env.sh that pointed the anaconda python executable, which was on every > YARN node: > export PYSPARK_DRIVER_PYTHON='/srv/software/anaconda/bin/python' > side note, export PYSPARK_PYTHON='/srv/software/anaconda/bin/python' was set > as well in the spark-env.sh. > run the command: > spark-submit test.py --master yarn --deploy-mode cluster > It appears as though the Node Manager with the DRIVER does not use the > PYSPARK_DRIVER_PYTHON env python, but instead uses the CentOS system default > (which in this case is python 2.6). 
> Workaround appears to be setting the python path in SPARK_YARN_USER_ENV
[jira] [Resolved] (SPARK-10402) Add scaladoc for default values of params in ML
[ https://issues.apache.org/jira/browse/SPARK-10402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-10402. --- Resolution: Fixed Fix Version/s: 1.5.1 1.6.0 Issue resolved by pull request 8591 [https://github.com/apache/spark/pull/8591] > Add scaladoc for default values of params in ML > --- > > Key: SPARK-10402 > URL: https://issues.apache.org/jira/browse/SPARK-10402 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: holdenk >Assignee: holdenk >Priority: Minor > Fix For: 1.6.0, 1.5.1 > > > We should make sure the scaladoc for params includes their default values > through the models in ml/
[jira] [Created] (SPARK-10458) Would like to know if a given Spark Context is stopped or currently stopping
Matt Cheah created SPARK-10458: -- Summary: Would like to know if a given Spark Context is stopped or currently stopping Key: SPARK-10458 URL: https://issues.apache.org/jira/browse/SPARK-10458 Project: Spark Issue Type: Improvement Reporter: Matt Cheah Priority: Minor I ran into a case where a thread stopped a Spark Context, specifically when I hit the "kill" link from the Spark standalone UI. There was no real way for another thread to know that the context had stopped and thus should have handled that accordingly. Checking that the SparkEnv is null is one way, but that doesn't handle the case where the context is in the midst of stopping, and stopping the context may actually not be instantaneous - in my case for some reason the DAGScheduler was taking a non-trivial amount of time to stop. Implementation-wise, I'm more or less requesting that the boolean value returned from SparkContext.stopped.get() be made visible in some way. As long as we return the value and not the AtomicBoolean itself (we wouldn't want anyone to be setting this, after all!) it would help client applications check the context's liveness.
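The requested API shape can be sketched in plain Python (illustrative only, not Spark's code, which uses a Scala AtomicBoolean): expose the flag's current value read-only, never the mutable flag itself.

```python
import threading

class StoppableContext:
    """Toy context exposing a read-only 'stopped' view of an internal flag."""
    def __init__(self):
        self._stopped = threading.Event()  # internal, mutable

    def stop(self):
        # possibly slow teardown would happen here before/while flipping the flag
        self._stopped.set()

    @property
    def is_stopped(self):
        # return the value, not the underlying Event, so callers can
        # observe the state but never set it themselves
        return self._stopped.is_set()
```

Other threads can then poll `ctx.is_stopped` instead of inferring liveness from side effects like a null environment.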
[jira] [Assigned] (SPARK-10397) Make Python's SparkContext self-descriptive on "print sc"
[ https://issues.apache.org/jira/browse/SPARK-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10397: Assignee: Apache Spark > Make Python's SparkContext self-descriptive on "print sc" > - > > Key: SPARK-10397 > URL: https://issues.apache.org/jira/browse/SPARK-10397 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.4.0 >Reporter: Sergey Tryuber >Assignee: Apache Spark >Priority: Trivial > > When I execute in Python shell: > {code} > print sc > {code} > I receive something like: > {noformat} > > {noformat} > But this is very inconvenient, especially if a user wants to create a > good-looking and self-descriptive IPython Notebook. He would like to see some > information about his Spark cluster. > In contrast, H2O context does have this feature and it is very helpful.
[jira] [Assigned] (SPARK-10397) Make Python's SparkContext self-descriptive on "print sc"
[ https://issues.apache.org/jira/browse/SPARK-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10397: Assignee: (was: Apache Spark) > Make Python's SparkContext self-descriptive on "print sc" > - > > Key: SPARK-10397 > URL: https://issues.apache.org/jira/browse/SPARK-10397 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.4.0 >Reporter: Sergey Tryuber >Priority: Trivial > > When I execute in Python shell: > {code} > print sc > {code} > I receive something like: > {noformat} > > {noformat} > But this is very inconvenient, especially if a user wants to create a > good-looking and self-descriptive IPython Notebook. He would like to see some > information about his Spark cluster. > In contrast, H2O context does have this feature and it is very helpful.
[jira] [Commented] (SPARK-10397) Make Python's SparkContext self-descriptive on "print sc"
[ https://issues.apache.org/jira/browse/SPARK-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731618#comment-14731618 ] Apache Spark commented on SPARK-10397: -- User 'alexrovner' has created a pull request for this issue: https://github.com/apache/spark/pull/8608 > Make Python's SparkContext self-descriptive on "print sc" > - > > Key: SPARK-10397 > URL: https://issues.apache.org/jira/browse/SPARK-10397 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.4.0 >Reporter: Sergey Tryuber >Priority: Trivial > > When I execute in Python shell: > {code} > print sc > {code} > I receive something like: > {noformat} > > {noformat} > But this is very inconvenient, especially if a user wants to create a > good-looking and self-descriptive IPython Notebook. He would like to see some > information about his Spark cluster. > In contrast, H2O context does have this feature and it is very helpful.
[jira] [Commented] (SPARK-10397) Make Python's SparkContext self-descriptive on "print sc"
[ https://issues.apache.org/jira/browse/SPARK-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731619#comment-14731619 ] Alex Rovner commented on SPARK-10397: - Pull: https://github.com/apache/spark/pull/8608 {noformat} >>> sc {'_accumulatorServer': , '_batchSize': 0, '_callsite': CallSite(function='', file='/Users/alex.rovner/git/spark/python/pyspark/shell.py', linenum=43), '_conf': {'_jconf': JavaObject id=o0}, '_javaAccumulator': JavaObject id=o11, '_jsc': JavaObject id=o8, '_pickled_broadcast_vars': set([]), '_python_includes': [], '_temp_dir': u'/private/var/folders/hj/v4zb0_f159q8mt4w3j8m2_mrgp/T/spark-a9cc47a9-db90-49a3-a82e-263f0b56268c/pyspark-773c7490-2b2d-4418-a030-256a5b9c1fe1', '_unbatched_serializer': PickleSerializer(), 'appName': u'PySparkShell', 'environment': {}, 'master': u'local[*]', 'profiler_collector': None, 'pythonExec': 'python2.7', 'pythonVer': '2.7', 'serializer': AutoBatchedSerializer(PickleSerializer()), 'sparkHome': None} >>> print sc {'_accumulatorServer': , '_batchSize': 0, '_callsite': CallSite(function='', file='/Users/alex.rovner/git/spark/python/pyspark/shell.py', linenum=43), '_conf': {'_jconf': JavaObject id=o0}, '_javaAccumulator': JavaObject id=o11, '_jsc': JavaObject id=o8, '_pickled_broadcast_vars': set([]), '_python_includes': [], '_temp_dir': u'/private/var/folders/hj/v4zb0_f159q8mt4w3j8m2_mrgp/T/spark-a9cc47a9-db90-49a3-a82e-263f0b56268c/pyspark-773c7490-2b2d-4418-a030-256a5b9c1fe1', '_unbatched_serializer': PickleSerializer(), 'appName': u'PySparkShell', 'environment': {}, 'master': u'local[*]', 'profiler_collector': None, 'pythonExec': 'python2.7', 'pythonVer': '2.7', 'serializer': AutoBatchedSerializer(PickleSerializer()), 'sparkHome': None} >>> {noformat} > Make Python's SparkContext self-descriptive on "print sc" > - > > Key: SPARK-10397 > URL: https://issues.apache.org/jira/browse/SPARK-10397 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects 
Versions: 1.4.0 >Reporter: Sergey Tryuber >Priority: Trivial > > When I execute in Python shell: > {code} > print sc > {code} > I receive something like: > {noformat} > > {noformat} > But this is very inconvenient, especially if a user wants to create a > good-looking and self-descriptive IPython Notebook. He would like to see some > information about his Spark cluster. > In contrast, H2O context does have this feature and it is very helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
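A fix along the lines of this ticket would give `SparkContext` a `__repr__` that summarizes the context instead of dumping `__dict__`. The sketch below uses a stand-in class (`ExampleContext` is hypothetical, not the real pyspark class) purely to illustrate the idea:

```python
class ExampleContext:
    """Stand-in for pyspark.context.SparkContext, used only to illustrate
    the kind of self-descriptive __repr__ the ticket asks for."""

    def __init__(self, master, app_name, version):
        self.master = master
        self.appName = app_name
        self.version = version

    def __repr__(self):
        # A short, readable summary instead of the default object repr.
        return "<SparkContext master={0} appName={1} version={2}>".format(
            self.master, self.appName, self.version)

sc = ExampleContext("local[*]", "PySparkShell", "1.6.0")
print(sc)  # <SparkContext master=local[*] appName=PySparkShell version=1.6.0>
```

Since `__str__` falls back to `__repr__`, both `sc` at the REPL prompt and `print sc` would show the summary, which is what an IPython Notebook user would see inline.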
[jira] [Commented] (SPARK-10447) Upgrade pyspark to use py4j 0.9
[ https://issues.apache.org/jira/browse/SPARK-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731598#comment-14731598 ] Justin Uang commented on SPARK-10447: - Sounds good > Upgrade pyspark to use py4j 0.9 > --- > > Key: SPARK-10447 > URL: https://issues.apache.org/jira/browse/SPARK-10447 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.4.1 >Reporter: Justin Uang > > This was recently released, and it has many improvements, especially the > following: > {quote} > Python side: IDEs and interactive interpreters such as IPython can now get > help text/autocompletion for Java classes, objects, and members. This makes > Py4J an ideal tool to explore complex Java APIs (e.g., the Eclipse API). > Thanks to @jonahkichwacoders > {quote} > Normally we wrap all the APIs in spark, but for the ones that aren't, this > would make it easier to offroad by using the java proxy objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10447) Upgrade pyspark to use py4j 0.9
[ https://issues.apache.org/jira/browse/SPARK-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731597#comment-14731597 ] holdenk commented on SPARK-10447: - Sure, I'll ping you when I've got the PR ready (probably sometime this long weekend) if that's good for you? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10436) spark-submit overwrites spark.files defaults with the job script filename
[ https://issues.apache.org/jira/browse/SPARK-10436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731593#comment-14731593 ] holdenk commented on SPARK-10436: - I've worked on some of the older versions of the submit scripts and should probably refresh my knowledge, I can tackle this unless someone else is already planning on it :) > spark-submit overwrites spark.files defaults with the job script filename > - > > Key: SPARK-10436 > URL: https://issues.apache.org/jira/browse/SPARK-10436 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.4.0 > Environment: Ubuntu, Spark 1.4.0 Standalone >Reporter: axel dahl >Priority: Minor > Labels: easyfix, feature > > In my spark-defaults.conf I have configured a set of libraries to be > uploaded to my Spark 1.4.0 Standalone cluster. The entry appears as: > spark.files libarary.zip,file1.py,file2.py > When I execute spark-submit -v test.py > I see that spark-submit reads the defaults correctly, but that it overwrites > the "spark.files" default entry and replaces it with the name of the job > script, i.e. "test.py". > This behavior doesn't seem intuitive. test.py should be added to the spark > working folder, but it should not overwrite the "spark.files" defaults. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
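The behaviour the reporter expects, appending the job script to the configured `spark.files` rather than replacing the list, can be sketched as follows. The helper name `merge_spark_files` is hypothetical, not actual spark-submit code:

```python
def merge_spark_files(defaults, job_script):
    # Intended behaviour per the report: keep the configured defaults
    # from spark-defaults.conf and append the job script, rather than
    # overwriting the comma-separated list.
    configured = [f for f in defaults.split(",") if f]
    return ",".join(configured + [job_script])

print(merge_spark_files("library.zip,file1.py,file2.py", "test.py"))
# library.zip,file1.py,file2.py,test.py
```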
[jira] [Commented] (SPARK-10447) Upgrade pyspark to use py4j 0.9
[ https://issues.apache.org/jira/browse/SPARK-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731592#comment-14731592 ] Justin Uang commented on SPARK-10447: - Sure, I wouldn't mind doing the code review. Can you add me? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10447) Upgrade pyspark to use py4j 0.9
[ https://issues.apache.org/jira/browse/SPARK-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731591#comment-14731591 ] holdenk commented on SPARK-10447: - I can give this a shot if no one else is interested in doing this (I've been wrangling some py4j bits with Sparkling Pandas). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10013) Remove Java assert from Java unit tests
[ https://issues.apache.org/jira/browse/SPARK-10013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10013: Assignee: Apache Spark > Remove Java assert from Java unit tests > --- > > Key: SPARK-10013 > URL: https://issues.apache.org/jira/browse/SPARK-10013 > Project: Spark > Issue Type: Test > Components: ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Apache Spark > > We should use assertTrue, etc. instead to make sure the asserts are not > ignored in tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10013) Remove Java assert from Java unit tests
[ https://issues.apache.org/jira/browse/SPARK-10013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10013: Assignee: (was: Apache Spark) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10013) Remove Java assert from Java unit tests
[ https://issues.apache.org/jira/browse/SPARK-10013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731585#comment-14731585 ] Apache Spark commented on SPARK-10013: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/8607 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
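The pitfall the ticket describes is the same one Python has with bare `assert` under `python -O`: Java's `assert` statement is a no-op unless the JVM runs with `-ea`, so test assertions written that way can silently pass. A Python sketch of the JUnit-style fix (hypothetical test case, for illustration only):

```python
import unittest

class VectorSuite(unittest.TestCase):
    # assertEqual/assertTrue always execute, unlike a bare `assert`
    # statement, which can be stripped (python -O here; a JVM started
    # without -ea in the Java case the ticket describes).
    def test_size(self):
        v = [1.0, 2.0, 3.0]
        self.assertEqual(len(v), 3)
        self.assertTrue(all(x > 0 for x in v))

suite = unittest.defaultTestLoader.loadTestsFromTestCase(VectorSuite)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```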
[jira] [Commented] (SPARK-8630) Prevent from checkpointing QueueInputDStream
[ https://issues.apache.org/jira/browse/SPARK-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731556#comment-14731556 ] Tathagata Das commented on SPARK-8630: -- I get your point. I guess it will be good to revert this patch and just generate a WARNING. A more principled solution would be to allow users to separately enable RDD checkpointing and DStream checkpointing, so that they can only enable RDD checkpointing and keep DStream checkpointing disabled (for queueStream to work). > Prevent from checkpointing QueueInputDStream > > > Key: SPARK-8630 > URL: https://issues.apache.org/jira/browse/SPARK-8630 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.4.1, 1.5.0 > > > It's better to prevent from checkpointing QueueInputDStream rather than > failing the application when recovering `QueueInputDStream`, so that people > can find the issue as soon as possible. See SPARK-8553 for example. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
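The reverted behaviour sketched in the comment, warning rather than failing when checkpointing is enabled for a queue-backed stream, could look roughly like the following. These are toy stand-ins, not the real streaming classes:

```python
import warnings

class QueueInputDStream:
    # Toy stand-in: queue-backed streams cannot be recovered from a
    # checkpoint, so they are flagged as not checkpointable.
    checkpointable = False

def validate_checkpointing(dstream, checkpoint_dir):
    # Warn instead of raising, so existing applications keep running
    # while still surfacing the problem early.
    if checkpoint_dir and not getattr(dstream, "checkpointable", True):
        warnings.warn("queueStream does not support checkpointing; "
                      "recovery from this checkpoint will fail")

validate_checkpointing(QueueInputDStream(), "/tmp/ckpt")  # emits a UserWarning
```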
[jira] [Updated] (SPARK-10311) In cluster mode, AppId and AttemptId should be updated when ApplicationMaster is new
[ https://issues.apache.org/jira/browse/SPARK-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10311: -- Assignee: meiyoula > In cluster mode, AppId and AttemptId should be updated when ApplicationMaster > is new > --- > > Key: SPARK-10311 > URL: https://issues.apache.org/jira/browse/SPARK-10311 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.4.1, 1.5.0 >Reporter: meiyoula >Assignee: meiyoula > Fix For: 1.6.0, 1.5.1 > > > When I start a streaming app with checkpoint data in yarn-cluster mode, the > appId and attemptId are old (they belong to the app that first created the > checkpoint data), and the event log writes into the old file name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10311) In cluster mode, AppId and AttemptId should be updated when ApplicationMaster is new
[ https://issues.apache.org/jira/browse/SPARK-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-10311. --- Resolution: Fixed Fix Version/s: 1.5.1 1.6.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10420) Implementing Reactive Streams based Spark Streaming Receiver
[ https://issues.apache.org/jira/browse/SPARK-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10420: -- Target Version/s: 1.6.0 (was: ) > Implementing Reactive Streams based Spark Streaming Receiver > > > Key: SPARK-10420 > URL: https://issues.apache.org/jira/browse/SPARK-10420 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Nilanjan Raychaudhuri >Priority: Minor > > Hello TD, > This is probably the last bit of the back-pressure story, implementing > ReactiveStreams based Spark streaming receivers. After discussing about this > with my Typesafe team we came up with the following design document > https://docs.google.com/document/d/1lGQKXfNznd5SPuQigvCdLsudl-gcvWKuHWr0Bpn3y30/edit?usp=sharing > Could you please take a look at this when you get a chance? > Thanks > Nilanjan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10433) Gradient boosted trees
[ https://issues.apache.org/jira/browse/SPARK-10433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731533#comment-14731533 ] DB Tsai commented on SPARK-10433: - [~sowen] I can confirm that this should be fixed in 1.5 > Gradient boosted trees > -- > > Key: SPARK-10433 > URL: https://issues.apache.org/jira/browse/SPARK-10433 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.4.1, 1.5.0 >Reporter: Sean Owen > > (Sorry to say I don't have any leads on a fix, but this was reported by three > different people and I confirmed it at fairly close range, so think it's > legitimate:) > This is probably best explained in the words from the mailing list thread at > http://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/%3C55E84380.2000408%40gmail.com%3E > . Matt Forbes says: > {quote} > I am training a boosted trees model on a couple million input samples (with > around 300 features) and am noticing that the input size of each stage is > increasing each iteration. For each new tree, the first step seems to be > building the decision tree metadata, which does a .count() on the input data, > so this is the step I've been using to track the input size changing. Here is > what I'm seeing: > {quote} > {code} > count at DecisionTreeMetadata.scala:111 > 1. Input Size / Records: 726.1 MB / 1295620 > 2. Input Size / Records: 106.9 GB / 64780816 > 3. Input Size / Records: 160.3 GB / 97171224 > 4. Input Size / Records: 214.8 GB / 129680959 > 5. Input Size / Records: 268.5 GB / 162533424 > > Input Size / Records: 1912.6 GB / 1382017686 > > {code} > {quote} > This step goes from taking less than 10s up to 5 minutes by the 15th or so > iteration. I'm not quite sure what could be causing this. I am passing a > memory-only cached RDD[LabeledPoint] to GradientBoostedTrees.train > {quote} > Johannes Bauer showed me a very similar problem. 
> Peter Rudenko offers this sketch of a reproduction: > {code} > val boostingStrategy = BoostingStrategy.defaultParams("Classification") > boostingStrategy.setNumIterations(30) > boostingStrategy.setLearningRate(1.0) > boostingStrategy.treeStrategy.setMaxDepth(3) > boostingStrategy.treeStrategy.setMaxBins(128) > boostingStrategy.treeStrategy.setSubsamplingRate(1.0) > boostingStrategy.treeStrategy.setMinInstancesPerNode(1) > boostingStrategy.treeStrategy.setUseNodeIdCache(true) > boostingStrategy.treeStrategy.setCategoricalFeaturesInfo( > > mapAsJavaMap(categoricalFeatures).asInstanceOf[java.util.Map[java.lang.Integer, > java.lang.Integer]]) > val model = GradientBoostedTrees.train(instances, boostingStrategy) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid
[ https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10304: -- Target Version/s: (was: 1.6.0, 1.5.1) > Partition discovery does not throw an exception if the dir structure is > invalid > --- > > Key: SPARK-10304 > URL: https://issues.apache.org/jira/browse/SPARK-10304 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Zhan Zhang >Priority: Critical > > I have a dir structure like {{/path/table1/partition_column=1/}}. When I try > to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if > it is stored as ORC, there will be the following NPE. But, if it is Parquet, > we even can return rows. We should complain to users about the dir struct > because {{table1}} does not meet our format. > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in > stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 > (TID 3504, 10.0.195.227): java.lang.NullPointerException > at > org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466) > at > org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > 
org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318) > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316) > at > org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid
[ https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10304: -- Target Version/s: 1.6.0, 1.5.1 (was: 1.5.1,1.6.0) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid
[ https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10304: -- Target Version/s: 1.6.0, 1.5.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
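The complaint the ticket asks for, rejecting a table path whose child directories are not of the `column=value` form, could look roughly like this hypothetical helper:

```python
import re

# A partition directory must look like "column=value", e.g.
# "partition_column=1" in the reporter's layout.
_PARTITION_DIR = re.compile(r"^[^=/]+=[^/]+$")

def validate_partition_dirs(dir_names):
    # Raise on any directory that does not follow the partition format,
    # instead of silently returning bad rows (Parquet) or an NPE (ORC).
    bad = [d for d in dir_names if not _PARTITION_DIR.match(d)]
    if bad:
        raise ValueError("invalid partition directories: " + ", ".join(bad))

validate_partition_dirs(["partition_column=1", "partition_column=2"])  # passes
# validate_partition_dirs(["table1"]) would raise ValueError, matching the
# /path/table1/partition_column=1/ case where load("/path/") should complain.
```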
[jira] [Resolved] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage
[ https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-10454. --- Resolution: Fixed Fix Version/s: 1.5.1 1.6.0 Target Version/s: 1.6.0, 1.5.1 > Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause > multiple concurrent attempts for the same map stage > - > > Key: SPARK-10454 > URL: https://issues.apache.org/jira/browse/SPARK-10454 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.5.1 >Reporter: Pete Robbins >Assignee: Pete Robbins >Priority: Critical > Labels: flaky-test > Fix For: 1.6.0, 1.5.1 > > > test case fails intermittently in Jenkins. > For eg, see the following builds- > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/ > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage
[ https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10454: -- Priority: Critical (was: Minor) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage
[ https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10454: -- Assignee: Pete Robbins -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage
[ https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10454: -- Labels: flaky-test (was: ) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9669) Support PySpark with Mesos Cluster mode
[ https://issues.apache.org/jira/browse/SPARK-9669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-9669. -- Resolution: Fixed Fix Version/s: 1.6.0 > Support PySpark with Mesos Cluster mode > --- > > Key: SPARK-9669 > URL: https://issues.apache.org/jira/browse/SPARK-9669 > Project: Spark > Issue Type: New Feature > Components: Mesos, PySpark >Affects Versions: 1.5.0 >Reporter: Timothy Chen >Assignee: Timothy Chen > Fix For: 1.6.0 > > > PySpark with cluster mode with Mesos is not yet supported. > We need to enable it and make sure it's able to launch Pyspark jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10450) Minor SQL style, format, typo, readability fixes
[ https://issues.apache.org/jira/browse/SPARK-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-10450. --- Resolution: Fixed Fix Version/s: 1.6.0 > Minor SQL style, format, typo, readability fixes > > > Key: SPARK-10450 > URL: https://issues.apache.org/jira/browse/SPARK-10450 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > Fix For: 1.6.0 > > > This JIRA isn't exactly tied to one particular patch. Like SPARK-10003 it's > more of a continuous process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save
[ https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731529#comment-14731529 ] Xiangrui Meng commented on SPARK-10199: --- The improvement numbers also depend on the model size. In unit tests, the model sizes are usually very small, so the overhead of reflection becomes significant. With real models, either the model itself is small, or the model is large and the overhead of reflection becomes insignificant. Keeping the code simple and easy to understand is also quite important. +[~josephkb] > Avoid using reflections for parquet model save > -- > > Key: SPARK-10199 > URL: https://issues.apache.org/jira/browse/SPARK-10199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Feynman Liang >Priority: Minor > > These items are not high priority since the overhead of writing to Parquet is > much greater than that of runtime reflection. > Multiple model save/load implementations in MLlib use case classes to infer a schema for the > data frame saved to Parquet. However, inferring a schema from case classes or > tuples uses [runtime > reflection|https://github.com/apache/spark/blob/d7b4c095271c36fcc7f9ded267ecf5ec66fac803/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L361], > which is unnecessary since the types are already known at the time `save` is > called. > It would be better to specify the schema for the data frame directly > using {{sqlContext.createDataFrame(dataRDD, schema)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
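[Editor's note] The {{sqlContext.createDataFrame(dataRDD, schema)}} approach discussed above can be sketched as follows. This is a hedged illustration, not the actual MLlib save code: the field names (`weights`, `intercept`) and the in-scope values (`sc`, `sqlContext`, `model`, `savePath`) are hypothetical stand-ins for whatever a concrete model's save method would use.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// The schema is known statically at the time save is called, so declare it
// explicitly instead of inferring it from a case class via runtime reflection.
val schema = StructType(Seq(
  StructField("weights", ArrayType(DoubleType, containsNull = false), nullable = false),
  StructField("intercept", DoubleType, nullable = false)))

// Build Rows matching the declared schema from the model's fields.
val dataRDD = sc.parallelize(Seq(Row(model.weights.toArray.toSeq, model.intercept)))

// No reflection-based schema inference happens on this path.
sqlContext.createDataFrame(dataRDD, schema).write.parquet(savePath)
```

The trade-off Xiangrui raises is visible here: the explicit schema saves a small fixed reflection cost but adds boilerplate that must be kept in sync with the model's fields by hand.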
[jira] [Updated] (SPARK-10176) Show partially analyzed plan when checkAnswer df fails to resolve
[ https://issues.apache.org/jira/browse/SPARK-10176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10176: -- Fix Version/s: (was: 1.5.0) 1.6.0 > Show partially analyzed plan when checkAnswer df fails to resolve > - > > Key: SPARK-10176 > URL: https://issues.apache.org/jira/browse/SPARK-10176 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Reporter: Michael Armbrust >Assignee: Michael Armbrust > Fix For: 1.6.0 > > > It would be much easier to debug test failures if we could see the failed > plan instead of just the user friendly error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10176) Show partially analyzed plan when checkAnswer df fails to resolve
[ https://issues.apache.org/jira/browse/SPARK-10176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-10176. --- Resolution: Fixed Fix Version/s: 1.5.0 > Show partially analyzed plan when checkAnswer df fails to resolve > - > > Key: SPARK-10176 > URL: https://issues.apache.org/jira/browse/SPARK-10176 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Reporter: Michael Armbrust >Assignee: Michael Armbrust > Fix For: 1.5.0 > > > It would be much easier to debug test failures if we could see the failed > plan instead of just the user friendly error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10176) Show partially analyzed plan when checkAnswer df fails to resolve
[ https://issues.apache.org/jira/browse/SPARK-10176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10176: -- Target Version/s: 1.6.0 (was: 1.5.0) > Show partially analyzed plan when checkAnswer df fails to resolve > - > > Key: SPARK-10176 > URL: https://issues.apache.org/jira/browse/SPARK-10176 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Reporter: Michael Armbrust >Assignee: Michael Armbrust > Fix For: 1.6.0 > > > It would be much easier to debug test failures if we could see the failed > plan instead of just the user friendly error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10311) In cluster mode, AppId and AttemptId should be update when ApplicationMaster is new
[ https://issues.apache.org/jira/browse/SPARK-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10311: -- Affects Version/s: 1.5.0 1.4.1 > In cluster mode, AppId and AttemptId should be update when ApplicationMaster > is new > --- > > Key: SPARK-10311 > URL: https://issues.apache.org/jira/browse/SPARK-10311 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.4.1, 1.5.0 >Reporter: meiyoula > > When I start a streaming app with checkpoint data in yarn-cluster mode, the > appId and attempId are old(which app first create the checkpoint data), and > the event log writes into the old file name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10311) In cluster mode, AppId and AttemptId should be update when ApplicationMaster is new
[ https://issues.apache.org/jira/browse/SPARK-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10311: -- Target Version/s: 1.6.0, 1.5.1 > In cluster mode, AppId and AttemptId should be update when ApplicationMaster > is new > --- > > Key: SPARK-10311 > URL: https://issues.apache.org/jira/browse/SPARK-10311 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.4.1, 1.5.0 >Reporter: meiyoula > > When I start a streaming app with checkpoint data in yarn-cluster mode, the > appId and attempId are old(which app first create the checkpoint data), and > the event log writes into the old file name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-10457) Unable to connect to MySQL with the DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-10457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mariano Simone closed SPARK-10457. -- Resolution: Fixed Found the solution: spark.executor.extraClassPath needed to be configured. > Unable to connect to MySQL with the DataFrame API > - > > Key: SPARK-10457 > URL: https://issues.apache.org/jira/browse/SPARK-10457 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: Linux singularity 3.13.0-63-generic #103-Ubuntu SMP Fri > Aug 14 21:42:59 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux > Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60) > "org.apache.spark" %% "spark-core"% "1.4.1" % "provided", > "org.apache.spark" % "spark-sql_2.10"% "1.4.1" % "provided", > "org.apache.spark" % "spark-streaming_2.10" % "1.4.1" % "provided", > "org.apache.spark" %% "spark-streaming-kafka" % "1.4.1", > "mysql"% "mysql-connector-java" % "5.1.36" >Reporter: Mariano Simone > > I'm getting this error every time I try to create a dataframe using jdbc: > java.sql.SQLException: No suitable driver found for > jdbc:mysql://localhost:3306/test > What I have so far: > standard sbt project. > Added the dep. 
on mysql-connector to build.sbt like this: > "mysql"% "mysql-connector-java" % "5.1.36" > The code that creates the df: > val url = "jdbc:mysql://localhost:3306/test" > val table = "test_table" > val properties = new Properties > properties.put("user", "123") > properties.put("password", "123") > properties.put("driver", "com.mysql.jdbc.Driver") > val tiers = sqlContext.read.jdbc(url, table, properties) > I also loaded the jar like this: > streamingContext.sparkContext.addJar("mysql-connector-java-5.1.36.jar") > This is the back trace of the exception being thrown: > 15/09/04 18:37:40 ERROR JobScheduler: Error running job streaming job > 144140266 ms.0 > java.sql.SQLException: No suitable driver found for > jdbc:mysql://localhost:3306/test > at java.sql.DriverManager.getConnection(DriverManager.java:689) > at java.sql.DriverManager.getConnection(DriverManager.java:208) > at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:118) > at org.apache.spark.sql.jdbc.JDBCRelation.(JDBCRelation.scala:128) > at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:200) > at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:130) > at com.playtika.etl.Application$.processRDD(Application.scala:69) > at > com.playtika.etl.Application$$anonfun$processStream$1.apply(Application.scala:52) > at > com.playtika.etl.Application$$anonfun$processStream$1.apply(Application.scala:51) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40) > 
at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at scala.util.Try$.apply(Try.scala:161) > at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34) > at > org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193) > at > org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193) > at > org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Let me know if more data is needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
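[Editor's note] The one-line resolution above ("spark.executor.extraClassPath needed configuration") is terse; a hedged sketch of what that configuration typically looks like follows. The jar path below is an example only — it must point at wherever the MySQL connector jar actually lives on each node, and the jar generally needs to be on the classpath of both the driver and the executors, since `addJar` alone does not make the driver class visible to `java.sql.DriverManager`.

```
# spark-defaults.conf -- example paths, adjust to the real jar location;
# the same values can be passed via --conf on spark-submit instead.
spark.driver.extraClassPath    /opt/jars/mysql-connector-java-5.1.36.jar
spark.executor.extraClassPath  /opt/jars/mysql-connector-java-5.1.36.jar
```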
[jira] [Updated] (SPARK-10457) Unable to connect to MySQL with the DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-10457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mariano Simone updated SPARK-10457: --- Description: I'm getting this error everytime I try to create a dataframe using jdbc: java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost:3306/test What I have so far: standart sbt project. Added the dep. on mysql-connector to build.sbt like this: "mysql"% "mysql-connector-java" % "5.1.36" The code that creates the df: val url = "jdbc:mysql://localhost:3306/test" val table = "test_table" val properties = new Properties properties.put("user", "123") properties.put("password", "123") properties.put("driver", "com.mysql.jdbc.Driver") val tiers = sqlContext.read.jdbc(url, table, properties) I also loaded the jar like this: streamingContext.sparkContext.addJar("mysql-connector-java-5.1.36.jar") This is the back trace of the exception being thrown: 15/09/04 18:37:40 ERROR JobScheduler: Error running job streaming job 144140266 ms.0 java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost:3306/test at java.sql.DriverManager.getConnection(DriverManager.java:689) at java.sql.DriverManager.getConnection(DriverManager.java:208) at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:118) at org.apache.spark.sql.jdbc.JDBCRelation.(JDBCRelation.scala:128) at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:200) at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:130) at com.playtika.etl.Application$.processRDD(Application.scala:69) at com.playtika.etl.Application$$anonfun$processStream$1.apply(Application.scala:52) at com.playtika.etl.Application$$anonfun$processStream$1.apply(Application.scala:51) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Let me know if more data is needed. was: I'm getting this error everytime I try to create a dataframe using jdbc: java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost:3306/test What I have so far: standart sbt project. Added the dep. 
on mysql-connector to build.sbt like this: "mysql"% "mysql-connector-java" % "5.1.36" The code that creates the df: val url = "jdbc:mysql://localhost:3306/test" val table = "test_table" val properties = new Properties properties.put("user", "123") properties.put("password", "123") properties.put("driver", "com.mysql.jdbc.Driver") val tiers = sqlContext.read.jdbc(url, table, properties) I also loaded the jar like this: streamingContext.sparkContext.addJar("mysql-connector-java-5.1.36.jar") This is the back trace of the exception being thrown: 15/09/04 18:37:40 ERROR JobScheduler: Error running job streaming job 144140266 ms.0 java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost:3306/test at java.sql.DriverManager.getConnection(DriverManager.java:689) at java.sql.DriverManager.getConnection(DriverManager.java:208) at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:118) at org.apache.spark.sql.jdbc.JDBCRelation.(JDBCRelation.scala:128) at org.apache.spark.
[jira] [Created] (SPARK-10457) Unable to connect to MySQL with the DataFrame API
Mariano Simone created SPARK-10457: -- Summary: Unable to connect to MySQL with the DataFrame API Key: SPARK-10457 URL: https://issues.apache.org/jira/browse/SPARK-10457 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Environment: Linux singularity 3.13.0-63-generic #103-Ubuntu SMP Fri Aug 14 21:42:59 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60) "org.apache.spark" %% "spark-core"% "1.4.1" % "provided", "org.apache.spark" % "spark-sql_2.10"% "1.4.1" % "provided", "org.apache.spark" % "spark-streaming_2.10" % "1.4.1" % "provided", "org.apache.spark" %% "spark-streaming-kafka" % "1.4.1", "mysql"% "mysql-connector-java" % "5.1.36" Reporter: Mariano Simone I'm getting this error every time I try to create a dataframe using jdbc: java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost:3306/test What I have so far: standard sbt project. Added the dep. on mysql-connector to build.sbt like this: "mysql"% "mysql-connector-java" % "5.1.36" The code that creates the df: val url = "jdbc:mysql://localhost:3306/test" val table = "test_table" val properties = new Properties properties.put("user", "123") properties.put("password", "123") properties.put("driver", "com.mysql.jdbc.Driver") val tiers = sqlContext.read.jdbc(url, table, properties) I also loaded the jar like this: streamingContext.sparkContext.addJar("mysql-connector-java-5.1.36.jar") This is the backtrace of the exception being thrown: 15/09/04 18:37:40 ERROR JobScheduler: Error running job streaming job 144140266 ms.0 java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost:3306/test at java.sql.DriverManager.getConnection(DriverManager.java:689) at java.sql.DriverManager.getConnection(DriverManager.java:208) at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:118) at org.apache.spark.sql.jdbc.JDBCRelation.(JDBCRelation.scala:128) at 
org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:200) at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:130) at com.playtika.etl.Application$.processRDD(Application.scala:69) at com.playtika.etl.Application$$anonfun$processStream$1.apply(Application.scala:52) at com.playtika.etl.Application$$anonfun$processStream$1.apply(Application.scala:51) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To 
unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10456) upgrade java 7 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731436#comment-14731436 ] shane knapp edited comment on SPARK-10456 at 9/4/15 9:46 PM: - looks like we'll be installing 7u79 (we're at 7u71 currently). was (Author: shaneknapp): looks like we'll be installing 7u79 (we're at 7u51 currently). > upgrade java 7 on amplab jenkins workers > > > Key: SPARK-10456 > URL: https://issues.apache.org/jira/browse/SPARK-10456 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp >Assignee: shane knapp > Labels: build > > our java 7 installation is really old (from last september). update this to > the latest&greatest java 7 jdk. > please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-10455) install java 8 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp closed SPARK-10455. --- FIN! > install java 8 on amplab jenkins workers > > > Key: SPARK-10455 > URL: https://issues.apache.org/jira/browse/SPARK-10455 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp >Assignee: shane knapp > > install java 8 on all jenkins workers. > and just for clarification: we want the 64-bit version, yes? > please assign this to me, thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10455) install java 8 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp resolved SPARK-10455. - Resolution: Done > install java 8 on amplab jenkins workers > > > Key: SPARK-10455 > URL: https://issues.apache.org/jira/browse/SPARK-10455 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp >Assignee: shane knapp > > install java 8 on all jenkins workers. > and just for clarification: we want the 64-bit version, yes? > please assign this to me, thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10452) Pyspark worker security issue
[ https://issues.apache.org/jira/browse/SPARK-10452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731332#comment-14731332 ] Marcelo Vanzin edited comment on SPARK-10452 at 9/4/15 9:43 PM: If you need your workers to run as your user, you need to configure YARN to use Kerberos. was (Author: vanzin): If you need your workers to run as you user, you need to configure YARN to use Kerberos. > Pyspark worker security issue > - > > Key: SPARK-10452 > URL: https://issues.apache.org/jira/browse/SPARK-10452 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.0 > Environment: Spark 1.4.0 running on hadoop 2.5.2. >Reporter: Michael Procopio >Priority: Critical > > The python worker launched by the executor is given the credentials used to > launch yarn. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10456) upgrade java 7 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731480#comment-14731480 ] shane knapp commented on SPARK-10456: - ok, 79 is installed but i will wait until downtime to switch the symlinks over. here's the command i will be running when that time comes: pssh -h jenkins_workers.txt "cd /usr/java; rm -f latest; rm -f default; ln -s jdk1.7.0_79 latest; ln -s latest default" > upgrade java 7 on amplab jenkins workers > > > Key: SPARK-10456 > URL: https://issues.apache.org/jira/browse/SPARK-10456 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp >Assignee: shane knapp > Labels: build > > our java 7 installation is really old (from last september). update this to > the latest&greatest java 7 jdk. > please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10402) Add scaladoc for default values of params in ML
[ https://issues.apache.org/jira/browse/SPARK-10402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10402: -- Shepherd: Joseph K. Bradley Assignee: holdenk Target Version/s: 1.6.0, 1.5.1 > Add scaladoc for default values of params in ML > --- > > Key: SPARK-10402 > URL: https://issues.apache.org/jira/browse/SPARK-10402 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: holdenk >Assignee: holdenk >Priority: Minor > > We should make sure the scaladoc for params includes their default values > through the models in ml/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10455) install java 8 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731478#comment-14731478 ] shane knapp commented on SPARK-10455: - it's installed in: /usr/java/jdk1.8.0_60 i'll email the dev@ list and let everyone know. > install java 8 on amplab jenkins workers > > > Key: SPARK-10455 > URL: https://issues.apache.org/jira/browse/SPARK-10455 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp >Assignee: shane knapp > > install java 8 on all jenkins workers. > and just for clarification: we want the 64-bit version, yes? > please assign this to me, thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9963) ML RandomForest cleanup: replace predictNodeIndex with predictImpl
[ https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731468#comment-14731468 ] Joseph K. Bradley commented on SPARK-9963: -- Yep, that first case in the if-else is for the right-most bin with range [maxSplitValue, +inf] > ML RandomForest cleanup: replace predictNodeIndex with predictImpl > -- > > Key: SPARK-9963 > URL: https://issues.apache.org/jira/browse/SPARK-9963 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Trivial > Labels: starter > > Replace ml.tree.impl.RandomForest.predictNodeIndex with Node.predictImpl. > This should be straightforward, but please ping me if anything is unclear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9963) ML RandomForest cleanup: replace predictNodeIndex with predictImpl
[ https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731462#comment-14731462 ] Joseph K. Bradley commented on SPARK-9963: -- Sorry for the slow response! (I've been traveling.) Option 2 sounds best. It can resemble the current predictImpl, but can use the version of shouldGoLeft taking binned feature values. > ML RandomForest cleanup: replace predictNodeIndex with predictImpl > -- > > Key: SPARK-9963 > URL: https://issues.apache.org/jira/browse/SPARK-9963 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Trivial > Labels: starter > > Replace ml.tree.impl.RandomForest.predictNodeIndex with Node.predictImpl. > This should be straightforward, but please ping me if anything is unclear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10456) upgrade java 7 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-10456: --- Assignee: shane knapp > upgrade java 7 on amplab jenkins workers > > > Key: SPARK-10456 > URL: https://issues.apache.org/jira/browse/SPARK-10456 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp >Assignee: shane knapp > Labels: build > > our java 7 installation is really old (from last september). update this to > the latest&greatest java 7 jdk. > please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10455) install java 8 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731440#comment-14731440 ] Josh Rosen commented on SPARK-10455: Yep, I think we want the 64-bit version. > install java 8 on amplab jenkins workers > > > Key: SPARK-10455 > URL: https://issues.apache.org/jira/browse/SPARK-10455 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp >Assignee: shane knapp > > install java 8 on all jenkins workers. > and just for clarification: we want the 64-bit version, yes? > please assign this to me, thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10455) install java 8 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-10455: --- Assignee: shane knapp > install java 8 on amplab jenkins workers > > > Key: SPARK-10455 > URL: https://issues.apache.org/jira/browse/SPARK-10455 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp >Assignee: shane knapp > > install java 8 on all jenkins workers. > and just for clarification: we want the 64-bit version, yes? > please assign this to me, thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10456) upgrade java 7 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731436#comment-14731436 ] shane knapp commented on SPARK-10456: - looks like we'll be installing 7u79 (we're at 7u51 currently). > upgrade java 7 on amplab jenkins workers > > > Key: SPARK-10456 > URL: https://issues.apache.org/jira/browse/SPARK-10456 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp > Labels: build > > our java 7 installation is really old (from last september). update this to > the latest&greatest java 7 jdk. > please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10455) install java 8 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731428#comment-14731428 ] shane knapp commented on SPARK-10455: - looks like i'll be installing java 8u60. > install java 8 on amplab jenkins workers > > > Key: SPARK-10455 > URL: https://issues.apache.org/jira/browse/SPARK-10455 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp > > install java 8 on all jenkins workers. > and just for clarification: we want the 64-bit version, yes? > please assign this to me, thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10456) upgrade java 7 on amplab jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp updated SPARK-10456: Description: our java 7 installation is really old (from last september). update this to the latest&greatest java 7 jdk. please assign this to me. was:our java 7 installation is really old (from last september). update this to the latest&greatest java 7 jdk > upgrade java 7 on amplab jenkins workers > > > Key: SPARK-10456 > URL: https://issues.apache.org/jira/browse/SPARK-10456 > Project: Spark > Issue Type: Task > Components: Build >Reporter: shane knapp > Labels: build > > our java 7 installation is really old (from last september). update this to > the latest&greatest java 7 jdk. > please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10456) upgrade java 7 on amplab jenkins workers
shane knapp created SPARK-10456: --- Summary: upgrade java 7 on amplab jenkins workers Key: SPARK-10456 URL: https://issues.apache.org/jira/browse/SPARK-10456 Project: Spark Issue Type: Task Components: Build Reporter: shane knapp our java 7 installation is really old (from last september). update this to the latest&greatest java 7 jdk -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10455) install java 8 on amplab jenkins workers
shane knapp created SPARK-10455: --- Summary: install java 8 on amplab jenkins workers Key: SPARK-10455 URL: https://issues.apache.org/jira/browse/SPARK-10455 Project: Spark Issue Type: Task Components: Build Reporter: shane knapp install java 8 on all jenkins workers. and just for clarification: we want the 64-bit version, yes? please assign this to me, thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731415#comment-14731415 ] Imran Rashid commented on SPARK-4105: - [~mvherweg] Do you know if the error occurred after there was already a stage retry? If so, then this might just be a symptom of SPARK-8029. You would know if earlier in the logs, you see a FetchFailedException which is *not* related to snappy exceptions. I think that is the first report of this bug since SPARK-7660, which we were really hoping fixed this issue, so it would be great to capture more information about it. [~mmitsuto] Can you do the same check, and also tell us which version of Spark you are using? > FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based > shuffle > - > > Key: SPARK-4105 > URL: https://issues.apache.org/jira/browse/SPARK-4105 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.1 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > Attachments: JavaObjectToSerialize.java, > SparkFailedToUncompressGenerator.scala > > > We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during > shuffle read. 
Here's a sample stacktrace from an executor: > {code} > 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID > 33053) > java.io.IOException: FAILED_TO_UNCOMPRESS(5) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) > at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) > at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) > at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) > at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) > at > org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > at > 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at >
[jira] [Commented] (SPARK-10433) Gradient boosted trees
[ https://issues.apache.org/jira/browse/SPARK-10433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731398#comment-14731398 ] Joseph K. Bradley commented on SPARK-10433: --- Has this been reported on 1.5? I've seen reports for 1.4, but was told by [~dbtsai] that 1.5 seems to have fixed this issue. I believe that the caching (and optional checkpointing) added in 1.5 fix this issue, but it would be great to get confirmation. > Gradient boosted trees > -- > > Key: SPARK-10433 > URL: https://issues.apache.org/jira/browse/SPARK-10433 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.4.1, 1.5.0 >Reporter: Sean Owen > > (Sorry to say I don't have any leads on a fix, but this was reported by three > different people and I confirmed it at fairly close range, so think it's > legitimate:) > This is probably best explained in the words from the mailing list thread at > http://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/%3C55E84380.2000408%40gmail.com%3E > . Matt Forbes says: > {quote} > I am training a boosted trees model on a couple million input samples (with > around 300 features) and am noticing that the input size of each stage is > increasing each iteration. For each new tree, the first step seems to be > building the decision tree metadata, which does a .count() on the input data, > so this is the step I've been using to track the input size changing. Here is > what I'm seeing: > {quote} > {code} > count at DecisionTreeMetadata.scala:111 > 1. Input Size / Records: 726.1 MB / 1295620 > 2. Input Size / Records: 106.9 GB / 64780816 > 3. Input Size / Records: 160.3 GB / 97171224 > 4. Input Size / Records: 214.8 GB / 129680959 > 5. Input Size / Records: 268.5 GB / 162533424 > > Input Size / Records: 1912.6 GB / 1382017686 > > {code} > {quote} > This step goes from taking less than 10s up to 5 minutes by the 15th or so > iteration. I'm not quite sure what could be causing this. 
I am passing a > memory-only cached RDD[LabeledPoint] to GradientBoostedTrees.train > {quote} > Johannes Bauer showed me a very similar problem. > Peter Rudenko offers this sketch of a reproduction: > {code} > val boostingStrategy = BoostingStrategy.defaultParams("Classification") > boostingStrategy.setNumIterations(30) > boostingStrategy.setLearningRate(1.0) > boostingStrategy.treeStrategy.setMaxDepth(3) > boostingStrategy.treeStrategy.setMaxBins(128) > boostingStrategy.treeStrategy.setSubsamplingRate(1.0) > boostingStrategy.treeStrategy.setMinInstancesPerNode(1) > boostingStrategy.treeStrategy.setUseNodeIdCache(true) > boostingStrategy.treeStrategy.setCategoricalFeaturesInfo( > > mapAsJavaMap(categoricalFeatures).asInstanceOf[java.util.Map[java.lang.Integer, > java.lang.Integer]]) > val model = GradientBoostedTrees.train(instances, boostingStrategy) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
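As context for the comment above: the caching and optional checkpointing added in 1.5 are meant to stop the per-iteration lineage (and thus input size) from growing with each new tree. A hedged sketch of enabling both on the spark.mllib API, under the assumption that this is the intended 1.5 usage; the checkpoint directory and the tiny dataset are illustrative only:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

val sc = new SparkContext(
  new SparkConf().setAppName("gbt-checkpoint-sketch").setMaster("local[2]"))
sc.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical path

// Tiny stand-in dataset; a real job would use the memory-cached
// RDD[LabeledPoint] described in the report.
val instances = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0)))).cache()

val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.setNumIterations(30)
// Node-id caching plus periodic checkpointing truncates the RDD lineage
// that otherwise accumulates one stage per boosting iteration.
boostingStrategy.treeStrategy.setUseNodeIdCache(true)
boostingStrategy.treeStrategy.setCheckpointInterval(10)

val model = GradientBoostedTrees.train(instances, boostingStrategy)
```

If confirmation on 1.5 is wanted, re-running the reporter's job with these two settings and watching the `count at DecisionTreeMetadata.scala` input size per iteration would be the direct check.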
[jira] [Assigned] (SPARK-10439) Catalyst should check for overflow / underflow of date and timestamp values
[ https://issues.apache.org/jira/browse/SPARK-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10439: Assignee: Apache Spark > Catalyst should check for overflow / underflow of date and timestamp values > --- > > Key: SPARK-10439 > URL: https://issues.apache.org/jira/browse/SPARK-10439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > > While testing some code, I noticed that a few methods in {{DateTimeUtils}} > are prone to overflow and underflow. > For example, {{millisToDays}} can overflow the return type ({{Int}}) if a > large enough input value is provided. > Similarly, {{fromJavaTimestamp}} converts milliseconds to microseconds, which > can overflow if the input is {{> Long.MAX_VALUE / 1000}} (or underflow in the > negative case). > There might be others but these were the ones that caught my eye. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10439) Catalyst should check for overflow / underflow of date and timestamp values
[ https://issues.apache.org/jira/browse/SPARK-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731392#comment-14731392 ] Apache Spark commented on SPARK-10439: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/8606 > Catalyst should check for overflow / underflow of date and timestamp values > --- > > Key: SPARK-10439 > URL: https://issues.apache.org/jira/browse/SPARK-10439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Marcelo Vanzin >Priority: Minor > > While testing some code, I noticed that a few methods in {{DateTimeUtils}} > are prone to overflow and underflow. > For example, {{millisToDays}} can overflow the return type ({{Int}}) if a > large enough input value is provided. > Similarly, {{fromJavaTimestamp}} converts milliseconds to microseconds, which > can overflow if the input is {{> Long.MAX_VALUE / 1000}} (or underflow in the > negative case). > There might be others but these were the ones that caught my eye. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10439) Catalyst should check for overflow / underflow of date and timestamp values
[ https://issues.apache.org/jira/browse/SPARK-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10439: Assignee: (was: Apache Spark) > Catalyst should check for overflow / underflow of date and timestamp values > --- > > Key: SPARK-10439 > URL: https://issues.apache.org/jira/browse/SPARK-10439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Marcelo Vanzin >Priority: Minor > > While testing some code, I noticed that a few methods in {{DateTimeUtils}} > are prone to overflow and underflow. > For example, {{millisToDays}} can overflow the return type ({{Int}}) if a > large enough input value is provided. > Similarly, {{fromJavaTimestamp}} converts milliseconds to microseconds, which > can overflow if the input is {{> Long.MAX_VALUE / 1000}} (or underflow in the > negative case). > There might be others but these were the ones that caught my eye. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
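The two overflows described in the issue can be reproduced without Spark at all; a standalone sketch of the arithmetic (the day-length and microsecond factors mirror what `millisToDays` and `fromJavaTimestamp` are described as doing, but this is plain Scala, not the Catalyst code):

```scala
// millisToDays-style conversion: the day count is returned as an Int,
// so a large-enough millisecond value silently wraps on the cast.
val hugeMillis = Long.MaxValue
val days = (hugeMillis / 86400000L).toInt // wraps to a negative number

// fromJavaTimestamp-style conversion: milliseconds -> microseconds
// overflows once millis > Long.MaxValue / 1000 (and underflows in the
// symmetric negative case).
val millis = Long.MaxValue / 1000 + 1
val micros = millis * 1000 // wraps past Long.MaxValue to a negative value

assert(days < 0)
assert(micros < 0)
```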
[jira] [Closed] (SPARK-10414) DenseMatrix gives different hashcode even though equals returns true
[ https://issues.apache.org/jira/browse/SPARK-10414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-10414. - Resolution: Duplicate This looks like a duplicate of an existing JIRA and PR. [~vinodkc], could you please close this and help review the existing PR? Thanks! > DenseMatrix gives different hashcode even though equals returns true > > > Key: SPARK-10414 > URL: https://issues.apache.org/jira/browse/SPARK-10414 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Vinod KC >Priority: Minor > > hashcode implementation in DenseMatrix gives different result for same input > val dm = Matrices.dense(2, 2, Array(0.0, 1.0, 2.0, 3.0)) > val dm1 = Matrices.dense(2, 2, Array(0.0, 1.0, 2.0, 3.0)) > assert(dm1 === dm) // passed > assert(dm1.hashCode === dm.hashCode) // Failed > This violates the hashCode/equals contract. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
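The contract violation above, and the standard fix being pursued in the existing PR, can be sketched with a minimal stand-in class (this is not the real `DenseMatrix`, just an illustration of deriving `hashCode` from the same fields `equals` compares):

```scala
// Minimal stand-in for a dense matrix: dimensions plus a values array.
class Mat(val numRows: Int, val numCols: Int, val values: Array[Double]) {
  override def equals(other: Any): Boolean = other match {
    case m: Mat =>
      numRows == m.numRows && numCols == m.numCols &&
        java.util.Arrays.equals(values, m.values)
    case _ => false
  }
  // Hash the same fields equals compares (Arrays.hashCode is content-based,
  // unlike the default Array hashCode), so equal matrices share a hash.
  override def hashCode: Int =
    31 * (31 * numRows + numCols) + java.util.Arrays.hashCode(values)
}

val dm = new Mat(2, 2, Array(0.0, 1.0, 2.0, 3.0))
val dm1 = new Mat(2, 2, Array(0.0, 1.0, 2.0, 3.0))
assert(dm1 == dm)                     // passes
assert(dm1.hashCode == dm.hashCode)   // now also passes
```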
[jira] [Assigned] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage
[ https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10454: Assignee: (was: Apache Spark) > Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause > multiple concurrent attempts for the same map stage > - > > Key: SPARK-10454 > URL: https://issues.apache.org/jira/browse/SPARK-10454 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.5.1 >Reporter: Pete Robbins >Priority: Minor > > test case fails intermittently in Jenkins. > For eg, see the following builds- > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/ > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage
[ https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731380#comment-14731380 ] Apache Spark commented on SPARK-10454: -- User 'robbinspg' has created a pull request for this issue: https://github.com/apache/spark/pull/8605 > Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause > multiple concurrent attempts for the same map stage > - > > Key: SPARK-10454 > URL: https://issues.apache.org/jira/browse/SPARK-10454 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.5.1 >Reporter: Pete Robbins >Priority: Minor > > test case fails intermittently in Jenkins. > For eg, see the following builds- > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/ > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage
[ https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10454: Assignee: Apache Spark > Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause > multiple concurrent attempts for the same map stage > - > > Key: SPARK-10454 > URL: https://issues.apache.org/jira/browse/SPARK-10454 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.5.1 >Reporter: Pete Robbins >Assignee: Apache Spark >Priority: Minor > > test case fails intermittently in Jenkins. > For eg, see the following builds- > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/ > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage
[ https://issues.apache.org/jira/browse/SPARK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731377#comment-14731377 ] Pete Robbins commented on SPARK-10454: -- This is another case of not waiting for events to drain from the listenerBus. > Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause > multiple concurrent attempts for the same map stage > - > > Key: SPARK-10454 > URL: https://issues.apache.org/jira/browse/SPARK-10454 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.5.1 >Reporter: Pete Robbins >Priority: Minor > > test case fails intermittently in Jenkins. > For eg, see the following builds- > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/ > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10454) Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage
Pete Robbins created SPARK-10454: Summary: Flaky test: o.a.s.scheduler.DAGSchedulerSuite.late fetch failures don't cause multiple concurrent attempts for the same map stage Key: SPARK-10454 URL: https://issues.apache.org/jira/browse/SPARK-10454 Project: Spark Issue Type: Bug Components: Scheduler, Spark Core Affects Versions: 1.5.1 Reporter: Pete Robbins Priority: Minor test case fails intermittently in Jenkins. For eg, see the following builds- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41991/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
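The diagnosis above (assertions racing the listener bus) has a conventional remedy inside Spark's own test suites: block until the bus has drained before asserting. A hedged sketch, assuming test-source access to `sc.listenerBus` as `DAGSchedulerSuite` has; `waitUntilEmpty` is a `private[spark]` test utility, so this pattern does not work from user code:

```scala
// Sketch only: drain pending SparkListener events before asserting,
// so the test does not observe partially delivered state.
val WAIT_TIMEOUT_MILLIS = 10000L
sc.listenerBus.waitUntilEmpty(WAIT_TIMEOUT_MILLIS)
// ...assertions against listener-collected state are now safe...
```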
[jira] [Resolved] (SPARK-10452) Pyspark worker security issue
[ https://issues.apache.org/jira/browse/SPARK-10452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-10452. Resolution: Not A Problem If you need your workers to run as your user, you need to configure YARN to use Kerberos. > Pyspark worker security issue > - > > Key: SPARK-10452 > URL: https://issues.apache.org/jira/browse/SPARK-10452 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.0 > Environment: Spark 1.4.0 running on hadoop 2.5.2. >Reporter: Michael Procopio >Priority: Critical > > The python worker launched by the executor is given the credentials used to > launch yarn. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10453) There's no way to use spark.dynamicAllocation.enabled with pyspark
[ https://issues.apache.org/jira/browse/SPARK-10453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-10453. Resolution: Not A Problem From http://spark.apache.org/docs/latest/running-on-yarn.html: {noformat} spark.yarn.executor.memoryOverhead executorMemory * 0.10, with minimum of 384 The amount of off heap memory (in megabytes) to be allocated per executor. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size (typically 6-10%). {noformat} That also encompasses the python workers. > There's no way to use spark.dynamicAllocation.enabled with pyspark > -- > > Key: SPARK-10453 > URL: https://issues.apache.org/jira/browse/SPARK-10453 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.0 > Environment: When using spark.dynamicAllocation.enabled, the > assumption is that memory/core resources will be mediated by the yarn > resource manager. Unfortunately, whatever value is used for > spark.executor.memory is consumed as JVM heap space by the executor. There's > no way to account for the memory requirements of the pyspark worker. > Executor JVM heap space should be decoupled from spark.executor.memory. >Reporter: Michael Procopio > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
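When the default 10% headroom quoted above is not enough for the Python workers, the overhead can be raised explicitly at submit time. A hedged example; the memory values and script name are purely illustrative:

```shell
# Give the executor JVM 1536m of heap and an explicit 1024 MB of
# off-heap headroom on YARN; the pyspark workers live in that overhead.
spark-submit \
  --master yarn \
  --conf spark.executor.memory=1536m \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  my_job.py
```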
[jira] [Created] (SPARK-10453) There's no way to use spark.dynamicAllocation.enabled with pyspark
Michael Procopio created SPARK-10453: Summary: There's no way to use spark.dynamicAllocation.enabled with pyspark Key: SPARK-10453 URL: https://issues.apache.org/jira/browse/SPARK-10453 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Environment: When using spark.dynamicAllocation.enabled, the assumption is that memory/core resources will be mediated by the yarn resource manager. Unfortunately, whatever value is used for spark.executor.memory is consumed as JVM heap space by the executor. There's no way to account for the memory requirements of the pyspark worker. Executor JVM heap space should be decoupled from spark.executor.memory. Reporter: Michael Procopio -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9666) ML 1.5 QA: model save/load audit
[ https://issues.apache.org/jira/browse/SPARK-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731299#comment-14731299 ] Joseph K. Bradley commented on SPARK-9666: -- Thanks for checking. Shall I mark this complete? > ML 1.5 QA: model save/load audit > > > Key: SPARK-9666 > URL: https://issues.apache.org/jira/browse/SPARK-9666 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: yuhao yang > > We should check to make sure no changes broke model import/export in > spark.mllib. > * If a model's name, data members, or constructors have changed _at all_, > then we likely need to support a new save/load format version. Different > versions must be tested in unit tests to ensure backwards compatibility > (i.e., verify we can load old model formats). > * Examples in the programming guide should include save/load when available. > It's important to try running each example in the guide whenever it is > modified (since there are no automated tests). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
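For context on what the audit above covers, the spark.mllib persistence API pairs an instance `save` with a companion `load`, and the QA item is that every released save format must stay loadable. A minimal hedged sketch using a directly constructed model (the path is hypothetical and a local-mode context is created just for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors

val sc = new SparkContext(
  new SparkConf().setAppName("save-load-sketch").setMaster("local[2]"))

// Build a trivial model directly rather than training one.
val model = new LogisticRegressionModel(Vectors.dense(0.5, -0.5), 0.1)

// Round-trip: save writes a versioned format; load must accept both the
// current version and formats written by older releases.
model.save(sc, "/tmp/lr-model") // hypothetical path
val reloaded = LogisticRegressionModel.load(sc, "/tmp/lr-model")
assert(reloaded.weights == model.weights)
```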
[jira] [Created] (SPARK-10452) Pyspark worker security issue
Michael Procopio created SPARK-10452: Summary: Pyspark worker security issue Key: SPARK-10452 URL: https://issues.apache.org/jira/browse/SPARK-10452 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Environment: Spark 1.4.0 running on hadoop 2.5.2. Reporter: Michael Procopio Priority: Critical The python worker launched by the executor is given the credentials used to launch yarn. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10451) Prevent unnecessary serializations in InMemoryColumnarTableScan
[ https://issues.apache.org/jira/browse/SPARK-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10451: Assignee: (was: Apache Spark) > Prevent unnecessary serializations in InMemoryColumnarTableScan > --- > > Key: SPARK-10451 > URL: https://issues.apache.org/jira/browse/SPARK-10451 > Project: Spark > Issue Type: Improvement >Reporter: Yash Datta > > In InMemoryColumnarTableScan, serialization of certain fields like > buildFilter, InMemoryRelation etc. can be avoided during task execution by > carefully managing the closure of mapPartitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10451) Prevent unnecessary serializations in InMemoryColumnarTableScan
[ https://issues.apache.org/jira/browse/SPARK-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731283#comment-14731283 ] Apache Spark commented on SPARK-10451: -- User 'saucam' has created a pull request for this issue: https://github.com/apache/spark/pull/8604 > Prevent unnecessary serializations in InMemoryColumnarTableScan > --- > > Key: SPARK-10451 > URL: https://issues.apache.org/jira/browse/SPARK-10451 > Project: Spark > Issue Type: Improvement >Reporter: Yash Datta > > In InMemoryColumnarTableScan, serialization of certain fields like > buildFilter, InMemoryRelation etc. can be avoided during task execution by > carefully managing the closure of mapPartitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10451) Prevent unnecessary serializations in InMemoryColumnarTableScan
[ https://issues.apache.org/jira/browse/SPARK-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10451: Assignee: Apache Spark > Prevent unnecessary serializations in InMemoryColumnarTableScan > --- > > Key: SPARK-10451 > URL: https://issues.apache.org/jira/browse/SPARK-10451 > Project: Spark > Issue Type: Improvement >Reporter: Yash Datta >Assignee: Apache Spark > > In InMemoryColumnarTableScan, serialization of certain fields like > buildFilter, InMemoryRelation etc. can be avoided during task execution by > carefully managing the closure of mapPartitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10451) Prevent unnecessary serializations in InMemoryColumnarTableScan
Yash Datta created SPARK-10451: -- Summary: Prevent unnecessary serializations in InMemoryColumnarTableScan Key: SPARK-10451 URL: https://issues.apache.org/jira/browse/SPARK-10451 Project: Spark Issue Type: Improvement Reporter: Yash Datta In InMemoryColumnarTableScan, serialization of certain fields like buildFilter, InMemoryRelation etc. can be avoided during task execution by carefully managing the closure of mapPartitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
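The fix proposed above is a standard Spark closure-hygiene pattern: copy just the fields a task needs into local vals before building the function handed to mapPartitions, so the closure captures those vals rather than the whole enclosing operator. A standalone sketch of the idea (this is an illustration of the pattern, not the actual InMemoryColumnarTableScan code):

```scala
// Stand-in for an operator holding heavyweight state; only `threshold`
// is needed inside the per-partition function.
class Scan(val threshold: Int, val heavyState: Array[Byte]) {
  def makeTask(): Iterator[Int] => Iterator[Int] = {
    // Referencing `threshold` directly inside the lambda would capture
    // `this`, dragging `heavyState` into every serialized task closure.
    val t = threshold // copy to a local val; only `t` is captured
    iter => iter.filter(_ > t)
  }
}

val scan = new Scan(10, new Array[Byte](1 << 20))
val task = scan.makeTask()
assert(task(Iterator(5, 15, 25)).toList == List(15, 25))
```

In real Spark code the same shape appears as local vals assigned immediately before `rdd.mapPartitions { ... }` so the closure cleaner has nothing extra to ship.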
[jira] [Assigned] (SPARK-10450) Minor SQL style, format, typo, readability fixes
[ https://issues.apache.org/jira/browse/SPARK-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10450: Assignee: Andrew Or (was: Apache Spark) > Minor SQL style, format, typo, readability fixes > > > Key: SPARK-10450 > URL: https://issues.apache.org/jira/browse/SPARK-10450 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > > This JIRA isn't exactly tied to one particular patch. Like SPARK-10003 it's > more of a continuous process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10450) Minor SQL style, format, typo, readability fixes
[ https://issues.apache.org/jira/browse/SPARK-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731211#comment-14731211 ] Apache Spark commented on SPARK-10450: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/8603 > Minor SQL style, format, typo, readability fixes > > > Key: SPARK-10450 > URL: https://issues.apache.org/jira/browse/SPARK-10450 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > > This JIRA isn't exactly tied to one particular patch. Like SPARK-10003 it's > more of a continuous process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10450) Minor SQL style, format, typo, readability fixes
[ https://issues.apache.org/jira/browse/SPARK-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10450: Assignee: Apache Spark (was: Andrew Or) > Minor SQL style, format, typo, readability fixes > > > Key: SPARK-10450 > URL: https://issues.apache.org/jira/browse/SPARK-10450 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Andrew Or >Assignee: Apache Spark >Priority: Minor > > This JIRA isn't exactly tied to one particular patch. Like SPARK-10003 it's > more of a continuous process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10450) Minor SQL style, format, typo, readability fixes
Andrew Or created SPARK-10450: - Summary: Minor SQL style, format, typo, readability fixes Key: SPARK-10450 URL: https://issues.apache.org/jira/browse/SPARK-10450 Project: Spark Issue Type: Improvement Components: SQL Reporter: Andrew Or Assignee: Andrew Or Priority: Minor This JIRA isn't exactly tied to one particular patch. Like SPARK-10003 it's more of a continuous process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10449) StructType.merge shouldn't merge DecimalTypes with different precisions and/or scales
Cheng Lian created SPARK-10449: -- Summary: StructType.merge shouldn't merge DecimalTypes with different precisions and/or scales Key: SPARK-10449 URL: https://issues.apache.org/jira/browse/SPARK-10449 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1, 1.3.1, 1.5.0 Reporter: Cheng Lian Schema merging should only handle struct fields. But currently we also reconcile decimal precision and scale information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
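The SPARK-10449 point can be made concrete with a small sketch (pure Python, not Spark's actual StructType/DecimalType API): a structural schema merge should fail fast when two DecimalTypes differ in precision or scale, rather than silently reconciling them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecimalType:
    """Minimal stand-in for a SQL decimal type (illustrative only)."""
    precision: int
    scale: int


def merge_decimal(a: DecimalType, b: DecimalType) -> DecimalType:
    # Desired behavior per the JIRA: merging handles struct fields only,
    # so mismatched decimal parameters are an error, not something to reconcile.
    if (a.precision, a.scale) != (b.precision, b.scale):
        raise ValueError(
            f"Cannot merge decimal({a.precision},{a.scale}) "
            f"with decimal({b.precision},{b.scale})"
        )
    return a
```

Identical decimal types merge to themselves; anything else raises, surfacing the schema conflict to the caller.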
[jira] [Commented] (SPARK-9925) Set SQLConf.SHUFFLE_PARTITIONS.key correctly for tests
[ https://issues.apache.org/jira/browse/SPARK-9925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731170#comment-14731170 ] Apache Spark commented on SPARK-9925: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/8602 > Set SQLConf.SHUFFLE_PARTITIONS.key correctly for tests > -- > > Key: SPARK-9925 > URL: https://issues.apache.org/jira/browse/SPARK-9925 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > > Right now, in our TestSQLContext/TestHiveContext, we use {{override def > numShufflePartitions: Int = this.getConf(SQLConf.SHUFFLE_PARTITIONS, 5)}} to > set {{SHUFFLE_PARTITIONS}}. However, we never put it into SQLConf. So, after we > use {{withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "number")}}, the number > of shuffle partitions will be set back to 200. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
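The SPARK-9925 save/restore bug can be modeled in a few lines of Python (names are illustrative, not Spark's actual API): if the test default of 5 only lives in an overridden getter and is never recorded in the conf, then restoring after a with-block unsets the key and the effective value falls back to the global default of 200. Recording the test default with set() fixes it.

```python
import contextlib

GLOBAL_DEFAULTS = {"spark.sql.shuffle.partitions": 200}


class SQLConf:
    """Toy stand-in for Spark's SQLConf (simplified model)."""

    def __init__(self):
        self.settings = {}

    def get(self, key):
        return self.settings.get(key, GLOBAL_DEFAULTS[key])

    def set(self, key, value):
        self.settings[key] = value

    def unset(self, key):
        self.settings.pop(key, None)


@contextlib.contextmanager
def with_sql_conf(conf, key, value):
    """Mirror of withSQLConf: set a value, then restore the prior state."""
    had_key = key in conf.settings
    old = conf.settings.get(key)
    conf.set(key, value)
    try:
        yield
    finally:
        if had_key:
            conf.set(key, old)  # the key was explicitly set before: put it back
        else:
            conf.unset(key)     # it was not set: fall back to the global default


# The fix: record the test default in the conf itself, so restore works.
conf = SQLConf()
conf.set("spark.sql.shuffle.partitions", 5)
```

With the default recorded, a with-block temporarily sees the overridden value and afterwards the conf reports 5 again rather than 200.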
[jira] [Commented] (SPARK-10447) Upgrade pyspark to use py4j 0.9
[ https://issues.apache.org/jira/browse/SPARK-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731164#comment-14731164 ] Justin Uang commented on SPARK-10447: - Agreed, I'm pretty sure that this will break some APIs and we'll have to fix those as we do the upgrade =). > Upgrade pyspark to use py4j 0.9 > --- > > Key: SPARK-10447 > URL: https://issues.apache.org/jira/browse/SPARK-10447 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.4.1 >Reporter: Justin Uang > > This was recently released, and it has many improvements, especially the > following: > {quote} > Python side: IDEs and interactive interpreters such as IPython can now get > help text/autocompletion for Java classes, objects, and members. This makes > Py4J an ideal tool to explore complex Java APIs (e.g., the Eclipse API). > Thanks to @jonahkichwacoders > {quote} > Normally we wrap all the APIs in Spark, but for the ones that aren't wrapped, this > would make it easier to go off-road by using the Java proxy objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org