[jira] [Created] (SPARK-10387) Code generation for decision tree

2015-09-01 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-10387:
-

 Summary: Code generation for decision tree
 Key: SPARK-10387
 URL: https://issues.apache.org/jira/browse/SPARK-10387
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Xiangrui Meng
Assignee: DB Tsai


Provide code generation for decision trees and tree ensembles. Let's first 
discuss the design and then create new JIRAs for the individual tasks.
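
To make the proposal concrete, here is a minimal, purely illustrative sketch of 
what generated prediction code could look like (the object name, method, and 
split values below are hypothetical, not part of any existing API):

{code}
// Hypothetical illustration only: instead of traversing Node objects at
// prediction time, codegen could emit the split logic of a trained tree as
// straight-line code and compile it, e.g. for a small depth-2 tree:
object GeneratedTreeModel {
  def predict(features: Array[Double]): Double = {
    if (features(2) <= 1.5) {
      if (features(0) <= 0.5) 0.0 else 1.0
    } else {
      if (features(3) <= 4.2) 1.0 else 0.0
    }
  }
}
{code}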






[jira] [Commented] (SPARK-7132) Add fit with validation set to spark.ml GBT

2015-09-01 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724889#comment-14724889
 ] 

Yanbo Liang commented on SPARK-7132:


I will work on this issue.
[~josephkb]
I propose another way to resolve this issue.
The GBT Estimator would still take a single input {DataFrame}, and we would 
split it into training and validation datasets internally.
Because the runWithValidation interface takes an RDD[LabeledPoint] as input, 
this is easy to handle.
At the end of the GBT Estimator, we can also union these two datasets again.
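
A rough sketch of the internal split this proposal implies (the split ratio, 
seed, and variable names are illustrative, and {{dataset}} stands in for the 
estimator's input DataFrame; this is not a settled API):

{code}
// Illustrative only: split the single input DataFrame inside the estimator,
// feed the two halves to runWithValidation, then union them again if needed.
val Array(trainingData, validationData) =
  dataset.randomSplit(Array(0.75, 0.25), seed = 42L)
// ... convert both parts to RDD[LabeledPoint] and call runWithValidation ...
val fullData = trainingData.unionAll(validationData)
{code}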

> Add fit with validation set to spark.ml GBT
> ---
>
> Key: SPARK-7132
> URL: https://issues.apache.org/jira/browse/SPARK-7132
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In spark.mllib GradientBoostedTrees, we have a method runWithValidation which 
> takes a validation set.  We should add that to the spark.ml API.
> This will require a bit of thinking about how the Pipelines API should handle 
> a validation set (since Transformers and Estimators only take 1 input 
> DataFrame).  The current plan is to include an extra column in the input 
> DataFrame which indicates whether the row is for training, validation, etc.
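
For reference, a minimal sketch of the extra-column idea mentioned in the 
quoted description, assuming {{data}} is the estimator's input DataFrame (the 
column name, values, and split ratio are hypothetical):

{code}
// Hypothetical sketch of the "extra column" plan: tag each row with its role
// so that a single input DataFrame can carry both training and validation data.
import org.apache.spark.sql.functions._

val tagged = data.withColumn(
  "datasetRole", when(rand(7L) < 0.8, "train").otherwise("validation"))
val train      = tagged.filter(col("datasetRole") === "train")
val validation = tagged.filter(col("datasetRole") === "validation")
{code}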






[jira] [Comment Edited] (SPARK-7132) Add fit with validation set to spark.ml GBT

2015-09-01 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724889#comment-14724889
 ] 

Yanbo Liang edited comment on SPARK-7132 at 9/1/15 7:02 AM:


I will work on this issue.
[~josephkb]
I propose another way to resolve this issue.
The GBT Estimator would still take a single input {code|DataFrame}, and we 
would split it into training and validation datasets internally.
Because the runWithValidation interface takes an RDD[LabeledPoint] as input, 
this is easy to handle.
At the end of the GBT Estimator, we can also union these two datasets again.


was (Author: yanboliang):
I will work on this issue.
[~josephkb]
I propose another way to resolve this issue.
The GBT Estimator would still take a single input {DataFrame}, and we would 
split it into training and validation datasets internally.
Because the runWithValidation interface takes an RDD[LabeledPoint] as input, 
this is easy to handle.
At the end of the GBT Estimator, we can also union these two datasets again.

> Add fit with validation set to spark.ml GBT
> ---
>
> Key: SPARK-7132
> URL: https://issues.apache.org/jira/browse/SPARK-7132
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In spark.mllib GradientBoostedTrees, we have a method runWithValidation which 
> takes a validation set.  We should add that to the spark.ml API.
> This will require a bit of thinking about how the Pipelines API should handle 
> a validation set (since Transformers and Estimators only take 1 input 
> DataFrame).  The current plan is to include an extra column in the input 
> DataFrame which indicates whether the row is for training, validation, etc.






[jira] [Comment Edited] (SPARK-7132) Add fit with validation set to spark.ml GBT

2015-09-01 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724889#comment-14724889
 ] 

Yanbo Liang edited comment on SPARK-7132 at 9/1/15 7:03 AM:


I will work on this issue.
[~josephkb]
I propose another way to resolve this issue.
The GBT Estimator would still take a single input DataFrame, and we would 
split it into training and validation datasets internally.
Because the runWithValidation interface takes an RDD[LabeledPoint] as input, 
this is easy to handle.
At the end of the GBT Estimator, we can also union these two datasets again.


was (Author: yanboliang):
I will work on this issue.
[~josephkb]
I propose another way to resolve this issue.
The GBT Estimator would still take a single input {code|DataFrame}, and we 
would split it into training and validation datasets internally.
Because the runWithValidation interface takes an RDD[LabeledPoint] as input, 
this is easy to handle.
At the end of the GBT Estimator, we can also union these two datasets again.

> Add fit with validation set to spark.ml GBT
> ---
>
> Key: SPARK-7132
> URL: https://issues.apache.org/jira/browse/SPARK-7132
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In spark.mllib GradientBoostedTrees, we have a method runWithValidation which 
> takes a validation set.  We should add that to the spark.ml API.
> This will require a bit of thinking about how the Pipelines API should handle 
> a validation set (since Transformers and Estimators only take 1 input 
> DataFrame).  The current plan is to include an extra column in the input 
> DataFrame which indicates whether the row is for training, validation, etc.






[jira] [Created] (SPARK-10388) Public dataset loader interface

2015-09-01 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-10388:
-

 Summary: Public dataset loader interface
 Key: SPARK-10388
 URL: https://issues.apache.org/jira/browse/SPARK-10388
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


It is very useful to have a public dataset loader to fetch ML datasets from 
popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design, 
requirements, and initial implementation.

{code}
val loader = new DatasetLoader(sqlContext)
val df = loader.get("libsvm", "rcv1_train.binary")
{code}

Users should be able to list (or preview) datasets, e.g.:
{code}
val datasets = loader.ls("libsvm") // returns a local DataFrame
datasets.show() // list all datasets under libsvm repo
{code}

It would be nice to allow 3rd-party packages to register new repos. Both the 
API and implementation are pending discussion. Note that this requires http and 
https support.
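
As a purely hypothetical sketch of what a third-party repo registration hook 
could look like (none of these names exist; the actual API is still open for 
discussion):

{code}
// Hypothetical only: a trait that third-party packages could implement and
// register, so the loader can discover additional dataset repositories.
trait DatasetRepo {
  def name: String
  def list(): Seq[String]                   // dataset names available in this repo
  def fetch(dataset: String): java.io.File  // download over http/https to a local file
}

object DatasetRepoRegistry {
  private val repos = scala.collection.mutable.Map.empty[String, DatasetRepo]
  def register(repo: DatasetRepo): Unit = repos(repo.name) = repo
  def get(repoName: String): Option[DatasetRepo] = repos.get(repoName)
}
{code}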






[jira] [Updated] (SPARK-10324) MLlib 1.6 Roadmap

2015-09-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10324:
--
Description: 
Following SPARK-8445, we created this master list for MLlib features we plan to 
have in Spark 1.6. Please view this list as a wish list rather than a concrete 
plan, because we don't have an accurate estimate of available resources. Due to 
limited review bandwidth, features appearing on this list will get higher 
priority during code review. But feel free to suggest new items to the list in 
comments. We are experimenting with this process. Your feedback would be 
greatly appreciated.

h1. Instructions

h2. For contributors:

* Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
carefully. Code style, documentation, and unit tests are important.
* If you are a first-time Spark contributor, please always start with a 
[starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
than a medium/big feature. Based on our experience, mixing the development 
process with a big feature usually causes long delays in code review.
* Never work silently. Let everyone know on the corresponding JIRA page when 
you start working on some features. This is to avoid duplicate work. For small 
features, you don't need to wait to get JIRA assigned.
* For medium/big features or features with dependencies, please get assigned 
first before coding and keep the ETA updated on the JIRA. If there is no 
activity on the JIRA page for a certain amount of time, the JIRA should be 
released for other contributors.
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
after another.
* Remember to add the `@Since("1.6.0")` annotation to new public APIs (see the short example after this list).
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review 
greatly helps improve others' code as well as yours.
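
A minimal example of the {{@Since}} annotation mentioned above (the package, 
class, and method names are made up for illustration):

{code}
// Inside Spark's own source tree (the Since annotation is private[spark]):
package org.apache.spark.ml.example   // hypothetical package, for illustration

import org.apache.spark.annotation.Since

class ExampleTransformer {
  /** A hypothetical new public method, annotated with the release introducing it. */
  @Since("1.6.0")
  def exampleMethod(input: Double): Double = input * 2.0
}
{code}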

h2. For committers:

* Try to break down big features into small and specific JIRA tasks and link 
them properly.
* Add "starter" label to starter tasks.
* Put a rough estimate for medium/big features and track the progress.
* If you start reviewing a PR, please add yourself to the Shepherd field on 
JIRA.
* If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
please ping a maintainer to make a final pass.
* After merging a PR, create and link JIRAs for Python, example code, and 
documentation if necessary.

h1. Roadmap (WIP)

This is NOT [a complete list of MLlib JIRAs for 
1.6|https://issues.apache.org/jira/issues/?filter=12333208]. We only include 
umbrella JIRAs and high-level tasks.

h2. Algorithms and performance

* log-linear model for survival analysis (SPARK-8518)
* normal equation approach for linear regression (SPARK-9834)
* iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835)
* robust linear regression with Huber loss (SPARK-3181)
* vector-free L-BFGS (SPARK-10078)
* tree partition by features (SPARK-3717)
* bisecting k-means (SPARK-6517)
* weighted instance support (SPARK-9610)
  * logistic regression (SPARK-7685)
  * linear regression (SPARK-9642)
  * random forest (SPARK-9478)
* locality sensitive hashing (LSH) (SPARK-5992)
* deep learning (SPARK-2352)
  * autoencoder (SPARK-4288)
  * restricted Boltzmann machine (RBM) (SPARK-4251)
  * convolutional neural network
* factorization machine (SPARK-7008)
* local linear algebra (SPARK-6442)
* distributed LU decomposition (SPARK-8514)

h2. Statistics

* univariate statistics as UDAFs (SPARK-10384)
* bivariate statistics as UDAFs (SPARK-10385)
* R-like statistics for GLMs (SPARK-9835)
* online hypothesis testing (SPARK-3147)

h2. Pipeline API

* pipeline persistence (SPARK-6725)
* ML attribute API improvements (SPARK-8515)
* feature transformers (SPARK-9930)
  * feature interaction (SPARK-9698)
  * SQL transformer (SPARK-8345)
  * ??
* test Kaggle datasets (SPARK-9941)

h2. Model persistence

* PMML export
  * naive Bayes (SPARK-8546)
  * decision tree (SPARK-8542)
* model save/load
  * FPGrowth (SPARK-6724)
  * PrefixSpan (SPARK-10386)
* code generation
  * decision tree and tree ensembles (SPARK-10387)

h2. Data sources

* LIBSVM data source (SPARK-10117)
* public dataset loader (SPARK-10388)

h2. Python API for ML

The main goal of the Python API is to have feature parity with the Scala/Java API.

* Python API for new algorithms
* Python API for missing methods

h2. SparkR API for ML

* support more families and link functions in SparkR::glm (SPARK-9838, 
SPARK-9839, SPARK-9840)
* better R formula support (SPARK-9681)
* model summary with R-like statistics for GLMs (SPARK-9836, SPARK-9837)

h2. Documentation

* re-organize user guide (SPARK-8517)
* @Since versions in spark.ml, pyspark.mllib, and pyspark.ml (SPARK-7751)
* automatically test example code in user guide (SPARK-10382)



  was:
Following SPARK-8445, we created this master list for MLlib features we plan to 
have in Spar

[jira] [Updated] (SPARK-10324) MLlib 1.6 Roadmap

2015-09-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10324:
--
Description: 
Following SPARK-8445, we created this master list for MLlib features we plan to 
have in Spark 1.6. Please view this list as a wish list rather than a concrete 
plan, because we don't have an accurate estimate of available resources. Due to 
limited review bandwidth, features appearing on this list will get higher 
priority during code review. But feel free to suggest new items to the list in 
comments. We are experimenting with this process. Your feedback would be 
greatly appreciated.

h1. Instructions

h2. For contributors:

* Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
carefully. Code style, documentation, and unit tests are important.
* If you are a first-time Spark contributor, please always start with a 
[starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
than a medium/big feature. Based on our experience, mixing the development 
process with a big feature usually causes long delays in code review.
* Never work silently. Let everyone know on the corresponding JIRA page when 
you start working on some features. This is to avoid duplicate work. For small 
features, you don't need to wait to get JIRA assigned.
* For medium/big features or features with dependencies, please get assigned 
first before coding and keep the ETA updated on the JIRA. If there is no 
activity on the JIRA page for a certain amount of time, the JIRA should be 
released for other contributors.
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
after another.
* Remember to add `@Since("1.6.0")` annotation to new public APIs.
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review 
greatly helps improve others' code as well as yours.

h2. For committers:

* Try to break down big features into small and specific JIRA tasks and link 
them properly.
* Add "starter" label to starter tasks.
* Put a rough estimate for medium/big features and track the progress.
* If you start reviewing a PR, please add yourself to the Shepherd field on 
JIRA.
* If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
please ping a maintainer to make a final pass.
* After merging a PR, create and link JIRAs for Python, example code, and 
documentation if necessary.

h1. Roadmap (WIP)

This is NOT [a complete list of MLlib JIRAs for 
1.6|https://issues.apache.org/jira/issues/?filter=12333208]. We only include 
umbrella JIRAs and high-level tasks.

h2. Algorithms and performance

* log-linear model for survival analysis (SPARK-8518)
* normal equation approach for linear regression (SPARK-9834)
* iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835)
* robust linear regression with Huber loss (SPARK-3181)
* vector-free L-BFGS (SPARK-10078)
* tree partition by features (SPARK-3717)
* bisecting k-means (SPARK-6517)
* weighted instance support (SPARK-9610)
** logistic regression (SPARK-7685)
** linear regression (SPARK-9642)
** random forest (SPARK-9478)
* locality sensitive hashing (LSH) (SPARK-5992)
* deep learning (SPARK-2352)
** autoencoder (SPARK-4288)
** restricted Boltzmann machine (RBM) (SPARK-4251)
** convolutional neural network (stretch)
* factorization machine (SPARK-7008)
* local linear algebra (SPARK-6442)
* distributed LU decomposition (SPARK-8514)

h2. Statistics

* univariate statistics as UDAFs (SPARK-10384)
* bivariate statistics as UDAFs (SPARK-10385)
* R-like statistics for GLMs (SPARK-9835)
* online hypothesis testing (SPARK-3147)

h2. Pipeline API

* pipeline persistence (SPARK-6725)
* ML attribute API improvements (SPARK-8515)
* feature transformers (SPARK-9930)
** feature interaction (SPARK-9698)
** SQL transformer (SPARK-8345)
** ??
* test Kaggle datasets (SPARK-9941)

h2. Model persistence

* PMML export
** naive Bayes (SPARK-8546)
** decision tree (SPARK-8542)
* model save/load
** FPGrowth (SPARK-6724)
** PrefixSpan (SPARK-10386)
* code generation
** decision tree and tree ensembles (SPARK-10387)

h2. Data sources

* LIBSVM data source (SPARK-10117)
* public dataset loader (SPARK-10388)

h2. Python API for ML

The main goal of the Python API is to have feature parity with the Scala/Java API.

* Python API for new algorithms
* Python API for missing methods

h2. SparkR API for ML

* support more families and link functions in SparkR::glm (SPARK-9838, 
SPARK-9839, SPARK-9840)
* better R formula support (SPARK-9681)
* model summary with R-like statistics for GLMs (SPARK-9836, SPARK-9837)

h2. Documentation

* re-organize user guide (SPARK-8517)
* @Since versions in spark.ml, pyspark.mllib, and pyspark.ml (SPARK-7751)
* automatically test example code in user guide (SPARK-10382)


  was:
Following SPARK-8445, we created this master list for ML

[jira] [Updated] (SPARK-10324) MLlib 1.6 Roadmap

2015-09-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10324:
--
Description: 
Following SPARK-8445, we created this master list for MLlib features we plan to 
have in Spark 1.6. Please view this list as a wish list rather than a concrete 
plan, because we don't have an accurate estimate of available resources. Due to 
limited review bandwidth, features appearing on this list will get higher 
priority during code review. But feel free to suggest new items to the list in 
comments. We are experimenting with this process. Your feedback would be 
greatly appreciated.

h1. Instructions

h2. For contributors:

* Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
carefully. Code style, documentation, and unit tests are important.
* If you are a first-time Spark contributor, please always start with a 
[starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
than a medium/big feature. Based on our experience, mixing the development 
process with a big feature usually causes long delays in code review.
* Never work silently. Let everyone know on the corresponding JIRA page when 
you start working on some features. This is to avoid duplicate work. For small 
features, you don't need to wait to get JIRA assigned.
* For medium/big features or features with dependencies, please get assigned 
first before coding and keep the ETA updated on the JIRA. If there is no 
activity on the JIRA page for a certain amount of time, the JIRA should be 
released for other contributors.
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
after another.
* Remember to add `@Since("1.6.0")` annotation to new public APIs.
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review 
greatly helps improve others' code as well as yours.

h2. For committers:

* Try to break down big features into small and specific JIRA tasks and link 
them properly.
* Add "starter" label to starter tasks.
* Put a rough estimate for medium/big features and track the progress.
* If you start reviewing a PR, please add yourself to the Shepherd field on 
JIRA.
* If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
please ping a maintainer to make a final pass.
* After merging a PR, create and link JIRAs for Python, example code, and 
documentation if necessary.

h1. Roadmap (WIP)

This is NOT [a complete list of MLlib JIRAs for 
1.6|https://issues.apache.org/jira/issues/?filter=12333208]. We only include 
umbrella JIRAs and high-level tasks.

h2. Algorithms and performance

* log-linear model for survival analysis (SPARK-8518)
* normal equation approach for linear regression (SPARK-9834)
* iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835)
* robust linear regression with Huber loss (SPARK-3181)
* vector-free L-BFGS (SPARK-10078)
* tree partition by features (SPARK-3717)
* bisecting k-means (SPARK-6517)
* weighted instance support (SPARK-9610)
** logistic regression (SPARK-7685)
** linear regression (SPARK-9642)
** random forest (SPARK-9478)
* locality sensitive hashing (LSH) (SPARK-5992)
* deep learning (SPARK-2352)
** autoencoder (SPARK-4288)
** restricted Boltzmann machine (RBM) (SPARK-4251)
** convolutional neural network (stretch)
* factorization machine (SPARK-7008)
* local linear algebra (SPARK-6442)
* distributed LU decomposition (SPARK-8514)

h2. Statistics

* univariate statistics as UDAFs (SPARK-10384)
* bivariate statistics as UDAFs (SPARK-10385)
* R-like statistics for GLMs (SPARK-9835)
* online hypothesis testing (SPARK-3147)

h2. Pipeline API

* pipeline persistence (SPARK-6725)
* ML attribute API improvements (SPARK-8515)
* feature transformers (SPARK-9930)
** feature interaction (SPARK-9698)
** SQL transformer (SPARK-8345)
** ??
* test Kaggle datasets (SPARK-9941)

h2. Model persistence

* PMML export
** naive Bayes (SPARK-8546)
** decision tree (SPARK-8542)
* model save/load
** FPGrowth (SPARK-6724)
** PrefixSpan (SPARK-10386)
* code generation
** decision tree and tree ensembles (SPARK-10387)

h2. Data sources

* LIBSVM data source (SPARK-10117)
* public dataset loader (SPARK-10388)

h2. Python API for ML

The main goal of the Python API is to have feature parity with the Scala/Java API.

* Python API for new algorithms
* Python API for missing methods

h2. SparkR API for ML

* support more families and link functions in SparkR::glm (SPARK-9838, 
SPARK-9839, SPARK-9840)
* better R formula support (SPARK-9681)
* model summary with R-like statistics for GLMs (SPARK-9836, SPARK-9837)

h2. Documentation

* re-organize user guide (SPARK-8517)
* @Since versions in spark.ml, pyspark.mllib, and pyspark.ml (SPARK-7751)
* automatically test example code in user guide (SPARK-10382)


  was:
Following SPARK-8445, we created this master list for

[jira] [Updated] (SPARK-10324) MLlib 1.6 Roadmap

2015-09-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10324:
--
Priority: Blocker  (was: Critical)

> MLlib 1.6 Roadmap
> -
>
> Key: SPARK-10324
> URL: https://issues.apache.org/jira/browse/SPARK-10324
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Blocker
>
> Following SPARK-8445, we created this master list for MLlib features we plan 
> to have in Spark 1.6. Please view this list as a wish list rather than a 
> concrete plan, because we don't have an accurate estimate of available 
> resources. Due to limited review bandwidth, features appearing on this list 
> will get higher priority during code review. But feel free to suggest new 
> items to the list in comments. We are experimenting with this process. Your 
> feedback would be greatly appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes long delays in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there is no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add `@Since("1.6.0")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and 
> documentation if necessary.
> h1. Roadmap (WIP)
> This is NOT [a complete list of MLlib JIRAs for 
> 1.6|https://issues.apache.org/jira/issues/?filter=12333208]. We only include 
> umbrella JIRAs and high-level tasks.
> h2. Algorithms and performance
> * log-linear model for survival analysis (SPARK-8518)
> * normal equation approach for linear regression (SPARK-9834)
> * iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835)
> * robust linear regression with Huber loss (SPARK-3181)
> * vector-free L-BFGS (SPARK-10078)
> * tree partition by features (SPARK-3717)
> * bisecting k-means (SPARK-6517)
> * weighted instance support (SPARK-9610)
> ** logistic regression (SPARK-7685)
> ** linear regression (SPARK-9642)
> ** random forest (SPARK-9478)
> * locality sensitive hashing (LSH) (SPARK-5992)
> * deep learning (SPARK-2352)
> ** autoencoder (SPARK-4288)
> ** restricted Boltzmann machine (RBM) (SPARK-4251)
> ** convolutional neural network (stretch)
> * factorization machine (SPARK-7008)
> * local linear algebra (SPARK-6442)
> * distributed LU decomposition (SPARK-8514)
> h2. Statistics
> * univariate statistics as UDAFs (SPARK-10384)
> * bivariate statistics as UDAFs (SPARK-10385)
> * R-like statistics for GLMs (SPARK-9835)
> * online hypothesis testing (SPARK-3147)
> h2. Pipeline API
> * pipeline persistence (SPARK-6725)
> * ML attribute API improvements (SPARK-8515)
> * feature transformers (SPARK-9930)
> ** feature interaction (SPARK-9698)
> ** SQL transformer (SPARK-8345)
> ** ??
> * test Kaggle datasets (SPARK-9941)
> h2. Model persistence
> * PMML export
> ** naive Bayes (SPARK-8546)
> ** decision tree (SPARK-8542)
> * model save/load
> ** FPGrowth (SPARK-6724)
> ** PrefixSpan (SPARK-10386)
> * code generation
> ** decision tree and tree ensembles (SPARK-10387)
> h2. Data sources
> * LIBSVM data source (SPARK-10117)
> * public dataset loader (SPARK-10388)
> h2. Python API for ML
> The main goal of the Python API is to have feature parity with the Scala/Java API.
> * Python API for new algorithms
> * Python API for missing metho

[jira] [Updated] (SPARK-10324) MLlib 1.6 Roadmap

2015-09-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10324:
--
Description: 
Following SPARK-8445, we created this master list for MLlib features we plan to 
have in Spark 1.6. Please view this list as a wish list rather than a concrete 
plan, because we don't have an accurate estimate of available resources. Due to 
limited review bandwidth, features appearing on this list will get higher 
priority during code review. But feel free to suggest new items to the list in 
comments. We are experimenting with this process. Your feedback would be 
greatly appreciated.

h1. Instructions

h2. For contributors:

* Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
carefully. Code style, documentation, and unit tests are important.
* If you are a first-time Spark contributor, please always start with a 
[starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
than a medium/big feature. Based on our experience, mixing the development 
process with a big feature usually causes long delays in code review.
* Never work silently. Let everyone know on the corresponding JIRA page when 
you start working on some features. This is to avoid duplicate work. For small 
features, you don't need to wait to get JIRA assigned.
* For medium/big features or features with dependencies, please get assigned 
first before coding and keep the ETA updated on the JIRA. If there is no 
activity on the JIRA page for a certain amount of time, the JIRA should be 
released for other contributors.
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
after another.
* Remember to add `@Since("1.6.0")` annotation to new public APIs.
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review 
greatly helps improve others' code as well as yours.

h2. For committers:

* Try to break down big features into small and specific JIRA tasks and link 
them properly.
* Add "starter" label to starter tasks.
* Put a rough estimate for medium/big features and track the progress.
* If you start reviewing a PR, please add yourself to the Shepherd field on 
JIRA.
* If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
please ping a maintainer to make a final pass.
* After merging a PR, create and link JIRAs for Python, example code, and 
documentation if necessary.

h1. Roadmap (WIP)

This is NOT [a complete list of MLlib JIRAs for 
1.6|https://issues.apache.org/jira/issues/?filter=12333208]. We only include 
umbrella JIRAs and high-level tasks.

h2. Algorithms and performance

* log-linear model for survival analysis (SPARK-8518)
* normal equation approach for linear regression (SPARK-9834)
* iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835)
* robust linear regression with Huber loss (SPARK-3181)
* vector-free L-BFGS (SPARK-10078)
* tree partition by features (SPARK-3717)
* bisecting k-means (SPARK-6517)
* weighted instance support (SPARK-9610)
** logistic regression (SPARK-7685)
** linear regression (SPARK-9642)
** random forest (SPARK-9478)
* locality sensitive hashing (LSH) (SPARK-5992)
* deep learning (SPARK-2352)
** autoencoder (SPARK-4288)
** restricted Boltzmann machine (RBM) (SPARK-4251)
** convolutional neural network (stretch)
* factorization machine (SPARK-7008)
* local linear algebra (SPARK-6442)
* distributed LU decomposition (SPARK-8514)

h2. Statistics

* univariate statistics as UDAFs (SPARK-10384)
* bivariate statistics as UDAFs (SPARK-10385)
* R-like statistics for GLMs (SPARK-9835)
* online hypothesis testing (SPARK-3147)

h2. Pipeline API

* pipeline persistence (SPARK-6725)
* ML attribute API improvements (SPARK-8515)
* feature transformers (SPARK-9930)
** feature interaction (SPARK-9698)
** SQL transformer (SPARK-8345)
** ??
* test Kaggle datasets (SPARK-9941)

h2. Model persistence

* PMML export
** naive Bayes (SPARK-8546)
** decision tree (SPARK-8542)
* model save/load
** FPGrowth (SPARK-6724)
** PrefixSpan (SPARK-10386)
* code generation
** decision tree and tree ensembles (SPARK-10387)

h2. Data sources

* LIBSVM data source (SPARK-10117)
* public dataset loader (SPARK-10388)

h2. Python API for ML

The main goal of the Python API is to have feature parity with the Scala/Java API.

* Python API for new algorithms
* Python API for missing methods

h2. SparkR API for ML

* support more families and link functions in SparkR::glm (SPARK-9838, 
SPARK-9839, SPARK-9840)
* better R formula support (SPARK-9681)
* model summary with R-like statistics for GLMs (SPARK-9836, SPARK-9837)

h2. Documentation

* re-organize user guide (SPARK-8517)
* @Since versions in spark.ml, pyspark.mllib, and pyspark.ml (SPARK-7751)
* automatically test example code in user guide (SPARK-10382)


  was:
Following SPARK-8445, we created this master list for MLlib features we plan to 
have in Spark 1.6

[jira] [Commented] (SPARK-10324) MLlib 1.6 Roadmap

2015-09-01 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724899#comment-14724899
 ] 

Xiangrui Meng commented on SPARK-10324:
---

Changed priority to blocker to make this list more discoverable.

> MLlib 1.6 Roadmap
> -
>
> Key: SPARK-10324
> URL: https://issues.apache.org/jira/browse/SPARK-10324
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Blocker
>
> Following SPARK-8445, we created this master list for MLlib features we plan 
> to have in Spark 1.6. Please view this list as a wish list rather than a 
> concrete plan, because we don't have an accurate estimate of available 
> resources. Due to limited review bandwidth, features appearing on this list 
> will get higher priority during code review. But feel free to suggest new 
> items to the list in comments. We are experimenting with this process. Your 
> feedback would be greatly appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes long delays in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there is no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add `@Since("1.6.0")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and 
> documentation if necessary.
> h1. Roadmap (WIP)
> This is NOT [a complete list of MLlib JIRAs for 
> 1.6|https://issues.apache.org/jira/issues/?filter=12333208]. We only include 
> umbrella JIRAs and high-level tasks.
> h2. Algorithms and performance
> * log-linear model for survival analysis (SPARK-8518)
> * normal equation approach for linear regression (SPARK-9834)
> * iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835)
> * robust linear regression with Huber loss (SPARK-3181)
> * vector-free L-BFGS (SPARK-10078)
> * tree partition by features (SPARK-3717)
> * bisecting k-means (SPARK-6517)
> * weighted instance support (SPARK-9610)
> ** logistic regression (SPARK-7685)
> ** linear regression (SPARK-9642)
> ** random forest (SPARK-9478)
> * locality sensitive hashing (LSH) (SPARK-5992)
> * deep learning (SPARK-2352)
> ** autoencoder (SPARK-4288)
> ** restricted Boltzmann machine (RBM) (SPARK-4251)
> ** convolutional neural network (stretch)
> * factorization machine (SPARK-7008)
> * local linear algebra (SPARK-6442)
> * distributed LU decomposition (SPARK-8514)
> h2. Statistics
> * univariate statistics as UDAFs (SPARK-10384)
> * bivariate statistics as UDAFs (SPARK-10385)
> * R-like statistics for GLMs (SPARK-9835)
> * online hypothesis testing (SPARK-3147)
> h2. Pipeline API
> * pipeline persistence (SPARK-6725)
> * ML attribute API improvements (SPARK-8515)
> * feature transformers (SPARK-9930)
> ** feature interaction (SPARK-9698)
> ** SQL transformer (SPARK-8345)
> ** ??
> * test Kaggle datasets (SPARK-9941)
> h2. Model persistence
> * PMML export
> ** naive Bayes (SPARK-8546)
> ** decision tree (SPARK-8542)
> * model save/load
> ** FPGrowth (SPARK-6724)
> ** PrefixSpan (SPARK-10386)
> * code generation
> ** decision tree and tree ensembles (SPARK-10387)
> h2. Data sources
> * LIBSVM data source (SPARK-10117)
> * public dataset loader (SPARK-10388)
> h2. Python API for ML
> The main goal of Python API is to have feature parity wit

[jira] [Created] (SPARK-10389) support order by non-attribute grouping expression on Aggregate

2015-09-01 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-10389:
---

 Summary: support order by non-attribute grouping expression on 
Aggregate
 Key: SPARK-10389
 URL: https://issues.apache.org/jira/browse/SPARK-10389
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan









[jira] [Updated] (SPARK-10389) support order by non-attribute grouping expression on Aggregate

2015-09-01 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-10389:

Description: For example, we should support "SELECT MAX(value) FROM src 
GROUP BY key + 1 ORDER BY key + 1".
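
A quick reproduction sketch in spark-shell, assuming a registered temp table 
{{src}} with integer columns {{key}} and {{value}} (names follow the example 
above; this is only an illustration):

{code}
// The query from the description, which this issue proposes to support.
val result = sqlContext.sql(
  "SELECT MAX(value) FROM src GROUP BY key + 1 ORDER BY key + 1")
result.show()
{code}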

> support order by non-attribute grouping expression on Aggregate
> ---
>
> Key: SPARK-10389
> URL: https://issues.apache.org/jira/browse/SPARK-10389
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> For example, we should support "SELECT MAX(value) FROM src GROUP BY key + 1 
> ORDER BY key + 1".






[jira] [Commented] (SPARK-10389) support order by non-attribute grouping expression on Aggregate

2015-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724905#comment-14724905
 ] 

Apache Spark commented on SPARK-10389:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/8548

> support order by non-attribute grouping expression on Aggregate
> ---
>
> Key: SPARK-10389
> URL: https://issues.apache.org/jira/browse/SPARK-10389
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> For example, we should support "SELECT MAX(value) FROM src GROUP BY key + 1 
> ORDER BY key + 1".






[jira] [Assigned] (SPARK-10389) support order by non-attribute grouping expression on Aggregate

2015-09-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10389:


Assignee: (was: Apache Spark)

> support order by non-attribute grouping expression on Aggregate
> ---
>
> Key: SPARK-10389
> URL: https://issues.apache.org/jira/browse/SPARK-10389
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> For example, we should support "SELECT MAX(value) FROM src GROUP BY key + 1 
> ORDER BY key + 1".






[jira] [Assigned] (SPARK-10389) support order by non-attribute grouping expression on Aggregate

2015-09-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10389:


Assignee: Apache Spark

> support order by non-attribute grouping expression on Aggregate
> ---
>
> Key: SPARK-10389
> URL: https://issues.apache.org/jira/browse/SPARK-10389
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>
> For example, we should support "SELECT MAX(value) FROM src GROUP BY key + 1 
> ORDER BY key + 1".






[jira] [Created] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J

2015-09-01 Thread Zoltán Zvara (JIRA)
Zoltán Zvara created SPARK-10390:


 Summary: Py4JJavaError java.lang.NoSuchMethodError: 
com.google.common.base.Stopwatch.elapsedMillis()J
 Key: SPARK-10390
 URL: https://issues.apache.org/jira/browse/SPARK-10390
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Zoltán Zvara


This error occurs while running PySpark through IPython.

{code}
Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
at 
org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
at 
org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
{code}

{{spark-env.sh}}
{code}
export IPYTHON=1
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=ipython3
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
{code}






[jira] [Created] (SPARK-10391) Spark 1.4.1 released news under news/spark-1-3-1-released.html

2015-09-01 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-10391:
---

 Summary: Spark 1.4.1 released news under 
news/spark-1-3-1-released.html
 Key: SPARK-10391
 URL: https://issues.apache.org/jira/browse/SPARK-10391
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.1
Reporter: Jacek Laskowski
Priority: Minor


The news item "Spark 1.4.1 released" is published under 
http://spark.apache.org/news/spark-1-3-1-released.html, which is inconsistent 
with the naming of the other news items.






[jira] [Commented] (SPARK-10261) Add @Since annotation to ml.evaluation

2015-09-01 Thread Tijo Thomas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724975#comment-14724975
 ] 

Tijo Thomas commented on SPARK-10261:
-

I am working on this issue. 

> Add @Since annotation to ml.evaluation
> --
>
> Key: SPARK-10261
> URL: https://issues.apache.org/jira/browse/SPARK-10261
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Priority: Minor
>  Labels: starter
>







[jira] [Created] (SPARK-10392) Pyspark - Wrong DateType support

2015-09-01 Thread Maciej Bryński (JIRA)
Maciej Bryński created SPARK-10392:
--

 Summary: Pyspark - Wrong DateType support
 Key: SPARK-10392
 URL: https://issues.apache.org/jira/browse/SPARK-10392
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Reporter: Maciej Bryński


I have the following problem.
I created this table:

{code}
CREATE TABLE `spark_test` (
`id` INT(11) NULL,
`date` DATE NULL
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
;
INSERT INTO `sandbox`.`spark_test` (`id`, `date`) VALUES (1, '1970-01-01');
{code}

Then, when I try to read the data, the date '1970-01-01' is converted to an 
int. This makes the RDD incompatible with its own schema.

{code}
df = sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 
'spark_test')
print(df.collect())
df = sqlCtx.createDataFrame(df.rdd, df.schema)

[Row(id=1, date=0)]
---
TypeError Traceback (most recent call last)
 in ()
  1 df = 
sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4",
 'spark_test')
  2 print(df.collect())
> 3 df = sqlCtx.createDataFrame(df.rdd, df.schema)

/mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, 
schema, samplingRatio)
402 
403 if isinstance(data, RDD):
--> 404 rdd, schema = self._createFromRDD(data, schema, 
samplingRatio)
405 else:
406 rdd, schema = self._createFromLocal(data, schema)

/mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, 
schema, samplingRatio)
296 rows = rdd.take(10)
297 for row in rows:
--> 298 _verify_type(row, schema)
299 
300 else:

/mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1152  "length of fields (%d)" % (len(obj), 
len(dataType.fields)))
   1153 for v, f in zip(obj, dataType.fields):
-> 1154 _verify_type(v, f.dataType)
   1155 
   1156 

/mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1136 # subclass of them can not be fromInternald in JVM
   1137 if type(obj) not in _acceptable_types[_type]:
-> 1138 raise TypeError("%s can not accept object in type %s" % 
(dataType, type(obj)))
   1139 
   1140 if isinstance(dataType, ArrayType):

TypeError: DateType can not accept object in type 

{code}






[jira] [Assigned] (SPARK-7770) Should GBT validationTol be relative tolerance?

2015-09-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7770:
---

Assignee: (was: Apache Spark)

> Should GBT validationTol be relative tolerance?
> ---
>
> Key: SPARK-7770
> URL: https://issues.apache.org/jira/browse/SPARK-7770
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In spark.mllib, GBT validationTol uses absolute tolerance.  Relative 
> tolerance is arguably easier to set in a meaningful way.  Questions:
> * Should we change spark.mllib's validationTol meaning?
> * Should we use relative tolerance in spark.ml's GBT (once we add validation 
> support)?
> I would vote for changing both to relative tolerance, where the tolerance is 
> relative to the current loss on the training set.
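
To make the two options concrete, here is a small illustrative sketch of the 
stopping checks (the names and epsilon guard are made up; this is not the 
actual GradientBoostedTrees code):

{code}
// Illustrative comparison of absolute vs. relative validationTol checks.
val validationTol = 1e-3

// Absolute tolerance: stop when the raw improvement in validation loss is small.
def shouldStopAbsolute(oldLoss: Double, newLoss: Double): Boolean =
  oldLoss - newLoss < validationTol

// Relative tolerance (the proposal): improvement measured relative to the
// magnitude of the current loss.
def shouldStopRelative(oldLoss: Double, newLoss: Double): Boolean =
  (oldLoss - newLoss) / math.max(math.abs(oldLoss), 1e-12) < validationTol
{code}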






[jira] [Commented] (SPARK-7770) Should GBT validationTol be relative tolerance?

2015-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725013#comment-14725013
 ] 

Apache Spark commented on SPARK-7770:
-

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8549

> Should GBT validationTol be relative tolerance?
> ---
>
> Key: SPARK-7770
> URL: https://issues.apache.org/jira/browse/SPARK-7770
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In spark.mllib, GBT validationTol uses absolute tolerance.  Relative 
> tolerance is arguably easier to set in a meaningful way.  Questions:
> * Should we change spark.mllib's validationTol meaning?
> * Should we use relative tolerance in spark.ml's GBT (once we add validation 
> support)?
> I would vote for changing both to relative tolerance, where the tolerance is 
> relative to the current loss on the training set.






[jira] [Assigned] (SPARK-7770) Should GBT validationTol be relative tolerance?

2015-09-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7770:
---

Assignee: Apache Spark

> Should GBT validationTol be relative tolerance?
> ---
>
> Key: SPARK-7770
> URL: https://issues.apache.org/jira/browse/SPARK-7770
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> In spark.mllib, GBT validationTol uses absolute tolerance.  Relative 
> tolerance is arguably easier to set in a meaningful way.  Questions:
> * Should we change spark.mllib's validationTol meaning?
> * Should we use relative tolerance in spark.ml's GBT (once we add validation 
> support)?
> I would vote for changing both to relative tolerance, where the tolerance is 
> relative to the current loss on the training set.






[jira] [Resolved] (SPARK-10301) For struct type, if parquet's global schema has less fields than a file's schema, data reading will fail

2015-09-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-10301.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8509
[https://github.com/apache/spark/pull/8509]

> For struct type, if parquet's global schema has less fields than a file's 
> schema, data reading will fail
> 
>
> Key: SPARK-10301
> URL: https://issues.apache.org/jira/browse/SPARK-10301
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 1.6.0
>
>
> We hit this issue when reading a complex Parquet dataset without turning on 
> schema merging.  The dataset consists of Parquet files with different but 
> compatible schemas.  In this way, the schema of the dataset is defined by 
> either a summary file or a random physical Parquet file if no summary files 
> are available.  Apparently, this schema may not contain all of the fields 
> that appear in the physical files.
> Parquet was designed with schema evolution and column pruning in mind, so it 
> should be legal for a user to use a tailored schema to read the dataset to 
> save disk IO.  For example, say we have a Parquet dataset consisting of two 
> physical Parquet files with the following two schemas:
> {noformat}
> message m0 {
>   optional group f0 {
> optional int64 f00;
> optional int64 f01;
>   }
> }
> message m1 {
>   optional group f0 {
> optional int64 f00;
> optional int64 f01;
> optional int64 f02;
>   }
>   optional double f1;
> }
> {noformat}
> Users should be allowed to read the dataset with the following schema:
> {noformat}
> message m1 {
>   optional group f0 {
> optional int64 f01;
> optional int64 f02;
>   }
> }
> {noformat}
> so that {{f0.f00}} and {{f1}} are never touched.  The above case can be 
> expressed by the following {{spark-shell}} snippet:
> {noformat}
> import sqlContext._
> import sqlContext.implicits._
> import org.apache.spark.sql.types.{LongType, StructType}
> val path = "/tmp/spark/parquet"
> range(3).selectExpr("NAMED_STRUCT('f00', id, 'f01', id) AS f0").coalesce(1)
> .write.mode("overwrite").parquet(path)
> range(3).selectExpr("NAMED_STRUCT('f00', id, 'f01', id, 'f02', id) AS f0", 
> "CAST(id AS DOUBLE) AS f1").coalesce(1)
> .write.mode("append").parquet(path)
> val tailoredSchema =
>   new StructType()
> .add(
>   "f0",
>   new StructType()
> .add("f01", LongType, nullable = true)
> .add("f02", LongType, nullable = true),
>   nullable = true)
> read.schema(tailoredSchema).parquet(path).show()
> {noformat}
> Expected output should be:
> {noformat}
> +--------+
> |      f0|
> +--------+
> |[0,null]|
> |[1,null]|
> |[2,null]|
> |   [0,0]|
> |   [1,1]|
> |   [2,2]|
> +--------+
> {noformat}
> However, current 1.5-SNAPSHOT version throws the following exception:
> {noformat}
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
> block -1 in file 
> hdfs://localhost:9000/tmp/spark/parquet/part-r-0-56c4604e-c546-4f97-a316-05da8ab1a0bf.gz.parquet
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
> at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>  

[jira] [Updated] (SPARK-10392) Pyspark - Wrong DateType support

2015-09-01 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-10392:
---
Affects Version/s: 1.4.1

> Pyspark - Wrong DateType support
> 
>
> Key: SPARK-10392
> URL: https://issues.apache.org/jira/browse/SPARK-10392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.1
>Reporter: Maciej Bryński
>
> I have the following problem.
> I created a table.
> {code}
> CREATE TABLE `spark_test` (
>   `id` INT(11) NULL,
>   `date` DATE NULL
> )
> COLLATE='utf8_general_ci'
> ENGINE=InnoDB
> ;
> INSERT INTO `sandbox`.`spark_test` (`id`, `date`) VALUES (1, '1970-01-01');
> {code}
> Then I try to read the data, and the date '1970-01-01' is converted to an int. 
> This makes the RDD incompatible with its own schema.
> {code}
> df = 
> sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 
> 'spark_test')
> print(df.collect())
> df = sqlCtx.createDataFrame(df.rdd, df.schema)
> [Row(id=1, date=0)]
> ---
> TypeError Traceback (most recent call last)
>  in ()
>   1 df = 
> sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4",
>  'spark_test')
>   2 print(df.collect())
> > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, 
> schema, samplingRatio)
> 402 
> 403 if isinstance(data, RDD):
> --> 404 rdd, schema = self._createFromRDD(data, schema, 
> samplingRatio)
> 405 else:
> 406 rdd, schema = self._createFromLocal(data, schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, 
> schema, samplingRatio)
> 296 rows = rdd.take(10)
> 297 for row in rows:
> --> 298 _verify_type(row, schema)
> 299 
> 300 else:
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1152  "length of fields (%d)" % (len(obj), 
> len(dataType.fields)))
>1153 for v, f in zip(obj, dataType.fields):
> -> 1154 _verify_type(v, f.dataType)
>1155 
>1156 
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1136 # subclass of them can not be fromInternald in JVM
>1137 if type(obj) not in _acceptable_types[_type]:
> -> 1138 raise TypeError("%s can not accept object in type %s" % 
> (dataType, type(obj)))
>1139 
>1140 if isinstance(dataType, ArrayType):
> TypeError: DateType can not accept object in type 
> {code}
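
For context, the failing check in {{_verify_type}} rejects the raw int because {{DateType}} columns are expected to carry Python {{datetime.date}} values. A minimal round trip built from date objects passes the same verification (a sketch, assuming the same running {{sqlCtx}} as in the report):

{code}
import datetime
from pyspark.sql.types import StructType, StructField, IntegerType, DateType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("date", DateType(), True),
])

# Rows carrying datetime.date values satisfy the schema, unlike the raw ints
# returned by the JDBC read above.
ok = sqlCtx.createDataFrame([(1, datetime.date(1970, 1, 1))], schema)
print(ok.collect())
{code}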



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10370) After a stages map outputs are registered, all running attempts should be marked as zombies

2015-09-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10370:


Assignee: (was: Apache Spark)

> After a stages map outputs are registered, all running attempts should be 
> marked as zombies
> ---
>
> Key: SPARK-10370
> URL: https://issues.apache.org/jira/browse/SPARK-10370
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Imran Rashid
>
> Follow-up to SPARK-5259.  During stage retry, it's possible for a stage to 
> "complete" by registering all its map output and starting the downstream 
> stages before the latest task set has completed.  This will result in the 
> earlier task set continuing to submit tasks that are both unnecessary and 
> increase the chance of hitting SPARK-8029.
> Spark should mark all task sets for a stage as zombie as soon as its map 
> output is registered.  Note that this involves coordination between the 
> various scheduler components ({{DAGScheduler}} and {{TaskSetManager}} at 
> least), which isn't easily testable with the current setup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10370) After a stages map outputs are registered, all running attempts should be marked as zombies

2015-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725043#comment-14725043
 ] 

Apache Spark commented on SPARK-10370:
--

User 'suyanNone' has created a pull request for this issue:
https://github.com/apache/spark/pull/8550

> After a stages map outputs are registered, all running attempts should be 
> marked as zombies
> ---
>
> Key: SPARK-10370
> URL: https://issues.apache.org/jira/browse/SPARK-10370
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Imran Rashid
>
> Follow-up to SPARK-5259.  During stage retry, it's possible for a stage to 
> "complete" by registering all its map output and starting the downstream 
> stages before the latest task set has completed.  This will result in the 
> earlier task set continuing to submit tasks that are both unnecessary and 
> increase the chance of hitting SPARK-8029.
> Spark should mark all task sets for a stage as zombie as soon as its map 
> output is registered.  Note that this involves coordination between the 
> various scheduler components ({{DAGScheduler}} and {{TaskSetManager}} at 
> least), which isn't easily testable with the current setup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10370) After a stages map outputs are registered, all running attempts should be marked as zombies

2015-09-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10370:


Assignee: Apache Spark

> After a stages map outputs are registered, all running attempts should be 
> marked as zombies
> ---
>
> Key: SPARK-10370
> URL: https://issues.apache.org/jira/browse/SPARK-10370
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Imran Rashid
>Assignee: Apache Spark
>
> Follow-up to SPARK-5259.  During stage retry, it's possible for a stage to 
> "complete" by registering all its map output and starting the downstream 
> stages before the latest task set has completed.  This will result in the 
> earlier task set continuing to submit tasks that are both unnecessary and 
> increase the chance of hitting SPARK-8029.
> Spark should mark all task sets for a stage as zombie as soon as its map 
> output is registered.  Note that this involves coordination between the 
> various scheduler components ({{DAGScheduler}} and {{TaskSetManager}} at 
> least), which isn't easily testable with the current setup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10246) Join in PySpark using a list of column names

2015-09-01 Thread Alexey Grishchenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725045#comment-14725045
 ] 

Alexey Grishchenko commented on SPARK-10246:


Cannot reproduce; all the options with multiple conditions work on the master 
branch:
{code}
>>> df.join(df4, ['name', 'age']).collect()
[Row(age=5, name=u'Bob', height=None)]
>>> df.join(df4, (df.name == df4.name) & (df.age == df4.age)).collect()
[Row(age=5, name=u'Bob', age=5, height=None, name=u'Bob')]
>>> cond = [df.name == df4.name, df.age == df4.age]
>>> df.join(df4, cond).collect()
[Row(age=5, name=u'Bob', age=5, height=None, name=u'Bob')]
{code}
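
One thing the outputs above illustrate (a side note on current DataFrame.join semantics, using the same df/df4 fixtures): joining on a list of column names emits each join column once, while joining on an expression keeps both sides' copies.

{code}
# Name-list form: 'name' and 'age' each appear once in the result.
deduped = df.join(df4, ['name', 'age'])

# Expression form: both sides' 'name' and 'age' columns are kept.
kept = df.join(df4, (df.name == df4.name) & (df.age == df4.age))

deduped.printSchema()
kept.printSchema()
{code}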

> Join in PySpark using a list of column names
> 
>
> Key: SPARK-10246
> URL: https://issues.apache.org/jira/browse/SPARK-10246
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Michal Monselise
>
> Currently, there are two supported ways to perform a join: a join condition 
> and a single column name.
> The documentation specifies that the join function can accept a list of 
> conditions or a list of column names, but neither is currently supported. 
> This is discussed in issue SPARK-7197 as well.
> Functionality should match the documentation, which currently contains an 
> example in /spark/python/pyspark/sql/dataframe.py line 560:
> >>> df.join(df4, ['name', 'age']).select(df.name, df.age).collect()
> [Row(name=u'Bob', age=5)]
> """



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J

2015-09-01 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltán Zvara updated SPARK-10390:
-
Description: 
While running PySpark through iPython.

{code}
Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
at 
org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
at 
org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
{code}

{{spark-env.sh}}
{code}
export IPYTHON=1
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=ipython3
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
{code}

Spark built with:
{{build/sbt -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 assembly --error}}

  was:
While running PySpark through iPython.

{code}
Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:

[jira] [Updated] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J

2015-09-01 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltán Zvara updated SPARK-10390:
-
Description: 
While running PySpark through iPython.

{code}
Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
at 
org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
at 
org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
{code}

{{spark-env.sh}}
{code}
export IPYTHON=1
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=ipython3
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
{code}

Spark built with:
{{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}}

  was:
While running PySpark through iPython.

{code}
Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:

[jira] [Updated] (SPARK-10392) Pyspark - Wrong DateType support

2015-09-01 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-10392:
---
Description: 
I have following problem.
I created table.

{code}
CREATE TABLE `spark_test` (
`id` INT(11) NULL,
`date` DATE NULL
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
;
INSERT INTO `sandbox`.`spark_test` (`id`, `date`) VALUES (1, '1970-01-01');
{code}

Then I'm trying to read data - date '1970-01-01' is converted to int. This 
makes data frame incompatible with its own schema.

{code}
df = sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 
'spark_test')
print(df.collect())
df = sqlCtx.createDataFrame(df.rdd, df.schema)

[Row(id=1, date=0)]
---
TypeError Traceback (most recent call last)
 in ()
  1 df = 
sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4",
 'spark_test')
  2 print(df.collect())
> 3 df = sqlCtx.createDataFrame(df.rdd, df.schema)

/mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, 
schema, samplingRatio)
402 
403 if isinstance(data, RDD):
--> 404 rdd, schema = self._createFromRDD(data, schema, 
samplingRatio)
405 else:
406 rdd, schema = self._createFromLocal(data, schema)

/mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, 
schema, samplingRatio)
296 rows = rdd.take(10)
297 for row in rows:
--> 298 _verify_type(row, schema)
299 
300 else:

/mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1152  "length of fields (%d)" % (len(obj), 
len(dataType.fields)))
   1153 for v, f in zip(obj, dataType.fields):
-> 1154 _verify_type(v, f.dataType)
   1155 
   1156 

/mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1136 # subclass of them can not be fromInternald in JVM
   1137 if type(obj) not in _acceptable_types[_type]:
-> 1138 raise TypeError("%s can not accept object in type %s" % 
(dataType, type(obj)))
   1139 
   1140 if isinstance(dataType, ArrayType):

TypeError: DateType can not accept object in type 

{code}

  was:
I have following problem.
I created table.

{code}
CREATE TABLE `spark_test` (
`id` INT(11) NULL,
`date` DATE NULL
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
;
INSERT INTO `sandbox`.`spark_test` (`id`, `date`) VALUES (1, '1970-01-01');
{code}

Then I'm trying to read data - date '1970-01-01' is converted to int. This 
makes rdd incompatible with its own schema.

{code}
df = sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 
'spark_test')
print(df.collect())
df = sqlCtx.createDataFrame(df.rdd, df.schema)

[Row(id=1, date=0)]
---
TypeError Traceback (most recent call last)
 in ()
  1 df = 
sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4",
 'spark_test')
  2 print(df.collect())
> 3 df = sqlCtx.createDataFrame(df.rdd, df.schema)

/mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, 
schema, samplingRatio)
402 
403 if isinstance(data, RDD):
--> 404 rdd, schema = self._createFromRDD(data, schema, 
samplingRatio)
405 else:
406 rdd, schema = self._createFromLocal(data, schema)

/mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, 
schema, samplingRatio)
296 rows = rdd.take(10)
297 for row in rows:
--> 298 _verify_type(row, schema)
299 
300 else:

/mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1152  "length of fields (%d)" % (len(obj), 
len(dataType.fields)))
   1153 for v, f in zip(obj, dataType.fields):
-> 1154 _verify_type(v, f.dataType)
   1155 
   1156 

/mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1136 # subclass of them can not be fromInternald in JVM
   1137 if type(obj) not in _acceptable_types[_type]:
-> 1138 raise TypeError("%s can not accept object in type %s" % 
(dataType, type(obj)))
   1139 
   1140 if isinstance(dataType, ArrayType):

TypeError: DateType can not accept object in type 

{code}


> Pyspark - Wrong DateType support
> 
>
> Key: SPARK-10392
> URL: https://issues.apache.org/jira/browse/SPARK-10392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions

[jira] [Updated] (SPARK-10392) Pyspark - Wrong DateType support

2015-09-01 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-10392:
---
Description: 
I have following problem.
I created table.

{code}
CREATE TABLE `spark_test` (
`id` INT(11) NULL,
`date` DATE NULL
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
;
INSERT INTO `sandbox`.`spark_test` (`id`, `date`) VALUES (1, '1970-01-01');
{code}

Then I'm trying to read data - date '1970-01-01' is converted to int. This 
makes rdd incompatible with its own schema.

{code}
df = sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 
'spark_test')
print(df.collect())
df = sqlCtx.createDataFrame(df.rdd, df.schema)

[Row(id=1, date=0)]
---
TypeError Traceback (most recent call last)
 in ()
  1 df = 
sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4",
 'spark_test')
  2 print(df.collect())
> 3 df = sqlCtx.createDataFrame(df.rdd, df.schema)

/mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, 
schema, samplingRatio)
402 
403 if isinstance(data, RDD):
--> 404 rdd, schema = self._createFromRDD(data, schema, 
samplingRatio)
405 else:
406 rdd, schema = self._createFromLocal(data, schema)

/mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, 
schema, samplingRatio)
296 rows = rdd.take(10)
297 for row in rows:
--> 298 _verify_type(row, schema)
299 
300 else:

/mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1152  "length of fields (%d)" % (len(obj), 
len(dataType.fields)))
   1153 for v, f in zip(obj, dataType.fields):
-> 1154 _verify_type(v, f.dataType)
   1155 
   1156 

/mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1136 # subclass of them can not be fromInternald in JVM
   1137 if type(obj) not in _acceptable_types[_type]:
-> 1138 raise TypeError("%s can not accept object in type %s" % 
(dataType, type(obj)))
   1139 
   1140 if isinstance(dataType, ArrayType):

TypeError: DateType can not accept object in type 

{code}

  was:
I have following problem.
I created table.

{code}
CREATE TABLE `spark_test` (
`id` INT(11) NULL,
`date` DATE NULL
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
;
INSERT INTO `sandbox`.`spark_test` (`id`, `date`) VALUES (1, '1970-01-01');
{code}

Then I'm trying to read data and date '1970-01-01' is converted to int. This 
makes rdd incompatible with its own schema.

{code}
df = sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 
'spark_test')
print(df.collect())
df = sqlCtx.createDataFrame(df.rdd, df.schema)

[Row(id=1, date=0)]
---
TypeError Traceback (most recent call last)
 in ()
  1 df = 
sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4",
 'spark_test')
  2 print(df.collect())
> 3 df = sqlCtx.createDataFrame(df.rdd, df.schema)

/mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, 
schema, samplingRatio)
402 
403 if isinstance(data, RDD):
--> 404 rdd, schema = self._createFromRDD(data, schema, 
samplingRatio)
405 else:
406 rdd, schema = self._createFromLocal(data, schema)

/mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, 
schema, samplingRatio)
296 rows = rdd.take(10)
297 for row in rows:
--> 298 _verify_type(row, schema)
299 
300 else:

/mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1152  "length of fields (%d)" % (len(obj), 
len(dataType.fields)))
   1153 for v, f in zip(obj, dataType.fields):
-> 1154 _verify_type(v, f.dataType)
   1155 
   1156 

/mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1136 # subclass of them can not be fromInternald in JVM
   1137 if type(obj) not in _acceptable_types[_type]:
-> 1138 raise TypeError("%s can not accept object in type %s" % 
(dataType, type(obj)))
   1139 
   1140 if isinstance(dataType, ArrayType):

TypeError: DateType can not accept object in type 

{code}


> Pyspark - Wrong DateType support
> 
>
> Key: SPARK-10392
> URL: https://issues.apache.org/jira/browse/SPARK-10392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4

[jira] [Commented] (SPARK-7544) pyspark.sql.types.Row should implement __getitem__

2015-09-01 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-7544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725066#comment-14725066
 ] 

Maciej Bryński commented on SPARK-7544:
---

Will this PR be added to Spark?

> pyspark.sql.types.Row should implement __getitem__
> --
>
> Key: SPARK-7544
> URL: https://issues.apache.org/jira/browse/SPARK-7544
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following from the related discussions in [SPARK-7505] and [SPARK-7133], the 
> {{Row}} type should implement {{\_\_getitem\_\_}} so that people can do this
> {code}
> row['field']
> {code}
> instead of this:
> {code}
> row.field
> {code}
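
In the meantime, a small workaround sketch that relies only on the existing {{Row}} API ({{asDict}} and positional tuple access), not on the proposed {{\_\_getitem\_\_}}:

{code}
from pyspark.sql import Row

row = Row(name='Alice', age=11)

# Dictionary-style access by field name, without __getitem__ on Row itself.
name = row.asDict()['name']

# Positional access also works, because Row subclasses tuple.
first_field = row[0]
{code}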



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J

2015-09-01 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltán Zvara updated SPARK-10390:
-
Description: 
While running PySpark through iPython.

{code}
Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
at 
org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
at 
org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
{code}

{{spark-env.sh}}
{code}
export IPYTHON=1
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=ipython3
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
{code}

Spark built with:
{{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}}

Not a problem, when built against {{Hadoop 2.4}}!

  was:
While running PySpark through iPython.

{code}
Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
at org.apach

[jira] [Resolved] (SPARK-10391) Spark 1.4.1 released news under news/spark-1-3-1-released.html

2015-09-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10391.
---
   Resolution: Fixed
 Assignee: Sean Owen
Fix Version/s: 1.5.0

Fixed and pushed a revision to the site. Make sure to refresh in your browser 
to get the new HTML with the fixed link for the "Spark 1.4.1 released" news 
item.

> Spark 1.4.1 released news under news/spark-1-3-1-released.html
> --
>
> Key: SPARK-10391
> URL: https://issues.apache.org/jira/browse/SPARK-10391
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.4.1
>Reporter: Jacek Laskowski
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.5.0
>
>
> The link to the news "Spark 1.4.1 released" is under 
> http://spark.apache.org/news/spark-1-3-1-released.html. It's certainly 
> inconsistent with the other news.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10393) use ML pipeline in LDA example

2015-09-01 Thread yuhao yang (JIRA)
yuhao yang created SPARK-10393:
--

 Summary: use ML pipeline in LDA example
 Key: SPARK-10393
 URL: https://issues.apache.org/jira/browse/SPARK-10393
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: yuhao yang
Priority: Minor


Since the logic of the text processing part has been moved to ML 
estimators/transformers, replace the related code in LDA Example with the ML 
pipeline. 
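
Roughly the shape the reworked example could take (a sketch with illustrative column names and parameters; it assumes a {{corpus}} DataFrame with a raw {{text}} column, which is not part of this issue):

{code}
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, CountVectorizer

tokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+")
vectorizer = CountVectorizer(inputCol="tokens", outputCol="features", vocabSize=10000)

# The fitted pipeline produces the term-count vectors that feed the LDA trainer.
pipeline = Pipeline(stages=[tokenizer, vectorizer])
featurized = pipeline.fit(corpus).transform(corpus)
{code}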





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J

2015-09-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725073#comment-14725073
 ] 

Sean Owen commented on SPARK-10390:
---

This means you've pulled in a later version of Guava. Make sure you didn't 
package anything newer than Guava 14 with your app, perhaps by accidentally 
bringing in Hadoop deps. I don't think this is a Spark problem (at least, not 
given the history of why Guava can't be entirely shaded, etc.).
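
One way to confirm which jar is supplying Guava on the driver, sketched through the py4j gateway from the same PySpark session (a diagnostic idea, not an official procedure):

{code}
# Ask the JVM where the Stopwatch class was loaded from; a location other than
# the Spark assembly usually means an app or Hadoop dependency brought in a
# newer Guava whose Stopwatch no longer has elapsedMillis().
cls = sc._jvm.java.lang.Class.forName("com.google.common.base.Stopwatch")
print(cls.getProtectionDomain().getCodeSource().getLocation().toString())
{code}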

> Py4JJavaError java.lang.NoSuchMethodError: 
> com.google.common.base.Stopwatch.elapsedMillis()J
> 
>
> Key: SPARK-10390
> URL: https://issues.apache.org/jira/browse/SPARK-10390
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Zoltán Zvara
>
> While running PySpark through iPython.
> {code}
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : java.lang.NoSuchMethodError: 
> com.google.common.base.Stopwatch.elapsedMillis()J
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
>   at 
> org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
>   at 
> org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> {{spark-env.sh}}
> {code}
> export IPYTHON=1
> export PYSPARK_PYTHON=/usr/bin/python3
> export PYSPARK_DRIVER_PYTHON=ipython3
> export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
> {code}
> Spark built with:
> {{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}}
> Not a problem, when built against {{Hadoop 2.4}}!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10314) [CORE]RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception when parallelism is big than data split size

2015-09-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10314:
--
Fix Version/s: (was: 1.6.0)

[~wangxiaoyu] Don't set Fix version; it's not resolved.

> [CORE]RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception 
> when parallelism is big than data split size
> 
>
> Key: SPARK-10314
> URL: https://issues.apache.org/jira/browse/SPARK-10314
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.4.1
> Environment: Spark 1.4.1,Hadoop 2.6.0,Tachyon 0.6.4
>Reporter: Xiaoyu Wang
>Priority: Minor
>
> RDD persist to OFF_HEAP Tachyon gets a block rdd_x_x not found exception when 
> parallelism is bigger than the data split size.
> {code}
> val rdd = sc.parallelize(List(1, 2),2)
> rdd.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP)
> rdd.count()
> {code}
> is ok.
> {code}
> val rdd = sc.parallelize(List(1, 2),3)
> rdd.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP)
> rdd.count()
> {code}
> got exception:
> {noformat}
> 15/08/27 17:53:07 INFO SparkContext: Starting job: count at :24
> 15/08/27 17:53:07 INFO DAGScheduler: Got job 0 (count at :24) with 3 
> output partitions (allowLocal=false)
> 15/08/27 17:53:07 INFO DAGScheduler: Final stage: ResultStage 0(count at 
> :24)
> 15/08/27 17:53:07 INFO DAGScheduler: Parents of final stage: List()
> 15/08/27 17:53:07 INFO DAGScheduler: Missing parents: List()
> 15/08/27 17:53:07 INFO DAGScheduler: Submitting ResultStage 0 
> (ParallelCollectionRDD[0] at parallelize at :21), which has no 
> missing parents
> 15/08/27 17:53:07 INFO MemoryStore: ensureFreeSpace(1096) called with 
> curMem=0, maxMem=741196431
> 15/08/27 17:53:07 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 1096.0 B, free 706.9 MB)
> 15/08/27 17:53:07 INFO MemoryStore: ensureFreeSpace(788) called with 
> curMem=1096, maxMem=741196431
> 15/08/27 17:53:07 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 788.0 B, free 706.9 MB)
> 15/08/27 17:53:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on localhost:43776 (size: 788.0 B, free: 706.9 MB)
> 15/08/27 17:53:07 INFO SparkContext: Created broadcast 0 from broadcast at 
> DAGScheduler.scala:874
> 15/08/27 17:53:07 INFO DAGScheduler: Submitting 3 missing tasks from 
> ResultStage 0 (ParallelCollectionRDD[0] at parallelize at :21)
> 15/08/27 17:53:07 INFO TaskSchedulerImpl: Adding task set 0.0 with 3 tasks
> 15/08/27 17:53:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
> localhost, PROCESS_LOCAL, 1269 bytes)
> 15/08/27 17:53:07 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
> localhost, PROCESS_LOCAL, 1270 bytes)
> 15/08/27 17:53:07 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 
> localhost, PROCESS_LOCAL, 1270 bytes)
> 15/08/27 17:53:07 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
> 15/08/27 17:53:07 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 15/08/27 17:53:07 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
> 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_2 not found, computing it
> 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_1 not found, computing it
> 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_0 not found, computing it
> 15/08/27 17:53:07 INFO ExternalBlockStore: ExternalBlockStore started
> 15/08/27 17:53:08 WARN : tachyon.home is not set. Using 
> /mnt/tachyon_default_home as the default value.
> 15/08/27 17:53:08 INFO : Tachyon client (version 0.6.4) is trying to connect 
> master @ localhost/127.0.0.1:19998
> 15/08/27 17:53:08 INFO : User registered at the master 
> localhost/127.0.0.1:19998 got UserId 109
> 15/08/27 17:53:08 INFO TachyonBlockManager: Created tachyon directory at 
> /spark/spark-c6ec419f-7c7d-48a6-8448-c2431e761ea5/driver/spark-tachyon-20150827175308-6aa5
> 15/08/27 17:53:08 INFO : Trying to get local worker host : localhost
> 15/08/27 17:53:08 INFO : Connecting local worker @ localhost/127.0.0.1:29998
> 15/08/27 17:53:08 INFO : Folder /mnt/ramdisk/tachyonworker/users/109 was 
> created!
> 15/08/27 17:53:08 INFO : /mnt/ramdisk/tachyonworker/users/109/4386235351040 
> was created!
> 15/08/27 17:53:08 INFO : /mnt/ramdisk/tachyonworker/users/109/4388382834688 
> was created!
> 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_0 on ExternalBlockStore 
> on localhost:43776 (size: 0.0 B)
> 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_1 on ExternalBlockStore 
> on localhost:43776 (size: 2.0 B)
> 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_2 on ExternalBlockStore 
> on localhost:43776 (size: 2.0 B)
> 15/08/27 17:53:08 INFO BlockManager: Found block rdd_0_

[jira] [Commented] (SPARK-10393) use ML pipeline in LDA example

2015-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725078#comment-14725078
 ] 

Apache Spark commented on SPARK-10393:
--

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/8551

> use ML pipeline in LDA example
> --
>
> Key: SPARK-10393
> URL: https://issues.apache.org/jira/browse/SPARK-10393
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: yuhao yang
>Priority: Minor
>
> Since the logic of the text processing part has been moved to ML 
> estimators/transformers, replace the related code in LDA Example with the ML 
> pipeline. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10393) use ML pipeline in LDA example

2015-09-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10393:


Assignee: (was: Apache Spark)

> use ML pipeline in LDA example
> --
>
> Key: SPARK-10393
> URL: https://issues.apache.org/jira/browse/SPARK-10393
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: yuhao yang
>Priority: Minor
>
> Since the logic of the text processing part has been moved to ML 
> estimators/transformers, replace the related code in LDA Example with the ML 
> pipeline. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10393) use ML pipeline in LDA example

2015-09-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10393:


Assignee: Apache Spark

> use ML pipeline in LDA example
> --
>
> Key: SPARK-10393
> URL: https://issues.apache.org/jira/browse/SPARK-10393
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: yuhao yang
>Assignee: Apache Spark
>Priority: Minor
>
> Since the logic of the text processing part has been moved to ML 
> estimators/transformers, replace the related code in LDA Example with the ML 
> pipeline. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9089) Failing to run simple job on Spark Standalone Cluster

2015-09-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-9089.
--
Resolution: Cannot Reproduce

> Failing to run simple job on Spark Standalone Cluster
> -
>
> Key: SPARK-9089
> URL: https://issues.apache.org/jira/browse/SPARK-9089
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
> Environment: Staging
>Reporter: Amar Goradia
>Priority: Critical
>
> We are trying out Spark and, as part of that, we have set up a standalone Spark 
> cluster. While testing things out, we simply opened the PySpark shell and ran 
> this simple job: a=sc.parallelize([1,2,3]).count()
> As a result, we are getting errors. We tried googling this error but 
> haven't been able to find the exact reason why we are running into this 
> state. Can somebody please help us look further into this issue and advise us 
> on what we are missing here?
> Here is full error stack:
> >>> a=sc.parallelize([1,2,3]).count()
> 15/07/16 00:52:15 INFO SparkContext: Starting job: count at :1
> 15/07/16 00:52:15 INFO DAGScheduler: Got job 5 (count at :1) with 2 
> output partitions (allowLocal=false)
> 15/07/16 00:52:15 INFO DAGScheduler: Final stage: ResultStage 5(count at 
> :1)
> 15/07/16 00:52:15 INFO DAGScheduler: Parents of final stage: List()
> 15/07/16 00:52:15 INFO DAGScheduler: Missing parents: List()
> 15/07/16 00:52:15 INFO DAGScheduler: Submitting ResultStage 5 (PythonRDD[12] 
> at count at :1), which has no missing parents
> 15/07/16 00:52:15 INFO TaskSchedulerImpl: Cancelling stage 5
> 15/07/16 00:52:15 INFO DAGScheduler: ResultStage 5 (count at :1) 
> failed in Unknown s
> 15/07/16 00:52:15 INFO DAGScheduler: Job 5 failed: count at :1, took 
> 0.004963 s
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 
> 972, in count
> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 
> 963, in sum
> return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
>   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 
> 771, in reduce
> vals = self.mapPartitions(func).collect()
>   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 
> 745, in collect
> port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
>   File 
> "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
>   File 
> "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 
> serialization failed: java.lang.reflect.InvocationTargetException
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:68)
> org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:60)
> org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73)
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:80)
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
> org.apache.spark.SparkContext.broadcast(SparkContext.scala:1289)
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:874)
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:815)
> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:799)
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1419)
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
> org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
>   at 
> o
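
The trace above points at {{CompressionCodec$.createCodec}} failing while serializing the task broadcast, which can be caused by a broken snappy native library on the nodes (an assumption about this particular failure, not a confirmed diagnosis). A hedged way to test that theory is to force the pure-JVM lzf codec and rerun the job:

{code}
from pyspark import SparkConf, SparkContext

# Force the lzf codec before creating the context, bypassing snappy entirely.
conf = SparkConf().set("spark.io.compression.codec", "lzf")
sc = SparkContext(conf=conf)
print(sc.parallelize([1, 2, 3]).count())
{code}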

[jira] [Updated] (SPARK-10392) Pyspark - Wrong DateType support

2015-09-01 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-10392:
---
Description: 
I have following problem.
I created table.

{code}
CREATE TABLE `spark_test` (
`id` INT(11) NULL,
`date` DATE NULL
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
;
INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01');
{code}

Then I'm trying to read data - date '1970-01-01' is converted to int. This 
makes data frame incompatible with its own schema.

{code}
df = sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 
'spark_test')
print(df.collect())
df = sqlCtx.createDataFrame(df.rdd, df.schema)

[Row(id=1, date=0)]
---
TypeError Traceback (most recent call last)
 in ()
  1 df = 
sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4",
 'spark_test')
  2 print(df.collect())
> 3 df = sqlCtx.createDataFrame(df.rdd, df.schema)

/mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, 
schema, samplingRatio)
402 
403 if isinstance(data, RDD):
--> 404 rdd, schema = self._createFromRDD(data, schema, 
samplingRatio)
405 else:
406 rdd, schema = self._createFromLocal(data, schema)

/mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, 
schema, samplingRatio)
296 rows = rdd.take(10)
297 for row in rows:
--> 298 _verify_type(row, schema)
299 
300 else:

/mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1152  "length of fields (%d)" % (len(obj), 
len(dataType.fields)))
   1153 for v, f in zip(obj, dataType.fields):
-> 1154 _verify_type(v, f.dataType)
   1155 
   1156 

/mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1136 # subclass of them can not be fromInternald in JVM
   1137 if type(obj) not in _acceptable_types[_type]:
-> 1138 raise TypeError("%s can not accept object in type %s" % 
(dataType, type(obj)))
   1139 
   1140 if isinstance(dataType, ArrayType):

TypeError: DateType can not accept object in type 

{code}

  was:
I have following problem.
I created table.

{code}
CREATE TABLE `spark_test` (
`id` INT(11) NULL,
`date` DATE NULL
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
;
INSERT INTO `sandbox`.`spark_test` (`id`, `date`) VALUES (1, '1970-01-01');
{code}

Then I'm trying to read data - date '1970-01-01' is converted to int. This 
makes data frame incompatible with its own schema.

{code}
df = sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 
'spark_test')
print(df.collect())
df = sqlCtx.createDataFrame(df.rdd, df.schema)

[Row(id=1, date=0)]
---
TypeError Traceback (most recent call last)
 in ()
  1 df = 
sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4",
 'spark_test')
  2 print(df.collect())
> 3 df = sqlCtx.createDataFrame(df.rdd, df.schema)

/mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, 
schema, samplingRatio)
402 
403 if isinstance(data, RDD):
--> 404 rdd, schema = self._createFromRDD(data, schema, 
samplingRatio)
405 else:
406 rdd, schema = self._createFromLocal(data, schema)

/mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, 
schema, samplingRatio)
296 rows = rdd.take(10)
297 for row in rows:
--> 298 _verify_type(row, schema)
299 
300 else:

/mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1152  "length of fields (%d)" % (len(obj), 
len(dataType.fields)))
   1153 for v, f in zip(obj, dataType.fields):
-> 1154 _verify_type(v, f.dataType)
   1155 
   1156 

/mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1136 # subclass of them can not be fromInternald in JVM
   1137 if type(obj) not in _acceptable_types[_type]:
-> 1138 raise TypeError("%s can not accept object in type %s" % 
(dataType, type(obj)))
   1139 
   1140 if isinstance(dataType, ArrayType):

TypeError: DateType can not accept object in type 

{code}


> Pyspark - Wrong DateType support
> 
>
> Key: SPARK-10392
> URL: https://issues.apache.org/jira/browse/SPARK-10392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1

[jira] [Commented] (SPARK-9878) ReduceByKey + FullOuterJoin return 0 element if using an empty RDD

2015-09-01 Thread Alexey Grishchenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725117#comment-14725117
 ] 

Alexey Grishchenko commented on SPARK-9878:
---

Not reproduced on master:
{code}
scala> println("ok 
:"+sc.parallelize(List((3,4),(4,5))).fullOuterJoin(sc.emptyRDD[(Int,Seq[Int])]).count)
ok :2
scala> println("ko: 
"+sc.parallelize(List((3,4),(4,5))).fullOuterJoin(sc.emptyRDD[(Int,Seq[Int])].reduceByKey((e1,
 e2) => e1 ++ e2)).count)
ko: 2
{code}

>  ReduceByKey + FullOuterJoin return 0 element if using an empty RDD
> ---
>
> Key: SPARK-9878
> URL: https://issues.apache.org/jira/browse/SPARK-9878
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
> Environment: linux ubuntu 64b spark-hadoop
> launched with Local[2]
>Reporter: durand remi
>Priority: Minor
>
> code to reproduce:
> println("ok 
> :"+sc.parallelize(List((3,4),(4,5))).fullOuterJoin(sc.emptyRDD[(Int,Seq[Int])]).count)
> println("ko: 
> "+sc.parallelize(List((3,4),(4,5))).fullOuterJoin(sc.emptyRDD[(Int,Seq[Int])].reduceByKey((e1,
>  e2) => e1 ++ e2)).count)
> what i expect: 
> ok: 2
> ko: 2
> but what i have:
> ok: 2
> ko: 0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9878) ReduceByKey + FullOuterJoin return 0 element if using an empty RDD

2015-09-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-9878.
--
Resolution: Cannot Reproduce

Agree, I also can't reproduce this.

>  ReduceByKey + FullOuterJoin return 0 element if using an empty RDD
> ---
>
> Key: SPARK-9878
> URL: https://issues.apache.org/jira/browse/SPARK-9878
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
> Environment: linux ubuntu 64b spark-hadoop
> launched with Local[2]
>Reporter: durand remi
>Priority: Minor
>
> code to reproduce:
> println("ok 
> :"+sc.parallelize(List((3,4),(4,5))).fullOuterJoin(sc.emptyRDD[(Int,Seq[Int])]).count)
> println("ko: 
> "+sc.parallelize(List((3,4),(4,5))).fullOuterJoin(sc.emptyRDD[(Int,Seq[Int])].reduceByKey((e1,
>  e2) => e1 ++ e2)).count)
> what i expect: 
> ok: 2
> ko: 2
> but what i have:
> ok: 2
> ko: 0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8730) Deser primitive class with Java serialization

2015-09-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8730:
-
Assignee: Eugen Cepoi

> Deser primitive class with Java serialization
> -
>
> Key: SPARK-8730
> URL: https://issues.apache.org/jira/browse/SPARK-8730
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Eugen Cepoi
>Assignee: Eugen Cepoi
>Priority: Critical
> Fix For: 1.6.0
>
>
> Objects that have a primitive Class as a property cannot be deserialized 
> using the Java serde, because Class.forName does not work for primitives.
> Example object:
> class Foo extends Serializable {
>   val intClass = classOf[Int]
> }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10374) Spark-core 1.5.0-RC2 can create version conflicts with apps depending on protobuf-2.4

2015-09-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10374:
--
Component/s: Build

> Spark-core 1.5.0-RC2 can create version conflicts with apps depending on 
> protobuf-2.4
> -
>
> Key: SPARK-10374
> URL: https://issues.apache.org/jira/browse/SPARK-10374
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Matt Cheah
>
> My Hadoop cluster is running 2.0.0-CDH4.7.0, and I have an application that 
> depends on the Spark 1.5.0 libraries via Gradle, and Hadoop 2.0.0 libraries. 
> When I run the driver application, I can hit the following error:
> {code}
> … java.lang.UnsupportedOperationException: This is 
> supposed to be overridden by subclasses.
> at 
> com.google.protobuf.GeneratedMessage.getUnknownFields(GeneratedMessage.java:180)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto.getSerializedSize(ClientNamenodeProtocolProtos.java:30108)
> at 
> com.google.protobuf.AbstractMessageLite.toByteString(AbstractMessageLite.java:49)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.constructRpcRequest(ProtobufRpcEngine.java:149)
> {code}
> This application used to work when pulling in Spark 1.4.1 dependencies, and 
> thus this is a regression.
> I used Gradle’s dependencyInsight task to dig a bit deeper. Against our Spark 
> 1.4.1-backed project, it shows that dependency resolution pulls in Protobuf 
> 2.4.0a from the Hadoop CDH4 modules and Protobuf 2.5.0-spark from the Spark 
> modules. It appears that Spark used to shade its protobuf dependencies and 
> hence Spark’s and Hadoop’s protobuf dependencies wouldn’t collide. However, 
> when I ran dependencyInsight against Spark 1.5, it appears that protobuf is 
> no longer shaded in the Spark module.
> 1.4.1 dependencyInsight:
> {code}
> com.google.protobuf:protobuf-java:2.4.0a
> +--- org.apache.hadoop:hadoop-common:2.0.0-cdh4.6.0
> |\--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0
> | +--- compile
> | \--- org.apache.spark:spark-core_2.10:1.4.1
> |  +--- compile
> |  +--- org.apache.spark:spark-sql_2.10:1.4.1
> |  |\--- compile
> |  \--- org.apache.spark:spark-catalyst_2.10:1.4.1
> |   \--- org.apache.spark:spark-sql_2.10:1.4.1 (*)
> \--- org.apache.hadoop:hadoop-hdfs:2.0.0-cdh4.6.0
>  \--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0 (*)
> org.spark-project.protobuf:protobuf-java:2.5.0-spark
> \--- org.spark-project.akka:akka-remote_2.10:2.3.4-spark
>  \--- org.apache.spark:spark-core_2.10:1.4.1
>   +--- compile
>   +--- org.apache.spark:spark-sql_2.10:1.4.1
>   |\--- compile
>   \--- org.apache.spark:spark-catalyst_2.10:1.4.1
>\--- org.apache.spark:spark-sql_2.10:1.4.1 (*)
> {code}
> 1.5.0-rc2 dependencyInsight:
> {code}
> com.google.protobuf:protobuf-java:2.5.0 (conflict resolution)
> \--- com.typesafe.akka:akka-remote_2.10:2.3.11
>  \--- org.apache.spark:spark-core_2.10:1.5.0-rc2
>   +--- compile
>   +--- org.apache.spark:spark-sql_2.10:1.5.0-rc2
>   |\--- compile
>   \--- org.apache.spark:spark-catalyst_2.10:1.5.0-rc2
>\--- org.apache.spark:spark-sql_2.10:1.5.0-rc2 (*)
> com.google.protobuf:protobuf-java:2.4.0a -> 2.5.0
> +--- org.apache.hadoop:hadoop-common:2.0.0-cdh4.6.0
> |\--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0
> | +--- compile
> | \--- org.apache.spark:spark-core_2.10:1.5.0-rc2
> |  +--- compile
> |  +--- org.apache.spark:spark-sql_2.10:1.5.0-rc2
> |  |\--- compile
> |  \--- org.apache.spark:spark-catalyst_2.10:1.5.0-rc2
> |   \--- org.apache.spark:spark-sql_2.10:1.5.0-rc2 (*)
> \--- org.apache.hadoop:hadoop-hdfs:2.0.0-cdh4.6.0
>  \--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0 (*)
> {code}
> Clearly we can't force the version to be one way or the other. If I force 
> protobuf to use 2.5.0, then invoking Hadoop code from my application will 
> break as Hadoop 2.0.0 jars are compiled against protobuf-2.4. On the other 
> hand, forcing protobuf to use version 2.4 breaks spark-core code that is 
> compiled against protobuf-2.5. Note that protobuf-2.4 and protobuf-2.5 are 
> not binary compatible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10394) Make GBTParams use shared "stepSize"

2015-09-01 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-10394:
---

 Summary: Make GBTParams use shared "stepSize"
 Key: SPARK-10394
 URL: https://issues.apache.org/jira/browse/SPARK-10394
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Yanbo Liang
Priority: Minor


GBTParams currently defines "stepSize" as its learning rate.
ML already has the shared param class "HasStepSize"; GBTParams can extend it 
rather than duplicating the implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10394) Make GBTParams use shared "stepSize"

2015-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725133#comment-14725133
 ] 

Apache Spark commented on SPARK-10394:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8552

> Make GBTParams use shared "stepSize"
> 
>
> Key: SPARK-10394
> URL: https://issues.apache.org/jira/browse/SPARK-10394
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> GBTParams has "stepSize" as learning rate currently.
> ML has shared param class "HasStepSize", GBTParams can extend from it rather 
> than duplicated implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10394) Make GBTParams use shared "stepSize"

2015-09-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10394:


Assignee: (was: Apache Spark)

> Make GBTParams use shared "stepSize"
> 
>
> Key: SPARK-10394
> URL: https://issues.apache.org/jira/browse/SPARK-10394
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> GBTParams has "stepSize" as learning rate currently.
> ML has shared param class "HasStepSize", GBTParams can extend from it rather 
> than duplicated implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10394) Make GBTParams use shared "stepSize"

2015-09-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10394:


Assignee: Apache Spark

> Make GBTParams use shared "stepSize"
> 
>
> Key: SPARK-10394
> URL: https://issues.apache.org/jira/browse/SPARK-10394
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> GBTParams has "stepSize" as learning rate currently.
> ML has shared param class "HasStepSize", GBTParams can extend from it rather 
> than duplicated implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9622) DecisionTreeRegressor: provide variance of prediction

2015-09-01 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725137#comment-14725137
 ] 

Yanbo Liang commented on SPARK-9622:


I agree with returning a Double column of variances for now.
I will try to submit a PR.

> DecisionTreeRegressor: provide variance of prediction
> -
>
> Key: SPARK-9622
> URL: https://issues.apache.org/jira/browse/SPARK-9622
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Variance of predicted value, as estimated from training data.
> Analogous to class probabilities for classification.
> See [SPARK-3727] for discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10395) Simplify CatalystReadSupport

2015-09-01 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-10395:
--

 Summary: Simplify CatalystReadSupport
 Key: SPARK-10395
 URL: https://issues.apache.org/jira/browse/SPARK-10395
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor


The Parquet {{ReadSupport}} API is a little overcomplicated for historical 
reasons.  In older versions of parquet-mr (say 1.6.0rc3 and prior), 
{{ReadSupport}} needed to be instantiated and initialized twice, once on the 
driver side and once on the executor side.  The {{init()}} method is for 
driver-side initialization, while {{prepareForRead()}} is for the executor 
side.  However, starting from parquet-mr 1.6.0 this is no longer the case, and 
{{ReadSupport}} is only instantiated and initialized on the executor side.  So, 
theoretically, it is now fine to combine these two methods into a single 
initialization method.  The only reason (that I can think of) to still have 
both is backwards compatibility with the parquet-mr API.

For this reason, we no longer need to rely on {{ReadContext}} to pass the 
requested schema from {{init()}} to {{prepareForRead()}}; a private {{var}} for 
the requested schema in {{CatalystReadSupport}} would be enough.

Also, after removing the old Parquet support code, we now always set the 
Catalyst requested schema properly when reading Parquet files, so all of the 
"fallback" logic in {{CatalystReadSupport}} is now redundant.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10395) Simplify CatalystReadSupport

2015-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725175#comment-14725175
 ] 

Apache Spark commented on SPARK-10395:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/8553

> Simplify CatalystReadSupport
> 
>
> Key: SPARK-10395
> URL: https://issues.apache.org/jira/browse/SPARK-10395
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> The API interface of Parquet {{ReadSupport}} is a little bit over complicated 
> because of historical reasons.  In older versions of parquet-mr (say 1.6.0rc3 
> and prior), {{ReadSupport}} need to be instantiated and initialized twice on 
> both driver side and executor side.  The {{init()}} method is for driver side 
> initialization, while {{prepareForRead()}} is for executor side.  However, 
> starting from parquet-mr 1.6.0, it's no longer the case, and {{ReadSupport}} 
> is only instantiated and initialized on executor side.  So, theoretically, 
> now it's totally fine to combine these two methods into a single 
> initialization method.  The only reason (I could think of) to still have them 
> here is for parquet-mr API backwards-compatibility.
> Due to this reason, we no longer need to rely on {{ReadContext}} to pass 
> requested schema from {{init()}} to {{prepareForRead()}}, using a private 
> `var` for requested schema in {{CatalystReadSupport}} would be enough.
> Another thing is that, after removing the old Parquet support code, now we 
> always set Catalyst requested schema properly when reading Parquet files.  So 
> all those "fallback" logic in {{CatalystReadSupport}} is now redundant.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10395) Simplify CatalystReadSupport

2015-09-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10395:


Assignee: Cheng Lian  (was: Apache Spark)

> Simplify CatalystReadSupport
> 
>
> Key: SPARK-10395
> URL: https://issues.apache.org/jira/browse/SPARK-10395
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> The API interface of Parquet {{ReadSupport}} is a little bit over complicated 
> because of historical reasons.  In older versions of parquet-mr (say 1.6.0rc3 
> and prior), {{ReadSupport}} need to be instantiated and initialized twice on 
> both driver side and executor side.  The {{init()}} method is for driver side 
> initialization, while {{prepareForRead()}} is for executor side.  However, 
> starting from parquet-mr 1.6.0, it's no longer the case, and {{ReadSupport}} 
> is only instantiated and initialized on executor side.  So, theoretically, 
> now it's totally fine to combine these two methods into a single 
> initialization method.  The only reason (I could think of) to still have them 
> here is for parquet-mr API backwards-compatibility.
> Due to this reason, we no longer need to rely on {{ReadContext}} to pass 
> requested schema from {{init()}} to {{prepareForRead()}}, using a private 
> `var` for requested schema in {{CatalystReadSupport}} would be enough.
> Another thing is that, after removing the old Parquet support code, now we 
> always set Catalyst requested schema properly when reading Parquet files.  So 
> all those "fallback" logic in {{CatalystReadSupport}} is now redundant.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10395) Simplify CatalystReadSupport

2015-09-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10395:


Assignee: Apache Spark  (was: Cheng Lian)

> Simplify CatalystReadSupport
> 
>
> Key: SPARK-10395
> URL: https://issues.apache.org/jira/browse/SPARK-10395
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>Priority: Minor
>
> The API interface of Parquet {{ReadSupport}} is a little bit over complicated 
> because of historical reasons.  In older versions of parquet-mr (say 1.6.0rc3 
> and prior), {{ReadSupport}} need to be instantiated and initialized twice on 
> both driver side and executor side.  The {{init()}} method is for driver side 
> initialization, while {{prepareForRead()}} is for executor side.  However, 
> starting from parquet-mr 1.6.0, it's no longer the case, and {{ReadSupport}} 
> is only instantiated and initialized on executor side.  So, theoretically, 
> now it's totally fine to combine these two methods into a single 
> initialization method.  The only reason (I could think of) to still have them 
> here is for parquet-mr API backwards-compatibility.
> Due to this reason, we no longer need to rely on {{ReadContext}} to pass 
> requested schema from {{init()}} to {{prepareForRead()}}, using a private 
> `var` for requested schema in {{CatalystReadSupport}} would be enough.
> Another thing is that, after removing the old Parquet support code, now we 
> always set Catalyst requested schema properly when reading Parquet files.  So 
> all those "fallback" logic in {{CatalystReadSupport}} is now redundant.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J

2015-09-01 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725189#comment-14725189
 ] 

Zoltán Zvara commented on SPARK-10390:
--

I did not packed Guava with my app, this is a clean Spark build in terms of 
dependencies, built with:

{{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}}

> Py4JJavaError java.lang.NoSuchMethodError: 
> com.google.common.base.Stopwatch.elapsedMillis()J
> 
>
> Key: SPARK-10390
> URL: https://issues.apache.org/jira/browse/SPARK-10390
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Zoltán Zvara
>
> While running PySpark through iPython.
> {code}
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : java.lang.NoSuchMethodError: 
> com.google.common.base.Stopwatch.elapsedMillis()J
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
>   at 
> org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
>   at 
> org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> {{spark-env.sh}}
> {code}
> export IPYTHON=1
> export PYSPARK_PYTHON=/usr/bin/python3
> export PYSPARK_DRIVER_PYTHON=ipython3
> export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
> {code}
> Spark built with:
> {{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}}
> Not a problem, when built against {{Hadoop 2.4}}!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J

2015-09-01 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725189#comment-14725189
 ] 

Zoltán Zvara edited comment on SPARK-10390 at 9/1/15 11:03 AM:
---

I did not pack Guava with my app, this is a clean Spark build in terms of 
dependencies, built with:

{{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}}


was (Author: ehnalis):
I did not packed Guava with my app, this is a clean Spark build in terms of 
dependencies, built with:

{{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}}

> Py4JJavaError java.lang.NoSuchMethodError: 
> com.google.common.base.Stopwatch.elapsedMillis()J
> 
>
> Key: SPARK-10390
> URL: https://issues.apache.org/jira/browse/SPARK-10390
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Zoltán Zvara
>
> While running PySpark through iPython.
> {code}
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : java.lang.NoSuchMethodError: 
> com.google.common.base.Stopwatch.elapsedMillis()J
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
>   at 
> org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
>   at 
> org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> {{spark-env.sh}}
> {code}
> export IPYTHON=1
> export PYSPARK_PYTHON=/usr/bin/python3
> export PYSPARK_DRIVER_PYTHON=ipython3
> export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
> {code}
> Spark built with:
> {{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}}
> Not a problem, when built against {{Hadoop 2.4}}!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J

2015-09-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725199#comment-14725199
 ] 

Sean Owen commented on SPARK-10390:
---

It definitely means you somehow have a later version of Guava in your 
deployment than Spark or Hadoop expects: the version on your classpath doesn't 
contain a method that the older one does. Try the Maven build to narrow it 
down? It's the build of reference, not SBT.
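
(For reference, the Maven equivalent of the SBT command quoted above should be 
roughly {{build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package}}; 
adjust the flags to your setup.)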

> Py4JJavaError java.lang.NoSuchMethodError: 
> com.google.common.base.Stopwatch.elapsedMillis()J
> 
>
> Key: SPARK-10390
> URL: https://issues.apache.org/jira/browse/SPARK-10390
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Zoltán Zvara
>
> While running PySpark through iPython.
> {code}
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : java.lang.NoSuchMethodError: 
> com.google.common.base.Stopwatch.elapsedMillis()J
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
>   at 
> org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
>   at 
> org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> {{spark-env.sh}}
> {code}
> export IPYTHON=1
> export PYSPARK_PYTHON=/usr/bin/python3
> export PYSPARK_DRIVER_PYTHON=ipython3
> export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
> {code}
> Spark built with:
> {{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}}
> Not a problem, when built against {{Hadoop 2.4}}!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save

2015-09-01 Thread Vinod KC (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725211#comment-14725211
 ] 

Vinod KC commented on SPARK-10199:
--

[~mengxr]
I've measured the overhead of reflection in the save/load operations; please 
see the results at this link:
https://github.com/vinodkc/xtique/blob/master/overhead_duetoReflection.csv

I've also measured the performance gain of the save/load methods without 
reflection, averaged over 5 test executions. Please see the performance gain 
percentages at these two links:
https://github.com/vinodkc/xtique/blob/master/performance_Benchmark_save.csv
https://github.com/vinodkc/xtique/blob/master/performance_Benchmark_load.csv
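
For reference, a minimal sketch of the explicit-schema approach suggested in the 
description (shown in PySpark for brevity, assuming the standard {{sc}} and 
{{sqlContext}} from a pyspark shell; the column names and output path are purely 
illustrative, not the actual MLlib save code):

{code}
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

# Build the schema by hand instead of letting createDataFrame infer it from
# tuples/case classes via runtime reflection.
schema = StructType([
    StructField("treeId", IntegerType(), nullable=False),
    StructField("nodeId", IntegerType(), nullable=False),
    StructField("predict", DoubleType(), nullable=False),
])

dataRDD = sc.parallelize([(0, 1, 0.5), (0, 2, 1.0)])  # toy rows, illustration only
df = sqlContext.createDataFrame(dataRDD, schema)      # no schema inference needed
df.write.parquet("/tmp/model-data")                   # hypothetical output path
{code}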


> Avoid using reflections for parquet model save
> --
>
> Key: SPARK-10199
> URL: https://issues.apache.org/jira/browse/SPARK-10199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Feynman Liang
>Priority: Minor
>
> These items are not high priority since the overhead of writing to Parquet is 
> much greater than that of runtime reflection.
> Multiple model save/load in MLlib use case classes to infer a schema for the 
> data frame saved to Parquet. However, inferring a schema from case classes or 
> tuples uses [runtime 
> reflection|https://github.com/apache/spark/blob/d7b4c095271c36fcc7f9ded267ecf5ec66fac803/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L361]
>  which is unnecessary since the types are already known at the time `save` is 
> called.
> It would be better to just specify the schema for the data frame directly 
> using {{sqlContext.createDataFrame(dataRDD, schema)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10396) spark-sql ctrl+c does not exit

2015-09-01 Thread linbao111 (JIRA)
linbao111 created SPARK-10396:
-

 Summary: spark-sql ctrl+c does not exit
 Key: SPARK-10396
 URL: https://issues.apache.org/jira/browse/SPARK-10396
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
Reporter: linbao111


If you type "ctrl+c", the spark-sql process exits (in yarn-client mode), but you 
can still see the Spark job in the cluster job browser, which redirects to the 
driver host's port-4040 Spark UI service.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10396) spark-sql ctrl+c does not exit

2015-09-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10396.
---
Resolution: Duplicate

It's helpful if you search JIRA first; it's easy to find several existing 
issues on the same topic. Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before 
opening a JIRA.

> spark-sql ctrl+c does not exit
> --
>
> Key: SPARK-10396
> URL: https://issues.apache.org/jira/browse/SPARK-10396
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: linbao111
>
> If you type "ctrl+c", the spark-sql process exits (in yarn-client mode), but 
> you can still see the Spark job in the cluster job browser, which redirects to 
> the driver host's port-4040 Spark UI service.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2015-09-01 Thread Rajeev Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725284#comment-14725284
 ] 

Rajeev Reddy commented on SPARK-5226:
-

Hello Aliaksei Litouka, I have looked into your implementation. You are taking 
coordinate points (i.e. doubles) as input for clustering; can you please tell me 
how I can extend this to cluster a set of text documents?
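
Not a complete answer for DBSCAN specifically, but a common approach is to turn 
each document into a TF-IDF feature vector first and then hand those vectors to 
whatever clustering implementation you use. A minimal sketch (assuming a pyspark 
shell with {{sc}} in scope, whitespace-tokenized documents, and a hypothetical 
input path):

{code}
from pyspark.mllib.feature import HashingTF, IDF

# RDD[list[str]]: one pre-tokenized document per element
docs = sc.textFile("hdfs:///path/to/docs").map(lambda line: line.split(" "))

tf = HashingTF(numFeatures=1 << 18).transform(docs)  # hashed term-frequency vectors
tf.cache()
tfidf = IDF().fit(tf).transform(tf)                  # reweight by inverse document frequency

# tfidf is an RDD of Vectors; any clustering that accepts feature vectors
# (e.g. pyspark.mllib.clustering.KMeans.train(tfidf, 10), or a DBSCAN
# implementation with a vector-based distance) can consume it from here.
{code}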

> Add DBSCAN Clustering Algorithm to MLlib
> 
>
> Key: SPARK-5226
> URL: https://issues.apache.org/jira/browse/SPARK-5226
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Muhammad-Ali A'rabi
>Priority: Minor
>  Labels: DBSCAN, clustering
>
> MLlib is all k-means now, and I think we should add some new clustering 
> algorithms to it. My first candidate is DBSCAN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10392) Pyspark - Wrong DateType support on JDBC connection

2015-09-01 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-10392:
---
Summary: Pyspark - Wrong DateType support on JDBC connection  (was: Pyspark 
- Wrong DateType support)

> Pyspark - Wrong DateType support on JDBC connection
> ---
>
> Key: SPARK-10392
> URL: https://issues.apache.org/jira/browse/SPARK-10392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.1
>Reporter: Maciej Bryński
>
> I have following problem.
> I created table.
> {code}
> CREATE TABLE `spark_test` (
>   `id` INT(11) NULL,
>   `date` DATE NULL
> )
> COLLATE='utf8_general_ci'
> ENGINE=InnoDB
> ;
> INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01');
> {code}
> Then I'm trying to read data - date '1970-01-01' is converted to int. This 
> makes data frame incompatible with its own schema.
> {code}
> df = 
> sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 
> 'spark_test')
> print(df.collect())
> df = sqlCtx.createDataFrame(df.rdd, df.schema)
> [Row(id=1, date=0)]
> ---
> TypeError Traceback (most recent call last)
>  in ()
>   1 df = 
> sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4",
>  'spark_test')
>   2 print(df.collect())
> > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, 
> schema, samplingRatio)
> 402 
> 403 if isinstance(data, RDD):
> --> 404 rdd, schema = self._createFromRDD(data, schema, 
> samplingRatio)
> 405 else:
> 406 rdd, schema = self._createFromLocal(data, schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, 
> schema, samplingRatio)
> 296 rows = rdd.take(10)
> 297 for row in rows:
> --> 298 _verify_type(row, schema)
> 299 
> 300 else:
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1152  "length of fields (%d)" % (len(obj), 
> len(dataType.fields)))
>1153 for v, f in zip(obj, dataType.fields):
> -> 1154 _verify_type(v, f.dataType)
>1155 
>1156 
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1136 # subclass of them can not be fromInternald in JVM
>1137 if type(obj) not in _acceptable_types[_type]:
> -> 1138 raise TypeError("%s can not accept object in type %s" % 
> (dataType, type(obj)))
>1139 
>1140 if isinstance(dataType, ArrayType):
> TypeError: DateType can not accept object in type 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10261) Add @Since annotation to ml.evaluation

2015-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725300#comment-14725300
 ] 

Apache Spark commented on SPARK-10261:
--

User 'tijoparacka' has created a pull request for this issue:
https://github.com/apache/spark/pull/8554

> Add @Since annotation to ml.evaluation
> --
>
> Key: SPARK-10261
> URL: https://issues.apache.org/jira/browse/SPARK-10261
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10261) Add @Since annotation to ml.evaluation

2015-09-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10261:


Assignee: (was: Apache Spark)

> Add @Since annotation to ml.evaluation
> --
>
> Key: SPARK-10261
> URL: https://issues.apache.org/jira/browse/SPARK-10261
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10261) Add @Since annotation to ml.evaluation

2015-09-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10261:


Assignee: Apache Spark

> Add @Since annotation to ml.evaluation
> --
>
> Key: SPARK-10261
> URL: https://issues.apache.org/jira/browse/SPARK-10261
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10162) PySpark filters with datetimes mess up when datetimes have timezones.

2015-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725329#comment-14725329
 ] 

Apache Spark commented on SPARK-10162:
--

User '0x0FFF' has created a pull request for this issue:
https://github.com/apache/spark/pull/8555

> PySpark filters with datetimes mess up when datetimes have timezones.
> -
>
> Key: SPARK-10162
> URL: https://issues.apache.org/jira/browse/SPARK-10162
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Kevin Cox
>
> PySpark appears to ignore timezone information when filtering on (and working 
> in general with) datetimes.
> Please see the example below. The generated filter in the query plan is 5 
> hours off (my computer is EST).
> {code}
> In [1]: df = sc.sql.createDataFrame([], StructType([StructField("dt", 
> TimestampType())]))
> In [2]: df.filter(df.dt > datetime(2000, 01, 01, tzinfo=UTC)).explain()
> Filter (dt#9 > 9467028)
>  Scan PhysicalRDD[dt#9]
> {code}
> Note that 9467028 == Sat  1 Jan 2000 05:00:00 UTC



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10397) Make Python's SparkContext self-descriptive on "print sc"

2015-09-01 Thread Sergey Tryuber (JIRA)
Sergey Tryuber created SPARK-10397:
--

 Summary: Make Python's SparkContext self-descriptive on "print sc"
 Key: SPARK-10397
 URL: https://issues.apache.org/jira/browse/SPARK-10397
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.4.0
Reporter: Sergey Tryuber
Priority: Trivial


When I execute the following in the Python shell:
{code}
print sc
{code}
I receive something like:
{noformat}
<pyspark.context.SparkContext object at 0x...>
{noformat}
This is very inconvenient, especially if a user wants to create a good-looking, 
self-descriptive IPython Notebook and would like to see some information about 
the Spark cluster.

In contrast, the H2O context does have this feature and it is very helpful.
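
Until something like that is built in, here is a rough sketch of the kind of 
information a richer {{__repr__}} could expose (the attributes used below are 
existing public SparkContext fields; the helper name and formatting are just an 
illustration, not a proposed patch):

{code}
def describe_context(sc):
    # master, appName and version are existing SparkContext attributes
    return "<SparkContext master=%s appName=%s version=%s>" % (
        sc.master, sc.appName, sc.version)

print(describe_context(sc))
# e.g. <SparkContext master=yarn-client appName=PySparkShell version=1.4.0>
{code}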



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10262) Add @Since annotation to ml.attribute

2015-09-01 Thread Tijo Thomas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725388#comment-14725388
 ] 

Tijo Thomas commented on SPARK-10262:
-

I am working on this.

> Add @Since annotation to ml.attribute
> -
>
> Key: SPARK-10262
> URL: https://issues.apache.org/jira/browse/SPARK-10262
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10392) Pyspark - Wrong DateType support on JDBC connection

2015-09-01 Thread Alexey Grishchenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725390#comment-14725390
 ] 

Alexey Grishchenko commented on SPARK-10392:


This is a corner case in the {{DateType.fromInternal}} implementation:
{code}
>>> from pyspark.sql.types import *
>>> a = DateType()
>>> a.fromInternal(0)
0
>>> a.fromInternal(1)
datetime.date(1970, 1, 2)
{code}

> Pyspark - Wrong DateType support on JDBC connection
> ---
>
> Key: SPARK-10392
> URL: https://issues.apache.org/jira/browse/SPARK-10392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.1
>Reporter: Maciej Bryński
>
> I have following problem.
> I created table.
> {code}
> CREATE TABLE `spark_test` (
>   `id` INT(11) NULL,
>   `date` DATE NULL
> )
> COLLATE='utf8_general_ci'
> ENGINE=InnoDB
> ;
> INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01');
> {code}
> Then I'm trying to read data - date '1970-01-01' is converted to int. This 
> makes data frame incompatible with its own schema.
> {code}
> df = 
> sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 
> 'spark_test')
> print(df.collect())
> df = sqlCtx.createDataFrame(df.rdd, df.schema)
> [Row(id=1, date=0)]
> ---
> TypeError Traceback (most recent call last)
>  in ()
>   1 df = 
> sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4",
>  'spark_test')
>   2 print(df.collect())
> > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, 
> schema, samplingRatio)
> 402 
> 403 if isinstance(data, RDD):
> --> 404 rdd, schema = self._createFromRDD(data, schema, 
> samplingRatio)
> 405 else:
> 406 rdd, schema = self._createFromLocal(data, schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, 
> schema, samplingRatio)
> 296 rows = rdd.take(10)
> 297 for row in rows:
> --> 298 _verify_type(row, schema)
> 299 
> 300 else:
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1152  "length of fields (%d)" % (len(obj), 
> len(dataType.fields)))
>1153 for v, f in zip(obj, dataType.fields):
> -> 1154 _verify_type(v, f.dataType)
>1155 
>1156 
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1136 # subclass of them can not be fromInternald in JVM
>1137 if type(obj) not in _acceptable_types[_type]:
> -> 1138 raise TypeError("%s can not accept object in type %s" % 
> (dataType, type(obj)))
>1139 
>1140 if isinstance(dataType, ArrayType):
> TypeError: DateType can not accept object in type 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10392) Pyspark - Wrong DateType support on JDBC connection

2015-09-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10392:


Assignee: Apache Spark

> Pyspark - Wrong DateType support on JDBC connection
> ---
>
> Key: SPARK-10392
> URL: https://issues.apache.org/jira/browse/SPARK-10392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.1
>Reporter: Maciej Bryński
>Assignee: Apache Spark
>
> I have following problem.
> I created table.
> {code}
> CREATE TABLE `spark_test` (
>   `id` INT(11) NULL,
>   `date` DATE NULL
> )
> COLLATE='utf8_general_ci'
> ENGINE=InnoDB
> ;
> INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01');
> {code}
> Then I'm trying to read data - date '1970-01-01' is converted to int. This 
> makes data frame incompatible with its own schema.
> {code}
> df = 
> sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 
> 'spark_test')
> print(df.collect())
> df = sqlCtx.createDataFrame(df.rdd, df.schema)
> [Row(id=1, date=0)]
> ---
> TypeError Traceback (most recent call last)
>  in ()
>   1 df = 
> sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4",
>  'spark_test')
>   2 print(df.collect())
> > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, 
> schema, samplingRatio)
> 402 
> 403 if isinstance(data, RDD):
> --> 404 rdd, schema = self._createFromRDD(data, schema, 
> samplingRatio)
> 405 else:
> 406 rdd, schema = self._createFromLocal(data, schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, 
> schema, samplingRatio)
> 296 rows = rdd.take(10)
> 297 for row in rows:
> --> 298 _verify_type(row, schema)
> 299 
> 300 else:
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1152  "length of fields (%d)" % (len(obj), 
> len(dataType.fields)))
>1153 for v, f in zip(obj, dataType.fields):
> -> 1154 _verify_type(v, f.dataType)
>1155 
>1156 
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1136 # subclass of them can not be fromInternald in JVM
>1137 if type(obj) not in _acceptable_types[_type]:
> -> 1138 raise TypeError("%s can not accept object in type %s" % 
> (dataType, type(obj)))
>1139 
>1140 if isinstance(dataType, ArrayType):
> TypeError: DateType can not accept object in type 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10392) Pyspark - Wrong DateType support on JDBC connection

2015-09-01 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725404#comment-14725404
 ] 

Maciej Bryński commented on SPARK-10392:


{code}
class DateType(AtomicType):
"""Date (datetime.date) data type.
"""

def fromInternal(self, v):
*return v* and datetime.date.fromordinal(v + self.EPOCH_ORDINAL)

{code}

Yep,
With v = 0 there is no conversion to date.
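
A minimal sketch of a possible fix, guarding explicitly against {{None}} instead 
of relying on the truthiness of {{v}} (an illustration only, not the committed 
patch):

{code}
def fromInternal(self, v):
    # day 0 (1970-01-01) is falsy, so "return v and ..." silently returns 0 for it
    if v is None:
        return None
    return datetime.date.fromordinal(v + self.EPOCH_ORDINAL)
{code}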

> Pyspark - Wrong DateType support on JDBC connection
> ---
>
> Key: SPARK-10392
> URL: https://issues.apache.org/jira/browse/SPARK-10392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.1
>Reporter: Maciej Bryński
>
> I have following problem.
> I created table.
> {code}
> CREATE TABLE `spark_test` (
>   `id` INT(11) NULL,
>   `date` DATE NULL
> )
> COLLATE='utf8_general_ci'
> ENGINE=InnoDB
> ;
> INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01');
> {code}
> Then I'm trying to read data - date '1970-01-01' is converted to int. This 
> makes data frame incompatible with its own schema.
> {code}
> df = 
> sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 
> 'spark_test')
> print(df.collect())
> df = sqlCtx.createDataFrame(df.rdd, df.schema)
> [Row(id=1, date=0)]
> ---
> TypeError Traceback (most recent call last)
>  in ()
>   1 df = 
> sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4",
>  'spark_test')
>   2 print(df.collect())
> > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, 
> schema, samplingRatio)
> 402 
> 403 if isinstance(data, RDD):
> --> 404 rdd, schema = self._createFromRDD(data, schema, 
> samplingRatio)
> 405 else:
> 406 rdd, schema = self._createFromLocal(data, schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, 
> schema, samplingRatio)
> 296 rows = rdd.take(10)
> 297 for row in rows:
> --> 298 _verify_type(row, schema)
> 299 
> 300 else:
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1152  "length of fields (%d)" % (len(obj), 
> len(dataType.fields)))
>1153 for v, f in zip(obj, dataType.fields):
> -> 1154 _verify_type(v, f.dataType)
>1155 
>1156 
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1136 # subclass of them can not be fromInternald in JVM
>1137 if type(obj) not in _acceptable_types[_type]:
> -> 1138 raise TypeError("%s can not accept object in type %s" % 
> (dataType, type(obj)))
>1139 
>1140 if isinstance(dataType, ArrayType):
> TypeError: DateType can not accept object in type 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10392) Pyspark - Wrong DateType support on JDBC connection

2015-09-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10392:


Assignee: (was: Apache Spark)

> Pyspark - Wrong DateType support on JDBC connection
> ---
>
> Key: SPARK-10392
> URL: https://issues.apache.org/jira/browse/SPARK-10392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.1
>Reporter: Maciej Bryński
>
> I have following problem.
> I created table.
> {code}
> CREATE TABLE `spark_test` (
>   `id` INT(11) NULL,
>   `date` DATE NULL
> )
> COLLATE='utf8_general_ci'
> ENGINE=InnoDB
> ;
> INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01');
> {code}
> Then I'm trying to read data - date '1970-01-01' is converted to int. This 
> makes data frame incompatible with its own schema.
> {code}
> df = 
> sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 
> 'spark_test')
> print(df.collect())
> df = sqlCtx.createDataFrame(df.rdd, df.schema)
> [Row(id=1, date=0)]
> ---
> TypeError Traceback (most recent call last)
>  in ()
>   1 df = 
> sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4",
>  'spark_test')
>   2 print(df.collect())
> > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, 
> schema, samplingRatio)
> 402 
> 403 if isinstance(data, RDD):
> --> 404 rdd, schema = self._createFromRDD(data, schema, 
> samplingRatio)
> 405 else:
> 406 rdd, schema = self._createFromLocal(data, schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, 
> schema, samplingRatio)
> 296 rows = rdd.take(10)
> 297 for row in rows:
> --> 298 _verify_type(row, schema)
> 299 
> 300 else:
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1152  "length of fields (%d)" % (len(obj), 
> len(dataType.fields)))
>1153 for v, f in zip(obj, dataType.fields):
> -> 1154 _verify_type(v, f.dataType)
>1155 
>1156 
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1136 # subclass of them can not be fromInternald in JVM
>1137 if type(obj) not in _acceptable_types[_type]:
> -> 1138 raise TypeError("%s can not accept object in type %s" % 
> (dataType, type(obj)))
>1139 
>1140 if isinstance(dataType, ArrayType):
> TypeError: DateType can not accept object in type 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10392) Pyspark - Wrong DateType support on JDBC connection

2015-09-01 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725404#comment-14725404
 ] 

Maciej Bryński edited comment on SPARK-10392 at 9/1/15 1:38 PM:


{code}
class DateType(AtomicType):
"""Date (datetime.date) data type.
"""

def fromInternal(self, v):
return v and datetime.date.fromordinal(v + self.EPOCH_ORDINAL)

{code}

Yep, with v = 0 (i.e. 1970-01-01) there is no conversion to a date: 0 is falsy, so 
{{v and ...}} short-circuits and returns 0 instead of calling fromordinal.
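
For illustration, a minimal standalone sketch of the falsy-zero behaviour and one possible fix that checks explicitly for None (the names here are local stand-ins; the actual patch in the pull request may differ):

{code}
import datetime

EPOCH_ORDINAL = datetime.datetime(1970, 1, 1).toordinal()

def from_internal_buggy(v):
    # 0 (i.e. 1970-01-01) is falsy, so `v and ...` short-circuits and returns 0
    return v and datetime.date.fromordinal(v + EPOCH_ORDINAL)

def from_internal_fixed(v):
    # only skip the conversion when v is None, not when it is 0
    return None if v is None else datetime.date.fromordinal(v + EPOCH_ORDINAL)

print(from_internal_buggy(0))   # 0 -- an int, which the DateType schema then rejects
print(from_internal_fixed(0))   # 1970-01-01
{code}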


was (Author: maver1ck):
{code}
class DateType(AtomicType):
"""Date (datetime.date) data type.
"""

def fromInternal(self, v):
*return v* and datetime.date.fromordinal(v + self.EPOCH_ORDINAL)

{code}

Yep,
With v = 0 there is no conversion to date.

> Pyspark - Wrong DateType support on JDBC connection
> ---
>
> Key: SPARK-10392
> URL: https://issues.apache.org/jira/browse/SPARK-10392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.1
>Reporter: Maciej Bryński
>
> I have the following problem.
> I created a table.
> {code}
> CREATE TABLE `spark_test` (
>   `id` INT(11) NULL,
>   `date` DATE NULL
> )
> COLLATE='utf8_general_ci'
> ENGINE=InnoDB
> ;
> INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01');
> {code}
> Then I try to read the data: the date '1970-01-01' is converted to an int, which 
> makes the data frame incompatible with its own schema.
> {code}
> df = 
> sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 
> 'spark_test')
> print(df.collect())
> df = sqlCtx.createDataFrame(df.rdd, df.schema)
> [Row(id=1, date=0)]
> ---
> TypeError Traceback (most recent call last)
>  in ()
>   1 df = 
> sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4",
>  'spark_test')
>   2 print(df.collect())
> > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, 
> schema, samplingRatio)
> 402 
> 403 if isinstance(data, RDD):
> --> 404 rdd, schema = self._createFromRDD(data, schema, 
> samplingRatio)
> 405 else:
> 406 rdd, schema = self._createFromLocal(data, schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, 
> schema, samplingRatio)
> 296 rows = rdd.take(10)
> 297 for row in rows:
> --> 298 _verify_type(row, schema)
> 299 
> 300 else:
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1152  "length of fields (%d)" % (len(obj), 
> len(dataType.fields)))
>1153 for v, f in zip(obj, dataType.fields):
> -> 1154 _verify_type(v, f.dataType)
>1155 
>1156 
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1136 # subclass of them can not be fromInternald in JVM
>1137 if type(obj) not in _acceptable_types[_type]:
> -> 1138 raise TypeError("%s can not accept object in type %s" % 
> (dataType, type(obj)))
>1139 
>1140 if isinstance(dataType, ArrayType):
> TypeError: DateType can not accept object in type 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10392) Pyspark - Wrong DateType support on JDBC connection

2015-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725403#comment-14725403
 ] 

Apache Spark commented on SPARK-10392:
--

User '0x0FFF' has created a pull request for this issue:
https://github.com/apache/spark/pull/8556

> Pyspark - Wrong DateType support on JDBC connection
> ---
>
> Key: SPARK-10392
> URL: https://issues.apache.org/jira/browse/SPARK-10392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.1
>Reporter: Maciej Bryński
>
> I have the following problem.
> I created a table.
> {code}
> CREATE TABLE `spark_test` (
>   `id` INT(11) NULL,
>   `date` DATE NULL
> )
> COLLATE='utf8_general_ci'
> ENGINE=InnoDB
> ;
> INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01');
> {code}
> Then I try to read the data: the date '1970-01-01' is converted to an int, which 
> makes the data frame incompatible with its own schema.
> {code}
> df = 
> sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 
> 'spark_test')
> print(df.collect())
> df = sqlCtx.createDataFrame(df.rdd, df.schema)
> [Row(id=1, date=0)]
> ---
> TypeError Traceback (most recent call last)
>  in ()
>   1 df = 
> sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4",
>  'spark_test')
>   2 print(df.collect())
> > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, 
> schema, samplingRatio)
> 402 
> 403 if isinstance(data, RDD):
> --> 404 rdd, schema = self._createFromRDD(data, schema, 
> samplingRatio)
> 405 else:
> 406 rdd, schema = self._createFromLocal(data, schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, 
> schema, samplingRatio)
> 296 rows = rdd.take(10)
> 297 for row in rows:
> --> 298 _verify_type(row, schema)
> 299 
> 300 else:
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1152  "length of fields (%d)" % (len(obj), 
> len(dataType.fields)))
>1153 for v, f in zip(obj, dataType.fields):
> -> 1154 _verify_type(v, f.dataType)
>1155 
>1156 
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1136 # subclass of them can not be fromInternald in JVM
>1137 if type(obj) not in _acceptable_types[_type]:
> -> 1138 raise TypeError("%s can not accept object in type %s" % 
> (dataType, type(obj)))
>1139 
>1140 if isinstance(dataType, ArrayType):
> TypeError: DateType can not accept object in type 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-10392) Pyspark - Wrong DateType support on JDBC connection

2015-09-01 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-10392:
---
Comment: was deleted

(was: {code}
class DateType(AtomicType):
"""Date (datetime.date) data type.
"""

def fromInternal(self, v):
return v and datetime.date.fromordinal(v + self.EPOCH_ORDINAL)

{code}

Yep,
With v = 0 there is no conversion to date.)

> Pyspark - Wrong DateType support on JDBC connection
> ---
>
> Key: SPARK-10392
> URL: https://issues.apache.org/jira/browse/SPARK-10392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.1
>Reporter: Maciej Bryński
>
> I have the following problem.
> I created a table.
> {code}
> CREATE TABLE `spark_test` (
>   `id` INT(11) NULL,
>   `date` DATE NULL
> )
> COLLATE='utf8_general_ci'
> ENGINE=InnoDB
> ;
> INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01');
> {code}
> Then I try to read the data: the date '1970-01-01' is converted to an int, which 
> makes the data frame incompatible with its own schema.
> {code}
> df = 
> sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 
> 'spark_test')
> print(df.collect())
> df = sqlCtx.createDataFrame(df.rdd, df.schema)
> [Row(id=1, date=0)]
> ---
> TypeError Traceback (most recent call last)
>  in ()
>   1 df = 
> sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4",
>  'spark_test')
>   2 print(df.collect())
> > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, 
> schema, samplingRatio)
> 402 
> 403 if isinstance(data, RDD):
> --> 404 rdd, schema = self._createFromRDD(data, schema, 
> samplingRatio)
> 405 else:
> 406 rdd, schema = self._createFromLocal(data, schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, 
> schema, samplingRatio)
> 296 rows = rdd.take(10)
> 297 for row in rows:
> --> 298 _verify_type(row, schema)
> 299 
> 300 else:
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1152  "length of fields (%d)" % (len(obj), 
> len(dataType.fields)))
>1153 for v, f in zip(obj, dataType.fields):
> -> 1154 _verify_type(v, f.dataType)
>1155 
>1156 
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1136 # subclass of them can not be fromInternald in JVM
>1137 if type(obj) not in _acceptable_types[_type]:
> -> 1138 raise TypeError("%s can not accept object in type %s" % 
> (dataType, type(obj)))
>1139 
>1140 if isinstance(dataType, ArrayType):
> TypeError: DateType can not accept object in type 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10398) Migrate Spark download page to use new lua mirroring scripts

2015-09-01 Thread Luciano Resende (JIRA)
Luciano Resende created SPARK-10398:
---

 Summary: Migrate Spark download page to use new lua mirroring 
scripts
 Key: SPARK-10398
 URL: https://issues.apache.org/jira/browse/SPARK-10398
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Reporter: Luciano Resende


From the infra team:

If you refer to www.apache.org/dyn/closer.cgi, please refer to
www.apache.org/dyn/closer.lua instead from now on.

Any non-conforming CGI scripts are no longer enabled, and are all
rewritten to go to our new mirror system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10398) Migrate Spark download page to use new lua mirroring scripts

2015-09-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10398:
--
  Assignee: Sean Owen
  Priority: Minor  (was: Major)
Issue Type: Task  (was: Bug)

No problem, pushing the change now.

> Migrate Spark download page to use new lua mirroring scripts
> 
>
> Key: SPARK-10398
> URL: https://issues.apache.org/jira/browse/SPARK-10398
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Reporter: Luciano Resende
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.5.0
>
>
> From infra team :
> If you refer to www.apache.org/dyn/closer.cgi, please refer to
> www.apache.org/dyn/closer.lua instead from now on.
> Any non-conforming CGI scripts are no longer enabled, and are all
> rewritten to go to our new mirror system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10398) Migrate Spark download page to use new lua mirroring scripts

2015-09-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10398.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

> Migrate Spark download page to use new lua mirroring scripts
> 
>
> Key: SPARK-10398
> URL: https://issues.apache.org/jira/browse/SPARK-10398
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Reporter: Luciano Resende
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.5.0
>
>
> From infra team :
> If you refer to www.apache.org/dyn/closer.cgi, please refer to
> www.apache.org/dyn/closer.lua instead from now on.
> Any non-conforming CGI scripts are no longer enabled, and are all
> rewritten to go to our new mirror system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10199) Avoid using reflections for parquet model save

2015-09-01 Thread Vinod KC (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725211#comment-14725211
 ] 

Vinod KC edited comment on SPARK-10199 at 9/1/15 2:15 PM:
--

[~mengxr]
I've measured the overhead of reflection in the save/load operations; please see 
the results at this link:
https://github.com/vinodkc/xtique/blob/master/overhead_duetoReflection.csv

I've also measured the performance gain in the save/load methods without reflection, 
averaged over 5 test runs.
Please see the performance gain (%) at these two links:
https://github.com/vinodkc/xtique/blob/master/performance_Benchmark_save.csv
https://github.com/vinodkc/xtique/blob/master/performance_Benchmark_load.csv



was (Author: vinodkc):
[~mengxr]
I've measured the overhead of reflexion in save/load operation, please refer 
the results in this link
https://github.com/vinodkc/xtique/blob/master/overhead_duetoReflection.csv

Also I've measured the performance gain in save/load methods without reflexion 
after taking  average of 5  times test executions
Please refer the performance gain % in this two links
https://github.com/vinodkc/xtique/blob/master/performance_Benchmark_save.csv
https://github.com/vinodkc/xtique/blob/master/performance_Benchmark_load.csv


> Avoid using reflections for parquet model save
> --
>
> Key: SPARK-10199
> URL: https://issues.apache.org/jira/browse/SPARK-10199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Feynman Liang
>Priority: Minor
>
> These items are not high priority since the overhead of writing to Parquet is 
> much greater than that of runtime reflection.
> Multiple model save/load in MLlib use case classes to infer a schema for the 
> data frame saved to Parquet. However, inferring a schema from case classes or 
> tuples uses [runtime 
> reflection|https://github.com/apache/spark/blob/d7b4c095271c36fcc7f9ded267ecf5ec66fac803/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L361]
>  which is unnecessary since the types are already known at the time `save` is 
> called.
> It would be better to just specify the schema for the data frame directly 
> using {{sqlContext.createDataFrame(dataRDD, schema)}}
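
As a rough illustration of the difference, sketched in PySpark for brevity (the actual MLlib save() code is Scala, and the field names below are made up):

{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

sc = SparkContext(appName="schema-example")
sqlContext = SQLContext(sc)

# Hypothetical model data: (treeId, prediction) pairs
dataRDD = sc.parallelize([(0, 1.0), (1, 0.0)])

# Schema inferred by sampling the RDD -- the PySpark analogue of the
# case-class reflection the Scala save() code currently relies on:
dfInferred = sqlContext.createDataFrame(dataRDD, ["treeId", "prediction"])

# Schema specified explicitly -- nothing to infer, since the types are
# already known at the time save() is called:
schema = StructType([
    StructField("treeId", IntegerType(), nullable=False),
    StructField("prediction", DoubleType(), nullable=False),
])
dfExplicit = sqlContext.createDataFrame(dataRDD, schema)
dfExplicit.write.parquet("/tmp/model-data")  # output path is only a placeholder
{code}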



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)

2015-09-01 Thread Paul Weiss (JIRA)
Paul Weiss created SPARK-10399:
--

 Summary: Off Heap Memory Access for non-JVM libraries (C++)
 Key: SPARK-10399
 URL: https://issues.apache.org/jira/browse/SPARK-10399
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Paul Weiss


*Summary*
Provide direct off-heap memory access to an external non-JVM program, such as a 
C++ library, running within the Spark JVM/executor.  As Spark moves toward storing 
all data in off-heap memory, it makes sense to provide access points to that 
memory for non-JVM programs.


*Assumptions*
* Zero copies will be made during the call into the non-JVM library
* Access into non-JVM libraries will be accomplished via JNI
* A generic JNI interface will be created so that developers will not need to 
deal with raw JNI calls
* C++ will be the initial target non-JVM use case
* Memory management will remain on the JVM/Spark side
* The API from C++ will be similar to DataFrames as much as feasible and will NOT 
require expert knowledge of JNI
* Data organization and layout will support complex (multi-type, nested, etc.) 
types


*Design*
* Initially Spark JVM -> non-JVM will be supported 
* Creating an embedded JVM with Spark running from a non-JVM program is 
initially out of scope


*Technical*
* GetDirectBufferAddress is the JNI call used to access a byte buffer without copying it



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10375) Setting the driver memory with SparkConf().set("spark.driver.memory","1g") does not work

2015-09-01 Thread Alex Rovner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725473#comment-14725473
 ] 

Alex Rovner commented on SPARK-10375:
-

May I suggest throwing an exception when certain properties are set that will 
not take effect? (spark.driver.*)

> Setting the driver memory with SparkConf().set("spark.driver.memory","1g") 
> does not work
> 
>
> Key: SPARK-10375
> URL: https://issues.apache.org/jira/browse/SPARK-10375
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.3.0
> Environment: Running with yarn
>Reporter: Thomas
>Priority: Minor
>
> When running pyspark 1.3.0 with yarn, the following code has no effect:
> pyspark.SparkConf().set("spark.driver.memory","1g")
> The Environment tab in YARN shows that the driver has 1g; however, the 
> Executors tab only shows 512 MB (the default value) for the driver memory.  
> This issue goes away when the driver memory is specified on the command line 
> (i.e. --driver-memory 1g).
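
For reference, a small sketch of the behaviour and the usual workaround (YARN client mode assumed; the driver JVM's heap size is fixed when the JVM is launched, which happens before a SparkConf value set from Python can take effect):

{code}
from pyspark import SparkConf, SparkContext

# Has no effect on the driver's heap: the value is recorded in the conf (and shows
# up in the Environment tab), but the driver JVM was already started with the
# default memory.
conf = SparkConf().set("spark.driver.memory", "1g")
sc = SparkContext(conf=conf)

# Workarounds: pass the value on the command line instead, e.g.
#   spark-submit --driver-memory 1g my_app.py
# or set `spark.driver.memory 1g` in conf/spark-defaults.conf before launching.
{code}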



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10375) Setting the driver memory with SparkConf().set("spark.driver.memory","1g") does not work

2015-09-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10375:
--
Issue Type: Improvement  (was: Bug)

> Setting the driver memory with SparkConf().set("spark.driver.memory","1g") 
> does not work
> 
>
> Key: SPARK-10375
> URL: https://issues.apache.org/jira/browse/SPARK-10375
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.3.0
> Environment: Running with yarn
>Reporter: Thomas
>Priority: Minor
>
> When running pyspark 1.3.0 with yarn, the following code has no effect:
> pyspark.SparkConf().set("spark.driver.memory","1g")
> The Environment tab in YARN shows that the driver has 1g; however, the 
> Executors tab only shows 512 MB (the default value) for the driver memory.  
> This issue goes away when the driver memory is specified on the command line 
> (i.e. --driver-memory 1g).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10314) [CORE]RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception when parallelism is big than data split size

2015-09-01 Thread Xiaoyu Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725530#comment-14725530
 ] 

Xiaoyu Wang commented on SPARK-10314:
-

Yes. Are there any questions about the pull request?
Do you need me to resubmit a pull request for the master branch?
The previous pull request was submitted against branch-1.4!

> [CORE]RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception 
> when parallelism is big than data split size
> 
>
> Key: SPARK-10314
> URL: https://issues.apache.org/jira/browse/SPARK-10314
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.4.1
> Environment: Spark 1.4.1,Hadoop 2.6.0,Tachyon 0.6.4
>Reporter: Xiaoyu Wang
>Priority: Minor
>
> RDD persist to OFF_HEAP (Tachyon) gets a "block rdd_x_x not found" exception when 
> the parallelism is bigger than the data split size.
> {code}
> val rdd = sc.parallelize(List(1, 2),2)
> rdd.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP)
> rdd.count()
> {code}
> is ok.
> {code}
> val rdd = sc.parallelize(List(1, 2),3)
> rdd.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP)
> rdd.count()
> {code}
> got an exception:
> {noformat}
> 15/08/27 17:53:07 INFO SparkContext: Starting job: count at :24
> 15/08/27 17:53:07 INFO DAGScheduler: Got job 0 (count at :24) with 3 
> output partitions (allowLocal=false)
> 15/08/27 17:53:07 INFO DAGScheduler: Final stage: ResultStage 0(count at 
> :24)
> 15/08/27 17:53:07 INFO DAGScheduler: Parents of final stage: List()
> 15/08/27 17:53:07 INFO DAGScheduler: Missing parents: List()
> 15/08/27 17:53:07 INFO DAGScheduler: Submitting ResultStage 0 
> (ParallelCollectionRDD[0] at parallelize at :21), which has no 
> missing parents
> 15/08/27 17:53:07 INFO MemoryStore: ensureFreeSpace(1096) called with 
> curMem=0, maxMem=741196431
> 15/08/27 17:53:07 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 1096.0 B, free 706.9 MB)
> 15/08/27 17:53:07 INFO MemoryStore: ensureFreeSpace(788) called with 
> curMem=1096, maxMem=741196431
> 15/08/27 17:53:07 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 788.0 B, free 706.9 MB)
> 15/08/27 17:53:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on localhost:43776 (size: 788.0 B, free: 706.9 MB)
> 15/08/27 17:53:07 INFO SparkContext: Created broadcast 0 from broadcast at 
> DAGScheduler.scala:874
> 15/08/27 17:53:07 INFO DAGScheduler: Submitting 3 missing tasks from 
> ResultStage 0 (ParallelCollectionRDD[0] at parallelize at :21)
> 15/08/27 17:53:07 INFO TaskSchedulerImpl: Adding task set 0.0 with 3 tasks
> 15/08/27 17:53:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
> localhost, PROCESS_LOCAL, 1269 bytes)
> 15/08/27 17:53:07 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
> localhost, PROCESS_LOCAL, 1270 bytes)
> 15/08/27 17:53:07 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 
> localhost, PROCESS_LOCAL, 1270 bytes)
> 15/08/27 17:53:07 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
> 15/08/27 17:53:07 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 15/08/27 17:53:07 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
> 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_2 not found, computing it
> 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_1 not found, computing it
> 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_0 not found, computing it
> 15/08/27 17:53:07 INFO ExternalBlockStore: ExternalBlockStore started
> 15/08/27 17:53:08 WARN : tachyon.home is not set. Using 
> /mnt/tachyon_default_home as the default value.
> 15/08/27 17:53:08 INFO : Tachyon client (version 0.6.4) is trying to connect 
> master @ localhost/127.0.0.1:19998
> 15/08/27 17:53:08 INFO : User registered at the master 
> localhost/127.0.0.1:19998 got UserId 109
> 15/08/27 17:53:08 INFO TachyonBlockManager: Created tachyon directory at 
> /spark/spark-c6ec419f-7c7d-48a6-8448-c2431e761ea5/driver/spark-tachyon-20150827175308-6aa5
> 15/08/27 17:53:08 INFO : Trying to get local worker host : localhost
> 15/08/27 17:53:08 INFO : Connecting local worker @ localhost/127.0.0.1:29998
> 15/08/27 17:53:08 INFO : Folder /mnt/ramdisk/tachyonworker/users/109 was 
> created!
> 15/08/27 17:53:08 INFO : /mnt/ramdisk/tachyonworker/users/109/4386235351040 
> was created!
> 15/08/27 17:53:08 INFO : /mnt/ramdisk/tachyonworker/users/109/4388382834688 
> was created!
> 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_0 on ExternalBlockStore 
> on localhost:43776 (size: 0.0 B)
> 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_1 on ExternalBlockStore 
> on localhost:43776 (size: 2.0 B)
> 15/08/27 17:53:08 INFO BlockManagerInfo: Added r

[jira] [Commented] (SPARK-2666) when task is FetchFailed cancel running tasks of failedStage

2015-09-01 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725552#comment-14725552
 ] 

Imran Rashid commented on SPARK-2666:
-

I'm copying [~kayousterhout]'s comment from the PR here for discussion:

bq. My understanding is that it can help to let the remaining tasks run -- 
because they may hit Fetch failures from different map outputs than the 
original fetch failure, which will lead to the DAGScheduler to more quickly 
reschedule all of the failed tasks. For example, if an executor failed and had 
multiple map outputs on it, the first Fetch failure will only tell us about one 
of the map outputs being missing, and it's helpful to learn about all of them 
before we resubmit the earlier stage. Did you already think about this / am I 
misunderstanding the issue?

Things may have changed in the meantime, but I'm pretty sure that now, when 
there is a fetch failure, Spark assumes it has lost *all* of the map output for 
that host.  It's a bit confusing -- it seems we first remove only [the one map 
output with the 
failure|https://github.com/apache/spark/blob/391e6be0ae883f3ea0fab79463eb8b618af79afb/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1134]
 but then we remove all map outputs in [{{handleExecutorLost}} | 
https://github.com/apache/spark/blob/391e6be0ae883f3ea0fab79463eb8b618af79afb/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1184].
  I suppose it could still be useful to run the remaining tasks, as they may 
discover *another* executor that has died, but I don't think it's worth it just 
for that, right?

Elsewhere we've also discussed always killing all tasks as soon as the 
{{TaskSetManager}} is marked as a zombie, see 
https://github.com/squito/spark/pull/4.

I'm particularly interested because this is relevant to SPARK-10370.  In that case, 
there wouldn't be any benefit to leaving tasks as running after marking the 
stage as zombie.  If we do want to cancel all tasks as soon as we mark a stage 
as zombie, then I'd prefer we go the route of making {{isZombie}} private, and 
make task cancellation part of {{markAsZombie}} to make the code easier to 
follow and make sure we always cancel tasks.

Is my understanding correct?  Other opinions on the right approach here?

> when task is FetchFailed cancel running tasks of failedStage
> 
>
> Key: SPARK-2666
> URL: https://issues.apache.org/jira/browse/SPARK-2666
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Lianhui Wang
>
> In DAGScheduler's handleTaskCompletion, when the reason for a failed task is 
> FetchFailed, cancel the running tasks of the failedStage before adding the 
> failedStage to the failedStages queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10370) After a stages map outputs are registered, all running attempts should be marked as zombies

2015-09-01 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-10370:
-
Description: 
Follow-up to SPARK-5259.  During stage retry, it's possible for a stage to 
"complete" by registering all its map output and starting the downstream 
stages before the latest task set has completed.  This will result in the 
earlier task set continuing to submit tasks that are both unnecessary and 
increase the chance of hitting SPARK-8029.

Spark should mark all task sets for a stage as zombie as soon as its map 
output is registered.  Note that this involves coordination between the various 
scheduler components ({{DAGScheduler}} and {{TaskSetManager}} at least) which 
isn't easily testable with the current setup.

To be clear, this is *not* just referring to canceling running tasks (which may 
be taken care of by SPARK-2666).  This is to make sure that the taskset is 
marked as a zombie, to prevent submitting *new* tasks from this task set.

  was:
Follow up to SPARK-5259.  During stage retry, its possible for a stage to 
"complete" by registering all its map output and starting the downstream 
stages, before the latest task set has completed.  This will result in the 
earlier task set continuing to submit tasks, that are both unnecessary and 
increase the chance of hitting SPARK-8029.

Spark should mark all tasks sets for a stage as zombie as soon as its map 
output is registered.  Note that this involves coordination between the various 
scheduler components ({{DAGScheduler}} and {{TaskSetManager}} at least) which 
isn't easily testable with the current setup.


> After a stages map outputs are registered, all running attempts should be 
> marked as zombies
> ---
>
> Key: SPARK-10370
> URL: https://issues.apache.org/jira/browse/SPARK-10370
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Imran Rashid
>
> Follow-up to SPARK-5259.  During stage retry, it's possible for a stage to 
> "complete" by registering all its map output and starting the downstream 
> stages before the latest task set has completed.  This will result in the 
> earlier task set continuing to submit tasks that are both unnecessary and 
> increase the chance of hitting SPARK-8029.
> Spark should mark all task sets for a stage as zombie as soon as its map 
> output is registered.  Note that this involves coordination between the 
> various scheduler components ({{DAGScheduler}} and {{TaskSetManager}} at 
> least) which isn't easily testable with the current setup.
> To be clear, this is *not* just referring to canceling running tasks (which 
> may be taken care of by SPARK-2666).  This is to make sure that the taskset 
> is marked as a zombie, to prevent submitting *new* tasks from this task set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-10398) Migrate Spark download page to use new lua mirroring scripts

2015-09-01 Thread Luciano Resende (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luciano Resende reopened SPARK-10398:
-

There are a few other places where closer.cgi is referenced.

> Migrate Spark download page to use new lua mirroring scripts
> 
>
> Key: SPARK-10398
> URL: https://issues.apache.org/jira/browse/SPARK-10398
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Reporter: Luciano Resende
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.5.0
>
>
> From infra team :
> If you refer to www.apache.org/dyn/closer.cgi, please refer to
> www.apache.org/dyn/closer.lua instead from now on.
> Any non-conforming CGI scripts are no longer enabled, and are all
> rewritten to go to our new mirror system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10398) Migrate Spark download page to use new lua mirroring scripts

2015-09-01 Thread Luciano Resende (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luciano Resende updated SPARK-10398:

Attachment: SPARK-10398

This patch handles other download links referenced in the Spark docs as well.

> Migrate Spark download page to use new lua mirroring scripts
> 
>
> Key: SPARK-10398
> URL: https://issues.apache.org/jira/browse/SPARK-10398
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Reporter: Luciano Resende
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.5.0
>
> Attachments: SPARK-10398
>
>
> From infra team :
> If you refer to www.apache.org/dyn/closer.cgi, please refer to
> www.apache.org/dyn/closer.lua instead from now on.
> Any non-conforming CGI scripts are no longer enabled, and are all
> rewritten to go to our new mirror system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9988) Create local (external) sort operator

2015-09-01 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725589#comment-14725589
 ] 

Shixiong Zhu commented on SPARK-9988:
-

{{ExternalSorter}} is coupled with {{SparkEnv}}, {{ShuffleMemoryManager}} and 
{{DiskBlockManager}}, and so ultimately depends on {{SparkContext}}. [~rxin], any 
thoughts on how to avoid depending on {{SparkContext}}? I'm thinking that at least we 
need something like {{ShuffleMemoryManager}} and {{DiskBlockManager}}.

> Create local (external) sort operator
> -
>
> Key: SPARK-9988
> URL: https://issues.apache.org/jira/browse/SPARK-9988
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> Similar to the TungstenSort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10398) Migrate Spark download page to use new lua mirroring scripts

2015-09-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725592#comment-14725592
 ] 

Sean Owen commented on SPARK-10398:
---

Good catch, there's another use in the project docs themselves, not just the 
Apache site's download link. We use PRs rather than patches 
(https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) but I 
can easily do that. 

Editing the old doc releases gives me pause since they'd then not be the same 
docs you'd get by generating docs from the old release tag. However, I suspect 
it matters little either way, and so it should just be fixed.

> Migrate Spark download page to use new lua mirroring scripts
> 
>
> Key: SPARK-10398
> URL: https://issues.apache.org/jira/browse/SPARK-10398
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Reporter: Luciano Resende
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.5.0
>
> Attachments: SPARK-10398
>
>
> From infra team :
> If you refer to www.apache.org/dyn/closer.cgi, please refer to
> www.apache.org/dyn/closer.lua instead from now on.
> Any non-conforming CGI scripts are no longer enabled, and are all
> rewritten to go to our new mirror system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10400) Rename or deprecate SQL option "spark.sql.parquet.followParquetFormatSpec"

2015-09-01 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-10400:
--

 Summary: Rename or deprecate SQL option 
"spark.sql.parquet.followParquetFormatSpec"
 Key: SPARK-10400
 URL: https://issues.apache.org/jira/browse/SPARK-10400
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor


We introduced the SQL option "spark.sql.parquet.followParquetFormatSpec" while 
working on implementing Parquet backwards-compatibility rules in SPARK-6777. It 
indicates whether we should use the legacy Parquet format adopted by Spark 1.4 and 
prior versions or the standard format defined in the parquet-format spec. However, 
the name of this option is somewhat confusing, because it's not intuitive 
why we shouldn't follow the spec. It would be nice to rename it to 
"spark.sql.parquet.writeLegacyFormat" and invert its default value (the two names 
have opposite meanings). Note that this option is not "public" ({{isPublic}} is 
false).

At the time of writing, 1.5 RC3 has already been cut. If we can't get this 
into 1.5, we can deprecate the old option in favor of the new one.
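
For context, a sketch of how the two spellings would relate from a user's point of view (the new option name is only the proposal above and does not exist yet):

{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-option-example")
sqlContext = SQLContext(sc)

# Current (1.5) internal option: ask for the standard parquet-format layout.
sqlContext.setConf("spark.sql.parquet.followParquetFormatSpec", "true")

# Proposed spelling with the inverted meaning -- the same intent would read as
# "do not write the legacy layout":
# sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "false")
{code}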



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2666) when task is FetchFailed cancel running tasks of failedStage

2015-09-01 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725615#comment-14725615
 ] 

Imran Rashid commented on SPARK-2666:
-

I realized I didn't spell out one of my main points very clearly: I am 
proposing widening this issue so that it is not only about {{FetchFailed}}.  I think 
instead we should consider changing this issue to refactor the code to unify 
"zombification" and cancelling tasks.  In general I know that smaller changes 
are better, especially related to the scheduler, but in this case I think we'll 
be able to improve the code by tackling them together.

> when task is FetchFailed cancel running tasks of failedStage
> 
>
> Key: SPARK-2666
> URL: https://issues.apache.org/jira/browse/SPARK-2666
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Lianhui Wang
>
> In DAGScheduler's handleTaskCompletion, when the reason for a failed task is 
> FetchFailed, cancel the running tasks of the failedStage before adding the 
> failedStage to the failedStages queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10398) Migrate Spark download page to use new lua mirroring scripts

2015-09-01 Thread Luciano Resende (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725619#comment-14725619
 ] 

Luciano Resende commented on SPARK-10398:
-

I can submit a PR for the docs as well; let me look into those.

> Migrate Spark download page to use new lua mirroring scripts
> 
>
> Key: SPARK-10398
> URL: https://issues.apache.org/jira/browse/SPARK-10398
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Reporter: Luciano Resende
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.5.0
>
> Attachments: SPARK-10398
>
>
> From infra team :
> If you refer to www.apache.org/dyn/closer.cgi, please refer to
> www.apache.org/dyn/closer.lua instead from now on.
> Any non-conforming CGI scripts are no longer enabled, and are all
> rewritten to go to our new mirror system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10398) Migrate Spark download page to use new lua mirroring scripts

2015-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725622#comment-14725622
 ] 

Apache Spark commented on SPARK-10398:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/8557

> Migrate Spark download page to use new lua mirroring scripts
> 
>
> Key: SPARK-10398
> URL: https://issues.apache.org/jira/browse/SPARK-10398
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Reporter: Luciano Resende
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.5.0
>
> Attachments: SPARK-10398
>
>
> From infra team :
> If you refer to www.apache.org/dyn/closer.cgi, please refer to
> www.apache.org/dyn/closer.lua instead from now on.
> Any non-conforming CGI scripts are no longer enabled, and are all
> rewritten to go to our new mirror system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10398) Migrate Spark download page to use new lua mirroring scripts

2015-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725632#comment-14725632
 ] 

Apache Spark commented on SPARK-10398:
--

User 'lresende' has created a pull request for this issue:
https://github.com/apache/spark/pull/8558

> Migrate Spark download page to use new lua mirroring scripts
> 
>
> Key: SPARK-10398
> URL: https://issues.apache.org/jira/browse/SPARK-10398
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Reporter: Luciano Resende
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.5.0
>
> Attachments: SPARK-10398
>
>
> From infra team :
> If you refer to www.apache.org/dyn/closer.cgi, please refer to
> www.apache.org/dyn/closer.lua instead from now on.
> Any non-conforming CGI scripts are no longer enabled, and are all
> rewritten to go to our new mirror system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10398) Migrate Spark download page to use new lua mirroring scripts

2015-09-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10398:
--
Assignee: Luciano Resende  (was: Sean Owen)

> Migrate Spark download page to use new lua mirroring scripts
> 
>
> Key: SPARK-10398
> URL: https://issues.apache.org/jira/browse/SPARK-10398
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Reporter: Luciano Resende
>Assignee: Luciano Resende
>Priority: Minor
> Fix For: 1.5.0
>
> Attachments: SPARK-10398
>
>
> From infra team :
> If you refer to www.apache.org/dyn/closer.cgi, please refer to
> www.apache.org/dyn/closer.lua instead from now on.
> Any non-conforming CGI scripts are no longer enabled, and are all
> rewritten to go to our new mirror system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9008) Stop and remove driver from supervised mode in spark-master interface

2015-09-01 Thread Alberto Miorin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725637#comment-14725637
 ] 

Alberto Miorin commented on SPARK-9008:
---

I have the same problem, but with Spark Mesos cluster mode. I tried 
spark-submit --kill, but the driver is always restarted
by the dispatcher.
I think there should be a spark-submit --unsupervise subcommand.

> Stop and remove driver from supervised mode in spark-master interface
> -
>
> Key: SPARK-9008
> URL: https://issues.apache.org/jira/browse/SPARK-9008
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Reporter: Jesper Lundgren
>Priority: Minor
>
> The cluster will automatically restart failing drivers when launched in 
> supervised cluster mode. However, there is no official way for an operations 
> team to stop a driver and keep it from restarting in case it is 
> malfunctioning. 
> I know there is "bin/spark-class org.apache.spark.deploy.Client kill" but 
> this is undocumented and does not always work so well.
> It would be great if there were a way to remove supervised mode to allow kill 
> -9 to work on a driver program.
> The documentation surrounding this could also see some improvements. It would 
> be nice to have some best-practice examples on how to work with supervised 
> mode, how to manage graceful shutdown, and how to catch TERM signals. (A TERM signal 
> will end with an exit code that triggers a restart in supervised mode unless 
> you change the exit code in the application logic.)
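
A minimal sketch of the kind of best-practice example being asked for, shown in Python for brevity (a JVM driver would use a shutdown hook instead), and assuming, as the description implies, that a zero exit code is treated as a clean stop that supervised mode does not restart:

{code}
import signal
import sys

def handle_term(signum, frame):
    # Graceful shutdown: stop streaming contexts, flush state, close connections, ...
    # then exit with 0 so the supervisor does not see a failure exit code and restart us.
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_term)

# ... normal driver logic runs here ...
{code}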



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10401) spark-submit --unsupervise

2015-09-01 Thread Alberto Miorin (JIRA)
Alberto Miorin created SPARK-10401:
--

 Summary: spark-submit --unsupervise 
 Key: SPARK-10401
 URL: https://issues.apache.org/jira/browse/SPARK-10401
 Project: Spark
  Issue Type: New Feature
  Components: Deploy, Mesos
Affects Versions: 1.5.0
Reporter: Alberto Miorin


When I submit a streaming job with the --supervise option to the new Mesos 
Spark dispatcher, I cannot decommission the job.
I tried spark-submit --kill, but the dispatcher always restarts it.
The driver and executors are both Docker containers.

I think there should be a spark-submit --unsupervise subcommand.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


