[jira] [Created] (SPARK-10387) Code generation for decision tree
Xiangrui Meng created SPARK-10387: - Summary: Code generation for decision tree Key: SPARK-10387 URL: https://issues.apache.org/jira/browse/SPARK-10387 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng Assignee: DB Tsai Provide code generation for decision tree and tree ensembles. Let's first discuss the design and then create new JIRAs for tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
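Since the design is still open, here is a minimal, hypothetical sketch of what code generation for a single decision tree could mean in practice (illustration only, not the design being proposed; the node classes below are simplified stand-ins, not the spark.ml tree API): instead of walking node objects at prediction time, the tree is compiled into a specialized predict function made of nested if/else branches.

{code}
// Hypothetical sketch: compile a decision tree into Scala source for a
// branch-only predict function. The node types are simplified stand-ins.
sealed trait Node
case class Leaf(prediction: Double) extends Node
case class Internal(featureIndex: Int, threshold: Double, left: Node, right: Node) extends Node

object TreeCodeGen {
  // Recursively emit nested if/else branches for one tree.
  private def genNode(node: Node, indent: String): String = node match {
    case Leaf(p) =>
      s"${indent}$p"
    case Internal(f, t, l, r) =>
      s"""${indent}if (features($f) <= $t) {
         |${genNode(l, indent + "  ")}
         |${indent}} else {
         |${genNode(r, indent + "  ")}
         |${indent}}""".stripMargin
  }

  // Wrap the generated branches in a predict method.
  def genPredict(root: Node): String =
    s"""def predict(features: Array[Double]): Double = {
       |${genNode(root, "  ")}
       |}""".stripMargin
}

object TreeCodeGenExample extends App {
  val tree = Internal(0, 0.5, Leaf(0.0), Internal(1, 1.5, Leaf(1.0), Leaf(2.0)))
  println(TreeCodeGen.genPredict(tree))
}
{code}

For tree ensembles the same idea would apply per tree, with the generated per-tree functions combined by the usual vote or weighted sum.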
[jira] [Commented] (SPARK-7132) Add fit with validation set to spark.ml GBT
[ https://issues.apache.org/jira/browse/SPARK-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724889#comment-14724889 ] Yanbo Liang commented on SPARK-7132: I will work on this issue. [~josephkb] I propose another way to resolve it: the GBT Estimator still takes a single input {{DataFrame}}, and we split it into training and validation datasets internally. Because the runWithValidation interface takes RDD[LabeledPoint] as input, this is easy to handle, and at the end of the GBT Estimator we can union the two datasets again. > Add fit with validation set to spark.ml GBT > --- > > Key: SPARK-7132 > URL: https://issues.apache.org/jira/browse/SPARK-7132 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > In spark.mllib GradientBoostedTrees, we have a method runWithValidation which > takes a validation set. We should add that to the spark.ml API. > This will require a bit of thinking about how the Pipelines API should handle > a validation set (since Transformers and Estimators only take 1 input > DataFrame). The current plan is to include an extra column in the input > DataFrame which indicates whether the row is for training, validation, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7132) Add fit with validation set to spark.ml GBT
[ https://issues.apache.org/jira/browse/SPARK-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724889#comment-14724889 ] Yanbo Liang edited comment on SPARK-7132 at 9/1/15 7:02 AM: I will work on this issue. [~josephkb] I propose another way to resolve it: the GBT Estimator still takes a single input {{DataFrame}}, and we split it into training and validation datasets internally. Because the runWithValidation interface takes RDD[LabeledPoint] as input, this is easy to handle, and at the end of the GBT Estimator we can union the two datasets again. was (Author: yanboliang): I will work on this issue. [~josephkb] I propose another way to resolve this issue. The GBT Estimator remains take 1 input {DataFrame}, and we will split it into training and validation dataset internal. Because the runWithValidation interface will take RDD[LabeledPoint] as input, it's easy to handle this. And at the end of the GBT Estimator, we can also union these two dataset. > Add fit with validation set to spark.ml GBT > --- > > Key: SPARK-7132 > URL: https://issues.apache.org/jira/browse/SPARK-7132 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > In spark.mllib GradientBoostedTrees, we have a method runWithValidation which > takes a validation set. We should add that to the spark.ml API. > This will require a bit of thinking about how the Pipelines API should handle > a validation set (since Transformers and Estimators only take 1 input > DataFrame). The current plan is to include an extra column in the input > DataFrame which indicates whether the row is for training, validation, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7132) Add fit with validation set to spark.ml GBT
[ https://issues.apache.org/jira/browse/SPARK-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724889#comment-14724889 ] Yanbo Liang edited comment on SPARK-7132 at 9/1/15 7:03 AM: I will work on this issue. [~josephkb] I propose another way to resolve it: the GBT Estimator still takes a single input DataFrame, and we split it into training and validation datasets internally. Because the runWithValidation interface takes RDD[LabeledPoint] as input, this is easy to handle, and at the end of the GBT Estimator we can union the two datasets again. was (Author: yanboliang): I will work on this issue. [~josephkb] I propose another way to resolve this issue. The GBT Estimator remains take 1 input {code|DataFrame}, and we will split it into training and validation dataset internal. Because the runWithValidation interface will take RDD[LabeledPoint] as input, it's easy to handle this. And at the end of the GBT Estimator, we can also union these two dataset. > Add fit with validation set to spark.ml GBT > --- > > Key: SPARK-7132 > URL: https://issues.apache.org/jira/browse/SPARK-7132 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > In spark.mllib GradientBoostedTrees, we have a method runWithValidation which > takes a validation set. We should add that to the spark.ml API. > This will require a bit of thinking about how the Pipelines API should handle > a validation set (since Transformers and Estimators only take 1 input > DataFrame). The current plan is to include an extra column in the input > DataFrame which indicates whether the row is for training, validation, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
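To make the proposal above concrete, here is a minimal sketch of the single-input-DataFrame approach: the estimator keeps one DataFrame, splits it internally, and converts both halves to RDD[LabeledPoint] so the existing runWithValidation path can be reused. This is an illustration only, not the spark.ml implementation; the column names, the validationFraction parameter, and the toLabeledPoint helper are assumptions made for the example.

{code}
// Hypothetical sketch only: the GBT estimator keeps a single input DataFrame
// and splits it into training and validation sets internally.
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

object GBTFitWithValidationSketch {

  // Assumed column layout: "label" (Double) and "features" (Vector).
  private def toLabeledPoint(df: DataFrame): RDD[LabeledPoint] =
    df.select("label", "features").map {
      case Row(label: Double, features: Vector) => LabeledPoint(label, features)
    }

  // Split one DataFrame into the (training, validation) pair expected by
  // GradientBoostedTrees.runWithValidation.
  def splitForValidation(
      dataset: DataFrame,
      validationFraction: Double = 0.25,
      seed: Long = 42L): (RDD[LabeledPoint], RDD[LabeledPoint]) = {
    val Array(train, validation) =
      dataset.randomSplit(Array(1.0 - validationFraction, validationFraction), seed)
    (toLabeledPoint(train), toLabeledPoint(validation))
  }
}
{code}

Whether the split fraction should be a Param, and whether the two halves need to be unioned again afterwards as suggested above, would be part of the design discussion.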
[jira] [Created] (SPARK-10388) Public dataset loader interface
Xiangrui Meng created SPARK-10388: - Summary: Public dataset loader interface Key: SPARK-10388 URL: https://issues.apache.org/jira/browse/SPARK-10388 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng Assignee: Xiangrui Meng It is very useful to have a public dataset loader to fetch ML datasets from popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design, requirements, and initial implementation. {code} val loader = new DatasetLoader(sqlContext) val df = loader.get("libsvm", "rcv1_train.binary") {code} Users should be able to list (or preview) datasets, e.g. {code} val datasets = loader.ls("libsvm") // returns a local DataFrame datasets.show() // list all datasets under libsvm repo {code} It would be nice to allow 3rd-party packages to register new repos. Both the API and implementation are pending discussion. Note that this requires HTTP and HTTPS support. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
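Since the registration API is pending discussion, one hypothetical shape it could take is sketched below; none of these types exist in Spark and the names are placeholders. A third-party package would implement a small repo trait and register it with the loader, after which the loader.get / loader.ls calls from the description work against any registered repo.

{code}
// Hypothetical API sketch for discussion only; these types do not exist in Spark.
import scala.collection.mutable
import org.apache.spark.sql.{DataFrame, SQLContext}

// A dataset repository that a third-party package could implement.
trait DatasetRepo {
  def name: String                                             // e.g. "libsvm", "uci"
  def ls(sqlContext: SQLContext): DataFrame                    // list available datasets
  def get(sqlContext: SQLContext, dataset: String): DataFrame  // fetch one dataset
}

class DatasetLoader(sqlContext: SQLContext) {
  private val repos = mutable.Map.empty[String, DatasetRepo]

  // Third-party packages call this once to make their repo visible.
  def register(repo: DatasetRepo): Unit = repos(repo.name) = repo

  def ls(repo: String): DataFrame = repos(repo).ls(sqlContext)

  def get(repo: String, dataset: String): DataFrame = repos(repo).get(sqlContext, dataset)
}
{code}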
[jira] [Updated] (SPARK-10324) MLlib 1.6 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10324: -- Description: Following SPARK-8445, we created this master list for MLlib features we plan to have in Spark 1.6. Please view this list as a wish list rather than a concrete plan, because we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delay in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there exist no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Remember to add `@Since("1.6.0")` annotation to new public APIs. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. * If you start reviewing a PR, please add yourself to the Shepherd field on JIRA. * If the code looks good to you, please comment "LGTM". For non-trivial PRs, please ping a maintainer to make a final pass. * After merging a PR, create and link JIRAs for Python, example code, and documentation if necessary. h1. Roadmap (WIP) This is NOT [a complete list of MLlib JIRAs for 1.6|https://issues.apache.org/jira/issues/?filter=12333208]. We only include umbrella JIRAs and high-level tasks. h2. Algorithms and performance * log-linear model for survival analysis (SPARK-8518) * normal equation approach for linear regression (SPARK-9834) * iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835) * robust linear regression with Huber loss (SPARK-3181) * vector-free L-BFGS (SPARK-10078) * tree partition by features (SPARK-3717) * bisecting k-means (SPARK-6517) * weighted instance support (SPARK-9610) * logistic regression (SPARK-7685) * linear regression (SPARK-9642) * random forest (SPARK-9478) * locality sensitive hashing (LSH) (SPARK-5992) * deep learning (SPARK-2352) * autoencoder (SPARK-4288) * restricted Boltzmann machine (RBM) (SPARK-4251) * convolutional neural network * factorization machine (SPARK-7008) * local linear algebra (SPARK-6442) * distributed LU decomposition (SPARK-8514) h2. 
Statistics * univariate statistics as UDAFs (SPARK-10384) * bivariate statistics as UDAFs (SPARK-10385) * R-like statistics for GLMs (SPARK-9835) * online hypothesis testing (SPARK-3147) h2. Pipeline API * pipeline persistence (SPARK-6725) * ML attribute API improvements (SPARK-8515) * feature transformers (SPARK-9930) * feature interaction (SPARK-9698) * SQL transformer (SPARK-8345) * ?? * test Kaggle datasets (SPARK-9941) h2. Model persistence * PMML export * naive Bayes (SPARK-8546) * decision tree (SPARK-8542) * model save/load * FPGrowth (SPARK-6724) * PrefixSpan (SPARK-10386) * code generation * decision tree and tree ensembles (SPARK-10387) h2. Data sources * LIBSVM data source (SPARK-10117) * public dataset loader (SPARK-10388) h2. Python API for ML The main goal of Python API is to have feature parity with Scala/Java API. * Python API for new algorithms * Python API for missing methods h2. SparkR API for ML * support more families and link functions in SparkR::glm (SPARK-9838, SPARK-9839, SPARK-9840) * better R formula support (SPARK-9681) * model summary with R-like statistics for GLMs (SPARK-9836, SPARK-9837) h2. Documentation * re-organize user guide (SPARK-8517) * @Since versions in spark.ml, pyspark.mllib, and pyspark.ml (SPARK-7751) * automatically test example code in user guide (SPARK-10382) was: Following SPARK-8445, we created this master list for MLlib features we plan to have in Spar
[jira] [Updated] (SPARK-10324) MLlib 1.6 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10324: -- Description: Following SPARK-8445, we created this master list for MLlib features we plan to have in Spark 1.6. Please view this list as a wish list rather than a concrete plan, because we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delay in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there exist no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Remember to add `@Since("1.6.0")` annotation to new public APIs. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. * If you start reviewing a PR, please add yourself to the Shepherd field on JIRA. * If the code looks good to you, please comment "LGTM". For non-trivial PRs, please ping a maintainer to make a final pass. * After merging a PR, create and link JIRAs for Python, example code, and documentation if necessary. h1. Roadmap (WIP) This is NOT [a complete list of MLlib JIRAs for 1.6|https://issues.apache.org/jira/issues/?filter=12333208]. We only include umbrella JIRAs and high-level tasks. h2. Algorithms and performance * log-linear model for survival analysis (SPARK-8518) * normal equation approach for linear regression (SPARK-9834) * iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835) * robust linear regression with Huber loss (SPARK-3181) * vector-free L-BFGS (SPARK-10078) * tree partition by features (SPARK-3717) * bisecting k-means (SPARK-6517) * weighted instance support (SPARK-9610) * logistic regression (SPARK-7685) * linear regression (SPARK-9642) * random forest (SPARK-9478) * locality sensitive hashing (LSH) (SPARK-5992) * deep learning (SPARK-2352) * autoencoder (SPARK-4288) * restricted Boltzmann machine (RBM) (SPARK-4251) * convolutional neural network (stretch) * factorization machine (SPARK-7008) * local linear algebra (SPARK-6442) * distributed LU decomposition (SPARK-8514) h2. 
Statistics * univariate statistics as UDAFs (SPARK-10384) * bivariate statistics as UDAFs (SPARK-10385) * R-like statistics for GLMs (SPARK-9835) * online hypothesis testing (SPARK-3147) h2. Pipeline API * pipeline persistence (SPARK-6725) * ML attribute API improvements (SPARK-8515) * feature transformers (SPARK-9930) * feature interaction (SPARK-9698) * SQL transformer (SPARK-8345) * ?? * test Kaggle datasets (SPARK-9941) h2. Model persistence * PMML export * naive Bayes (SPARK-8546) * decision tree (SPARK-8542) * model save/load * FPGrowth (SPARK-6724) * PrefixSpan (SPARK-10386) * code generation * decision tree and tree ensembles (SPARK-10387) h2. Data sources * LIBSVM data source (SPARK-10117) * public dataset loader (SPARK-10388) h2. Python API for ML The main goal of Python API is to have feature parity with Scala/Java API. * Python API for new algorithms * Python API for missing methods h2. SparkR API for ML * support more families and link functions in SparkR::glm (SPARK-9838, SPARK-9839, SPARK-9840) * better R formula support (SPARK-9681) * model summary with R-like statistics for GLMs (SPARK-9836, SPARK-9837) h2. Documentation * re-organize user guide (SPARK-8517) * @Since versions in spark.ml, pyspark.mllib, and pyspark.ml (SPARK-7751) * automatically test example code in user guide (SPARK-10382) was: Following SPARK-8445, we created this master list for ML
[jira] [Updated] (SPARK-10324) MLlib 1.6 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10324: -- Description: Following SPARK-8445, we created this master list for MLlib features we plan to have in Spark 1.6. Please view this list as a wish list rather than a concrete plan, because we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delay in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there exist no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Remember to add `@Since("1.6.0")` annotation to new public APIs. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. * If you start reviewing a PR, please add yourself to the Shepherd field on JIRA. * If the code looks good to you, please comment "LGTM". For non-trivial PRs, please ping a maintainer to make a final pass. * After merging a PR, create and link JIRAs for Python, example code, and documentation if necessary. h1. Roadmap (WIP) This is NOT [a complete list of MLlib JIRAs for 1.6|https://issues.apache.org/jira/issues/?filter=12333208]. We only include umbrella JIRAs and high-level tasks. h2. Algorithms and performance * log-linear model for survival analysis (SPARK-8518) * normal equation approach for linear regression (SPARK-9834) * iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835) * robust linear regression with Huber loss (SPARK-3181) * vector-free L-BFGS (SPARK-10078) * tree partition by features (SPARK-3717) * bisecting k-means (SPARK-6517) * weighted instance support (SPARK-9610) * logistic regression (SPARK-7685) * linear regression (SPARK-9642) * random forest (SPARK-9478) * locality sensitive hashing (LSH) (SPARK-5992) * deep learning (SPARK-2352) * autoencoder (SPARK-4288) * restricted Boltzmann machine (RBM) (SPARK-4251) * convolutional neural network (stretch) * factorization machine (SPARK-7008) * local linear algebra (SPARK-6442) * distributed LU decomposition (SPARK-8514) h2. 
Statistics * univariate statistics as UDAFs (SPARK-10384) * bivariate statistics as UDAFs (SPARK-10385) * R-like statistics for GLMs (SPARK-9835) * online hypothesis testing (SPARK-3147) h2. Pipeline API * pipeline persistence (SPARK-6725) * ML attribute API improvements (SPARK-8515) * feature transformers (SPARK-9930) * feature interaction (SPARK-9698) * SQL transformer (SPARK-8345) * ?? * test Kaggle datasets (SPARK-9941) h2. Model persistence * PMML export * naive Bayes (SPARK-8546) * decision tree (SPARK-8542) * model save/load * FPGrowth (SPARK-6724) * PrefixSpan (SPARK-10386) * code generation * decision tree and tree ensembles (SPARK-10387) h2. Data sources * LIBSVM data source (SPARK-10117) * public dataset loader (SPARK-10388) h2. Python API for ML The main goal of Python API is to have feature parity with Scala/Java API. * Python API for new algorithms * Python API for missing methods h2. SparkR API for ML * support more families and link functions in SparkR::glm (SPARK-9838, SPARK-9839, SPARK-9840) * better R formula support (SPARK-9681) * model summary with R-like statistics for GLMs (SPARK-9836, SPARK-9837) h2. Documentation * re-organize user guide (SPARK-8517) * @Since versions in spark.ml, pyspark.mllib, and pyspark.ml (SPARK-7751) * automatically test example code in user guide (SPARK-10382) was: Following SPARK-8445, we created this master list for
[jira] [Updated] (SPARK-10324) MLlib 1.6 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10324: -- Priority: Blocker (was: Critical) > MLlib 1.6 Roadmap > - > > Key: SPARK-10324 > URL: https://issues.apache.org/jira/browse/SPARK-10324 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Blocker > > Following SPARK-8445, we created this master list for MLlib features we plan > to have in Spark 1.6. Please view this list as a wish list rather than a > concrete plan, because we don't have an accurate estimate of available > resources. Due to limited review bandwidth, features appearing on this list > will get higher priority during code review. But feel free to suggest new > items to the list in comments. We are experimenting with this process. Your > feedback would be greatly appreciated. > h1. Instructions > h2. For contributors: > * Please read > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > carefully. Code style, documentation, and unit tests are important. > * If you are a first-time Spark contributor, please always start with a > [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather > than a medium/big feature. Based on our experience, mixing the development > process with a big feature usually causes long delay in code review. > * Never work silently. Let everyone know on the corresponding JIRA page when > you start working on some features. This is to avoid duplicate work. For > small features, you don't need to wait to get JIRA assigned. > * For medium/big features or features with dependencies, please get assigned > first before coding and keep the ETA updated on the JIRA. If there exist no > activity on the JIRA page for a certain amount of time, the JIRA should be > released for other contributors. > * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one > after another. > * Remember to add `@Since("1.6.0")` annotation to new public APIs. > * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code > review greatly helps improve others' code as well as yours. > h2. For committers: > * Try to break down big features into small and specific JIRA tasks and link > them properly. > * Add "starter" label to starter tasks. > * Put a rough estimate for medium/big features and track the progress. > * If you start reviewing a PR, please add yourself to the Shepherd field on > JIRA. > * If the code looks good to you, please comment "LGTM". For non-trivial PRs, > please ping a maintainer to make a final pass. > * After merging a PR, create and link JIRAs for Python, example code, and > documentation if necessary. > h1. Roadmap (WIP) > This is NOT [a complete list of MLlib JIRAs for > 1.6|https://issues.apache.org/jira/issues/?filter=12333208]. We only include > umbrella JIRAs and high-level tasks. > h2. 
Algorithms and performance > * log-linear model for survival analysis (SPARK-8518) > * normal equation approach for linear regression (SPARK-9834) > * iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835) > * robust linear regression with Huber loss (SPARK-3181) > * vector-free L-BFGS (SPARK-10078) > * tree partition by features (SPARK-3717) > * bisecting k-means (SPARK-6517) > * weighted instance support (SPARK-9610) > ** logistic regression (SPARK-7685) > ** linear regression (SPARK-9642) > ** random forest (SPARK-9478) > * locality sensitive hashing (LSH) (SPARK-5992) > * deep learning (SPARK-2352) > ** autoencoder (SPARK-4288) > ** restricted Boltzmann machine (RBM) (SPARK-4251) > ** convolutional neural network (stretch) > * factorization machine (SPARK-7008) > * local linear algebra (SPARK-6442) > * distributed LU decomposition (SPARK-8514) > h2. Statistics > * univariate statistics as UDAFs (SPARK-10384) > * bivariate statistics as UDAFs (SPARK-10385) > * R-like statistics for GLMs (SPARK-9835) > * online hypothesis testing (SPARK-3147) > h2. Pipeline API > * pipeline persistence (SPARK-6725) > * ML attribute API improvements (SPARK-8515) > * feature transformers (SPARK-9930) > ** feature interaction (SPARK-9698) > ** SQL transformer (SPARK-8345) > ** ?? > * test Kaggle datasets (SPARK-9941) > h2. Model persistence > * PMML export > ** naive Bayes (SPARK-8546) > ** decision tree (SPARK-8542) > * model save/load > ** FPGrowth (SPARK-6724) > ** PrefixSpan (SPARK-10386) > * code generation > ** decision tree and tree ensembles (SPARK-10387) > h2. Data sources > * LIBSVM data source (SPARK-10117) > * public dataset loader (SPARK-10388) > h2. Python API for ML > The main goal of Python API is to have feature parity with Scala/Java API. > * Python API for new algorithms > * Python API for missing metho
[jira] [Updated] (SPARK-10324) MLlib 1.6 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10324: -- Description: Following SPARK-8445, we created this master list for MLlib features we plan to have in Spark 1.6. Please view this list as a wish list rather than a concrete plan, because we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delay in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there exist no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Remember to add `@Since("1.6.0")` annotation to new public APIs. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. * If you start reviewing a PR, please add yourself to the Shepherd field on JIRA. * If the code looks good to you, please comment "LGTM". For non-trivial PRs, please ping a maintainer to make a final pass. * After merging a PR, create and link JIRAs for Python, example code, and documentation if necessary. h1. Roadmap (WIP) This is NOT [a complete list of MLlib JIRAs for 1.6|https://issues.apache.org/jira/issues/?filter=12333208]. We only include umbrella JIRAs and high-level tasks. h2. Algorithms and performance * log-linear model for survival analysis (SPARK-8518) * normal equation approach for linear regression (SPARK-9834) * iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835) * robust linear regression with Huber loss (SPARK-3181) * vector-free L-BFGS (SPARK-10078) * tree partition by features (SPARK-3717) * bisecting k-means (SPARK-6517) * weighted instance support (SPARK-9610) ** logistic regression (SPARK-7685) ** linear regression (SPARK-9642) ** random forest (SPARK-9478) * locality sensitive hashing (LSH) (SPARK-5992) * deep learning (SPARK-2352) ** autoencoder (SPARK-4288) ** restricted Boltzmann machine (RBM) (SPARK-4251) ** convolutional neural network (stretch) * factorization machine (SPARK-7008) * local linear algebra (SPARK-6442) * distributed LU decomposition (SPARK-8514) h2. 
Statistics * univariate statistics as UDAFs (SPARK-10384) * bivariate statistics as UDAFs (SPARK-10385) * R-like statistics for GLMs (SPARK-9835) * online hypothesis testing (SPARK-3147) h2. Pipeline API * pipeline persistence (SPARK-6725) * ML attribute API improvements (SPARK-8515) * feature transformers (SPARK-9930) ** feature interaction (SPARK-9698) ** SQL transformer (SPARK-8345) ** ?? * test Kaggle datasets (SPARK-9941) h2. Model persistence * PMML export ** naive Bayes (SPARK-8546) ** decision tree (SPARK-8542) * model save/load ** FPGrowth (SPARK-6724) ** PrefixSpan (SPARK-10386) * code generation ** decision tree and tree ensembles (SPARK-10387) h2. Data sources * LIBSVM data source (SPARK-10117) * public dataset loader (SPARK-10388) h2. Python API for ML The main goal of Python API is to have feature parity with Scala/Java API. * Python API for new algorithms * Python API for missing methods h2. SparkR API for ML * support more families and link functions in SparkR::glm (SPARK-9838, SPARK-9839, SPARK-9840) * better R formula support (SPARK-9681) * model summary with R-like statistics for GLMs (SPARK-9836, SPARK-9837) h2. Documentation * re-organize user guide (SPARK-8517) * @Since versions in spark.ml, pyspark.mllib, and pyspark.ml (SPARK-7751) * automatically test example code in user guide (SPARK-10382) was: Following SPARK-8445, we created this master list for MLlib features we plan to have in Spark 1.6
[jira] [Commented] (SPARK-10324) MLlib 1.6 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724899#comment-14724899 ] Xiangrui Meng commented on SPARK-10324: --- Changed priority to blocker to make this list more discoverable. > MLlib 1.6 Roadmap > - > > Key: SPARK-10324 > URL: https://issues.apache.org/jira/browse/SPARK-10324 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Blocker > > Following SPARK-8445, we created this master list for MLlib features we plan > to have in Spark 1.6. Please view this list as a wish list rather than a > concrete plan, because we don't have an accurate estimate of available > resources. Due to limited review bandwidth, features appearing on this list > will get higher priority during code review. But feel free to suggest new > items to the list in comments. We are experimenting with this process. Your > feedback would be greatly appreciated. > h1. Instructions > h2. For contributors: > * Please read > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > carefully. Code style, documentation, and unit tests are important. > * If you are a first-time Spark contributor, please always start with a > [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather > than a medium/big feature. Based on our experience, mixing the development > process with a big feature usually causes long delay in code review. > * Never work silently. Let everyone know on the corresponding JIRA page when > you start working on some features. This is to avoid duplicate work. For > small features, you don't need to wait to get JIRA assigned. > * For medium/big features or features with dependencies, please get assigned > first before coding and keep the ETA updated on the JIRA. If there exist no > activity on the JIRA page for a certain amount of time, the JIRA should be > released for other contributors. > * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one > after another. > * Remember to add `@Since("1.6.0")` annotation to new public APIs. > * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code > review greatly helps improve others' code as well as yours. > h2. For committers: > * Try to break down big features into small and specific JIRA tasks and link > them properly. > * Add "starter" label to starter tasks. > * Put a rough estimate for medium/big features and track the progress. > * If you start reviewing a PR, please add yourself to the Shepherd field on > JIRA. > * If the code looks good to you, please comment "LGTM". For non-trivial PRs, > please ping a maintainer to make a final pass. > * After merging a PR, create and link JIRAs for Python, example code, and > documentation if necessary. > h1. Roadmap (WIP) > This is NOT [a complete list of MLlib JIRAs for > 1.6|https://issues.apache.org/jira/issues/?filter=12333208]. We only include > umbrella JIRAs and high-level tasks. > h2. 
Algorithms and performance > * log-linear model for survival analysis (SPARK-8518) > * normal equation approach for linear regression (SPARK-9834) > * iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835) > * robust linear regression with Huber loss (SPARK-3181) > * vector-free L-BFGS (SPARK-10078) > * tree partition by features (SPARK-3717) > * bisecting k-means (SPARK-6517) > * weighted instance support (SPARK-9610) > ** logistic regression (SPARK-7685) > ** linear regression (SPARK-9642) > ** random forest (SPARK-9478) > * locality sensitive hashing (LSH) (SPARK-5992) > * deep learning (SPARK-2352) > ** autoencoder (SPARK-4288) > ** restricted Boltzmann machine (RBM) (SPARK-4251) > ** convolutional neural network (stretch) > * factorization machine (SPARK-7008) > * local linear algebra (SPARK-6442) > * distributed LU decomposition (SPARK-8514) > h2. Statistics > * univariate statistics as UDAFs (SPARK-10384) > * bivariate statistics as UDAFs (SPARK-10385) > * R-like statistics for GLMs (SPARK-9835) > * online hypothesis testing (SPARK-3147) > h2. Pipeline API > * pipeline persistence (SPARK-6725) > * ML attribute API improvements (SPARK-8515) > * feature transformers (SPARK-9930) > ** feature interaction (SPARK-9698) > ** SQL transformer (SPARK-8345) > ** ?? > * test Kaggle datasets (SPARK-9941) > h2. Model persistence > * PMML export > ** naive Bayes (SPARK-8546) > ** decision tree (SPARK-8542) > * model save/load > ** FPGrowth (SPARK-6724) > ** PrefixSpan (SPARK-10386) > * code generation > ** decision tree and tree ensembles (SPARK-10387) > h2. Data sources > * LIBSVM data source (SPARK-10117) > * public dataset loader (SPARK-10388) > h2. Python API for ML > The main goal of Python API is to have feature parity wit
[jira] [Created] (SPARK-10389) support order by non-attribute grouping expression on Aggregate
Wenchen Fan created SPARK-10389: --- Summary: support order by non-attribute grouping expression on Aggregate Key: SPARK-10389 URL: https://issues.apache.org/jira/browse/SPARK-10389 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10389) support order by non-attribute grouping expression on Aggregate
[ https://issues.apache.org/jira/browse/SPARK-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-10389: Description: For example, we should support "SELECT MAX(value) FROM src GROUP BY key + 1 ORDER BY key + 1". > support order by non-attribute grouping expression on Aggregate > --- > > Key: SPARK-10389 > URL: https://issues.apache.org/jira/browse/SPARK-10389 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > > For example, we should support "SELECT MAX(value) FROM src GROUP BY key + 1 > ORDER BY key + 1". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
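For reference, a snippet that exercises this case from Scala, assuming a SQLContext in scope and a registered table src(key INT, value INT); the second query is a possible (untested here) workaround that orders by an alias of the grouping expression instead.

{code}
// Illustration only; assumes a registered table src(key INT, value INT).
val unsupported = sqlContext.sql(
  "SELECT MAX(value) FROM src GROUP BY key + 1 ORDER BY key + 1")

// Possible workaround until the above is supported: alias the grouping
// expression in the select list and order by the alias.
val aliased = sqlContext.sql(
  "SELECT key + 1 AS k, MAX(value) AS max_value FROM src GROUP BY key + 1 ORDER BY k")
{code}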
[jira] [Commented] (SPARK-10389) support order by non-attribute grouping expression on Aggregate
[ https://issues.apache.org/jira/browse/SPARK-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724905#comment-14724905 ] Apache Spark commented on SPARK-10389: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/8548 > support order by non-attribute grouping expression on Aggregate > --- > > Key: SPARK-10389 > URL: https://issues.apache.org/jira/browse/SPARK-10389 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > > For example, we should support "SELECT MAX(value) FROM src GROUP BY key + 1 > ORDER BY key + 1". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10389) support order by non-attribute grouping expression on Aggregate
[ https://issues.apache.org/jira/browse/SPARK-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10389: Assignee: (was: Apache Spark) > support order by non-attribute grouping expression on Aggregate > --- > > Key: SPARK-10389 > URL: https://issues.apache.org/jira/browse/SPARK-10389 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > > For example, we should support "SELECT MAX(value) FROM src GROUP BY key + 1 > ORDER BY key + 1". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10389) support order by non-attribute grouping expression on Aggregate
[ https://issues.apache.org/jira/browse/SPARK-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10389: Assignee: Apache Spark > support order by non-attribute grouping expression on Aggregate > --- > > Key: SPARK-10389 > URL: https://issues.apache.org/jira/browse/SPARK-10389 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > > For example, we should support "SELECT MAX(value) FROM src GROUP BY key + 1 > ORDER BY key + 1". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
Zoltán Zvara created SPARK-10390: Summary: Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J Key: SPARK-10390 URL: https://issues.apache.org/jira/browse/SPARK-10390 Project: Spark Issue Type: Bug Components: PySpark Reporter: Zoltán Zvara While running PySpark through iPython. {code} Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) at org.apache.spark.rdd.RDD.collect(RDD.scala:904) at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373) at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} {{spark-env.sh}} {code} export IPYTHON=1 export PYSPARK_PYTHON=/usr/bin/python3 export PYSPARK_DRIVER_PYTHON=ipython3 export PYSPARK_DRIVER_PYTHON_OPTS="notebook" {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
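The NoSuchMethodError above is thrown from Hadoop's FileInputFormat when it calls Guava's Stopwatch, so a first diagnostic step (a suggestion, not part of the report) is to check which Guava jar actually ends up on the classpath, for example from spark-shell:

{code}
// Which jar provides Guava's Stopwatch on this classpath?
println(classOf[com.google.common.base.Stopwatch]
  .getProtectionDomain.getCodeSource.getLocation)

// Does that Stopwatch still expose an elapsedMillis() method at all?
classOf[com.google.common.base.Stopwatch].getMethods
  .map(_.getName).distinct.sorted.foreach(println)
{code}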
[jira] [Created] (SPARK-10391) Spark 1.4.1 released news under news/spark-1-3-1-released.html
Jacek Laskowski created SPARK-10391: --- Summary: Spark 1.4.1 released news under news/spark-1-3-1-released.html Key: SPARK-10391 URL: https://issues.apache.org/jira/browse/SPARK-10391 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.1 Reporter: Jacek Laskowski Priority: Minor The link to the news "Spark 1.4.1 released" is under http://spark.apache.org/news/spark-1-3-1-released.html. It's certainly inconsistent with the other news. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10261) Add @Since annotation to ml.evaluation
[ https://issues.apache.org/jira/browse/SPARK-10261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724975#comment-14724975 ] Tijo Thomas commented on SPARK-10261: - I am working on this issue. > Add @Since annotation to ml.evaluation > -- > > Key: SPARK-10261 > URL: https://issues.apache.org/jira/browse/SPARK-10261 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Xiangrui Meng >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
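For anyone picking up this starter task, the change is mechanical: annotate the public classes, constructors, vals, and methods in ml.evaluation with org.apache.spark.annotation.Since, using the Spark version in which each API first appeared. A minimal illustration follows; SomeEvaluator and the version numbers are placeholders, not the actual ml.evaluation diff.

{code}
// Illustration only; SomeEvaluator is a placeholder, not a real spark.ml class.
import org.apache.spark.annotation.Since

@Since("1.4.0")
class SomeEvaluator @Since("1.4.0") (@Since("1.4.0") val metricName: String) {

  @Since("1.5.0")
  def evaluate(predictionsAndLabels: Seq[(Double, Double)]): Double = {
    // Placeholder metric: mean absolute error over (prediction, label) pairs.
    predictionsAndLabels.map { case (p, l) => math.abs(p - l) }.sum / predictionsAndLabels.size
  }
}
{code}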
[jira] [Created] (SPARK-10392) Pyspark - Wrong DateType support
Maciej Bryński created SPARK-10392: -- Summary: Pyspark - Wrong DateType support Key: SPARK-10392 URL: https://issues.apache.org/jira/browse/SPARK-10392 Project: Spark Issue Type: Bug Components: PySpark, SQL Reporter: Maciej Bryński I have following problem. I created table. {code} CREATE TABLE `spark_test` ( `id` INT(11) NULL, `date` DATE NULL ) COLLATE='utf8_general_ci' ENGINE=InnoDB ; INSERT INTO `sandbox`.`spark_test` (`id`, `date`) VALUES (1, '1970-01-01'); {code} Then I'm trying to read data and date '1970-01-01' is converted to int. This makes rdd incompatible with its own schema. {code} df = sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 'spark_test') print(df.collect()) df = sqlCtx.createDataFrame(df.rdd, df.schema) [Row(id=1, date=0)] --- TypeError Traceback (most recent call last) in () 1 df = sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4", 'spark_test') 2 print(df.collect()) > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema) /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio) 402 403 if isinstance(data, RDD): --> 404 rdd, schema = self._createFromRDD(data, schema, samplingRatio) 405 else: 406 rdd, schema = self._createFromLocal(data, schema) /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, schema, samplingRatio) 296 rows = rdd.take(10) 297 for row in rows: --> 298 _verify_type(row, schema) 299 300 else: /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) 1152 "length of fields (%d)" % (len(obj), len(dataType.fields))) 1153 for v, f in zip(obj, dataType.fields): -> 1154 _verify_type(v, f.dataType) 1155 1156 /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) 1136 # subclass of them can not be fromInternald in JVM 1137 if type(obj) not in _acceptable_types[_type]: -> 1138 raise TypeError("%s can not accept object in type %s" % (dataType, type(obj))) 1139 1140 if isinstance(dataType, ArrayType): TypeError: DateType can not accept object in type {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7770) Should GBT validationTol be relative tolerance?
[ https://issues.apache.org/jira/browse/SPARK-7770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7770: --- Assignee: (was: Apache Spark) > Should GBT validationTol be relative tolerance? > --- > > Key: SPARK-7770 > URL: https://issues.apache.org/jira/browse/SPARK-7770 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > In spark.mllib, GBT validationTol uses absolute tolerance. Relative > tolerance is arguably easier to set in a meaningful way. Questions: > * Should we change spark.mllib's validationTol meaning? > * Should we use relative tolerance in spark.ml's GBT (once we add validation > support)? > I would vote for changing both to relative tolerance, where the tolerance is > relative to the current loss on the training set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
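To make the question concrete, the two stopping rules differ only in how the threshold is applied; a small sketch (illustrative, not the actual spark.mllib GradientBoostedTrees code) is:

{code}
// Illustrative early-stopping checks; not the actual spark.mllib implementation.
object ValidationTolSketch {

  // Absolute tolerance: stop boosting when the raw improvement in loss
  // drops below a fixed validationTol.
  def stopAbsolute(previousLoss: Double, currentLoss: Double, validationTol: Double): Boolean =
    previousLoss - currentLoss < validationTol

  // Relative tolerance: stop when the improvement is small relative to the
  // current loss (the JIRA suggests scaling by the current training loss),
  // so the setting is less sensitive to the loss scale of a given dataset.
  def stopRelative(previousLoss: Double, currentLoss: Double, validationTol: Double): Boolean =
    previousLoss - currentLoss < validationTol * currentLoss
}
{code}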
[jira] [Commented] (SPARK-7770) Should GBT validationTol be relative tolerance?
[ https://issues.apache.org/jira/browse/SPARK-7770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725013#comment-14725013 ] Apache Spark commented on SPARK-7770: - User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/8549 > Should GBT validationTol be relative tolerance? > --- > > Key: SPARK-7770 > URL: https://issues.apache.org/jira/browse/SPARK-7770 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > In spark.mllib, GBT validationTol uses absolute tolerance. Relative > tolerance is arguably easier to set in a meaningful way. Questions: > * Should we change spark.mllib's validationTol meaning? > * Should we use relative tolerance in spark.ml's GBT (once we add validation > support)? > I would vote for changing both to relative tolerance, where the tolerance is > relative to the current loss on the training set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7770) Should GBT validationTol be relative tolerance?
[ https://issues.apache.org/jira/browse/SPARK-7770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7770: --- Assignee: Apache Spark > Should GBT validationTol be relative tolerance? > --- > > Key: SPARK-7770 > URL: https://issues.apache.org/jira/browse/SPARK-7770 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > > In spark.mllib, GBT validationTol uses absolute tolerance. Relative > tolerance is arguably easier to set in a meaningful way. Questions: > * Should we change spark.mllib's validationTol meaning? > * Should we use relative tolerance in spark.ml's GBT (once we add validation > support)? > I would vote for changing both to relative tolerance, where the tolerance is > relative to the current loss on the training set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10301) For struct type, if parquet's global schema has less fields than a file's schema, data reading will fail
[ https://issues.apache.org/jira/browse/SPARK-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-10301. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8509 [https://github.com/apache/spark/pull/8509] > For struct type, if parquet's global schema has less fields than a file's > schema, data reading will fail > > > Key: SPARK-10301 > URL: https://issues.apache.org/jira/browse/SPARK-10301 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Cheng Lian >Priority: Critical > Fix For: 1.6.0 > > > We hit this issue when reading a complex Parquet dataset without turning on > schema merging. The dataset consists of Parquet files with different but > compatible schemas, so the schema of the dataset is defined by > either a summary file or a random physical Parquet file if no summary files > are available. This schema may not contain all the fields that > appear in all physical files. > Parquet was designed with schema evolution and column pruning in mind, so it > should be legal for a user to use a tailored schema to read the dataset to > save disk IO. For example, say we have a Parquet dataset consisting of two > physical Parquet files with the following two schemas: > {noformat} > message m0 { > optional group f0 { > optional int64 f00; > optional int64 f01; > } > } > message m1 { > optional group f0 { > optional int64 f00; > optional int64 f01; > optional int64 f02; > } > optional double f1; > } > {noformat} > Users should be allowed to read the dataset with the following schema: > {noformat} > message m1 { > optional group f0 { > optional int64 f01; > optional int64 f02; > } > } > {noformat} > so that {{f0.f00}} and {{f1}} are never touched. 
The above case can be > expressed by the following {{spark-shell}} snippet: > {noformat} > import sqlContext._ > import sqlContext.implicits._ > import org.apache.spark.sql.types.{LongType, StructType} > val path = "/tmp/spark/parquet" > range(3).selectExpr("NAMED_STRUCT('f00', id, 'f01', id) AS f0").coalesce(1) > .write.mode("overwrite").parquet(path) > range(3).selectExpr("NAMED_STRUCT('f00', id, 'f01', id, 'f02', id) AS f0", > "CAST(id AS DOUBLE) AS f1").coalesce(1) > .write.mode("append").parquet(path) > val tailoredSchema = > new StructType() > .add( > "f0", > new StructType() > .add("f01", LongType, nullable = true) > .add("f02", LongType, nullable = true), > nullable = true) > read.schema(tailoredSchema).parquet(path).show() > {noformat} > Expected output should be: > {noformat} > ++ > | f0| > ++ > |[0,null]| > |[1,null]| > |[2,null]| > | [0,0]| > | [1,1]| > | [2,2]| > ++ > {noformat} > However, current 1.5-SNAPSHOT version throws the following exception: > {noformat} > org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in > block -1 in file > hdfs://localhost:9000/tmp/spark/parquet/part-r-0-56c4604e-c546-4f97-a316-05da8ab1a0bf.gz.parquet > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) > at > org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) > at > org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) >
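As a related note for readers hitting the same situation (not part of the fix above): when hand-writing a tailored schema is not an option, Parquet schema merging can be enabled explicitly for the read, at the cost of inspecting more file footers. A small example, assuming the same path as in the description:

{code}
// Heavier alternative to a tailored schema: let Spark merge the per-file
// Parquet schemas, reading the union of all fields instead of pruning them.
val merged = sqlContext.read
  .option("mergeSchema", "true")
  .parquet("/tmp/spark/parquet")
merged.printSchema()
{code}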
[jira] [Updated] (SPARK-10392) Pyspark - Wrong DateType support
[ https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-10392: --- Affects Version/s: 1.4.1 > Pyspark - Wrong DateType support > > > Key: SPARK-10392 > URL: https://issues.apache.org/jira/browse/SPARK-10392 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.1 >Reporter: Maciej Bryński > > I have following problem. > I created table. > {code} > CREATE TABLE `spark_test` ( > `id` INT(11) NULL, > `date` DATE NULL > ) > COLLATE='utf8_general_ci' > ENGINE=InnoDB > ; > INSERT INTO `sandbox`.`spark_test` (`id`, `date`) VALUES (1, '1970-01-01'); > {code} > Then I'm trying to read data and date '1970-01-01' is converted to int. This > makes rdd incompatible with its own schema. > {code} > df = > sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", > 'spark_test') > print(df.collect()) > df = sqlCtx.createDataFrame(df.rdd, df.schema) > [Row(id=1, date=0)] > --- > TypeError Traceback (most recent call last) > in () > 1 df = > sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4", > 'spark_test') > 2 print(df.collect()) > > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema) > /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, > schema, samplingRatio) > 402 > 403 if isinstance(data, RDD): > --> 404 rdd, schema = self._createFromRDD(data, schema, > samplingRatio) > 405 else: > 406 rdd, schema = self._createFromLocal(data, schema) > /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, > schema, samplingRatio) > 296 rows = rdd.take(10) > 297 for row in rows: > --> 298 _verify_type(row, schema) > 299 > 300 else: > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1152 "length of fields (%d)" % (len(obj), > len(dataType.fields))) >1153 for v, f in zip(obj, dataType.fields): > -> 1154 _verify_type(v, f.dataType) >1155 >1156 > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1136 # subclass of them can not be fromInternald in JVM >1137 if type(obj) not in _acceptable_types[_type]: > -> 1138 raise TypeError("%s can not accept object in type %s" % > (dataType, type(obj))) >1139 >1140 if isinstance(dataType, ArrayType): > TypeError: DateType can not accept object in type > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10370) After a stages map outputs are registered, all running attempts should be marked as zombies
[ https://issues.apache.org/jira/browse/SPARK-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10370: Assignee: (was: Apache Spark) > After a stages map outputs are registered, all running attempts should be > marked as zombies > --- > > Key: SPARK-10370 > URL: https://issues.apache.org/jira/browse/SPARK-10370 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: Imran Rashid > > Follow up to SPARK-5259. During stage retry, its possible for a stage to > "complete" by registering all its map output and starting the downstream > stages, before the latest task set has completed. This will result in the > earlier task set continuing to submit tasks, that are both unnecessary and > increase the chance of hitting SPARK-8029. > Spark should mark all tasks sets for a stage as zombie as soon as its map > output is registered. Note that this involves coordination between the > various scheduler components ({{DAGScheduler}} and {{TaskSetManager}} at > least) which isn't easily testable with the current setup. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10370) After a stages map outputs are registered, all running attempts should be marked as zombies
[ https://issues.apache.org/jira/browse/SPARK-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725043#comment-14725043 ] Apache Spark commented on SPARK-10370: -- User 'suyanNone' has created a pull request for this issue: https://github.com/apache/spark/pull/8550 > After a stages map outputs are registered, all running attempts should be > marked as zombies > --- > > Key: SPARK-10370 > URL: https://issues.apache.org/jira/browse/SPARK-10370 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: Imran Rashid > > Follow up to SPARK-5259. During stage retry, its possible for a stage to > "complete" by registering all its map output and starting the downstream > stages, before the latest task set has completed. This will result in the > earlier task set continuing to submit tasks, that are both unnecessary and > increase the chance of hitting SPARK-8029. > Spark should mark all tasks sets for a stage as zombie as soon as its map > output is registered. Note that this involves coordination between the > various scheduler components ({{DAGScheduler}} and {{TaskSetManager}} at > least) which isn't easily testable with the current setup. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10370) After a stages map outputs are registered, all running attempts should be marked as zombies
[ https://issues.apache.org/jira/browse/SPARK-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10370: Assignee: Apache Spark > After a stages map outputs are registered, all running attempts should be > marked as zombies > --- > > Key: SPARK-10370 > URL: https://issues.apache.org/jira/browse/SPARK-10370 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: Imran Rashid >Assignee: Apache Spark > > Follow up to SPARK-5259. During stage retry, its possible for a stage to > "complete" by registering all its map output and starting the downstream > stages, before the latest task set has completed. This will result in the > earlier task set continuing to submit tasks, that are both unnecessary and > increase the chance of hitting SPARK-8029. > Spark should mark all tasks sets for a stage as zombie as soon as its map > output is registered. Note that this involves coordination between the > various scheduler components ({{DAGScheduler}} and {{TaskSetManager}} at > least) which isn't easily testable with the current setup. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10246) Join in PySpark using a list of column names
[ https://issues.apache.org/jira/browse/SPARK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725045#comment-14725045 ] Alexey Grishchenko commented on SPARK-10246: Cannot reproduce, all the options with multiple conditions work on master branch: {code} >>> df.join(df4, ['name', 'age']).collect() [Row(age=5, name=u'Bob', height=None)] >>> df.join(df4, (df.name == df4.name) & (df.age == df4.age)).collect() [Row(age=5, name=u'Bob', age=5, height=None, name=u'Bob')] >>> cond = [df.name == df4.name, df.age == df4.age] >>> df.join(df4, cond).collect() Row(age=5, name=u'Bob', age=5, height=None, name=u'Bob')] {code} > Join in PySpark using a list of column names > > > Key: SPARK-10246 > URL: https://issues.apache.org/jira/browse/SPARK-10246 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Michal Monselise > > Currently, there are two supported methods to perform a join: join condition > and one column name. > The documentation specifies that the join function can accept a list of > conditions or a list of column names but neither are currently supported. > This is discussed in issue SPARK-7197 as well. > Functionality should match the documentation which currently contains an > example in /spark/python/pyspark/sql/dataframe.py line 560: > >>> df.join(df4, ['name', 'age']).select(df.name, df.age).collect() > [Row(name=u'Bob', age=5)] > """ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
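To make the three join variants above runnable on their own, the input DataFrames can be stood up as in the sketch below; {{df}} and {{df4}} here are assumed stand-ins mirroring the pyspark.sql doctest data, not taken from the ticket.
{code}
# Illustrative setup so the join examples above are self-contained.
from pyspark.sql import Row

df = sqlContext.createDataFrame([Row(age=2, name=u'Alice'),
                                 Row(age=5, name=u'Bob')])
df4 = sqlContext.createDataFrame([Row(age=10, height=80, name=u'Alice'),
                                  Row(age=5, height=None, name=u'Bob')])

# Join on a list of column names (the form the ticket is about).
df.join(df4, ['name', 'age']).collect()

# Join on a single condition, and on a list of conditions.
df.join(df4, (df.name == df4.name) & (df.age == df4.age)).collect()
df.join(df4, [df.name == df4.name, df.age == df4.age]).collect()
{code}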
[jira] [Updated] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
[ https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zoltán Zvara updated SPARK-10390: - Description: While running PySpark through iPython. {code} Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) at org.apache.spark.rdd.RDD.collect(RDD.scala:904) at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373) at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} {{spark-env.sh}} {code} export IPYTHON=1 export PYSPARK_PYTHON=/usr/bin/python3 export PYSPARK_DRIVER_PYTHON=ipython3 export PYSPARK_DRIVER_PYTHON_OPTS="notebook" {code} Spark built with: {{build/sbt -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 assembly --error}} was: While running PySpark through iPython. {code} Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. 
: java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:
[jira] [Updated] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
[ https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zoltán Zvara updated SPARK-10390: - Description: While running PySpark through iPython. {code} Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) at org.apache.spark.rdd.RDD.collect(RDD.scala:904) at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373) at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} {{spark-env.sh}} {code} export IPYTHON=1 export PYSPARK_PYTHON=/usr/bin/python3 export PYSPARK_DRIVER_PYTHON=ipython3 export PYSPARK_DRIVER_PYTHON_OPTS="notebook" {code} Spark built with: {{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}} was: While running PySpark through iPython. {code} Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. 
: java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:
[jira] [Updated] (SPARK-10392) Pyspark - Wrong DateType support
[ https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-10392: --- Description: I have following problem. I created table. {code} CREATE TABLE `spark_test` ( `id` INT(11) NULL, `date` DATE NULL ) COLLATE='utf8_general_ci' ENGINE=InnoDB ; INSERT INTO `sandbox`.`spark_test` (`id`, `date`) VALUES (1, '1970-01-01'); {code} Then I'm trying to read data - date '1970-01-01' is converted to int. This makes data frame incompatible with its own schema. {code} df = sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 'spark_test') print(df.collect()) df = sqlCtx.createDataFrame(df.rdd, df.schema) [Row(id=1, date=0)] --- TypeError Traceback (most recent call last) in () 1 df = sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4", 'spark_test') 2 print(df.collect()) > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema) /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio) 402 403 if isinstance(data, RDD): --> 404 rdd, schema = self._createFromRDD(data, schema, samplingRatio) 405 else: 406 rdd, schema = self._createFromLocal(data, schema) /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, schema, samplingRatio) 296 rows = rdd.take(10) 297 for row in rows: --> 298 _verify_type(row, schema) 299 300 else: /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) 1152 "length of fields (%d)" % (len(obj), len(dataType.fields))) 1153 for v, f in zip(obj, dataType.fields): -> 1154 _verify_type(v, f.dataType) 1155 1156 /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) 1136 # subclass of them can not be fromInternald in JVM 1137 if type(obj) not in _acceptable_types[_type]: -> 1138 raise TypeError("%s can not accept object in type %s" % (dataType, type(obj))) 1139 1140 if isinstance(dataType, ArrayType): TypeError: DateType can not accept object in type {code} was: I have following problem. I created table. {code} CREATE TABLE `spark_test` ( `id` INT(11) NULL, `date` DATE NULL ) COLLATE='utf8_general_ci' ENGINE=InnoDB ; INSERT INTO `sandbox`.`spark_test` (`id`, `date`) VALUES (1, '1970-01-01'); {code} Then I'm trying to read data - date '1970-01-01' is converted to int. This makes rdd incompatible with its own schema. 
{code} df = sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 'spark_test') print(df.collect()) df = sqlCtx.createDataFrame(df.rdd, df.schema) [Row(id=1, date=0)] --- TypeError Traceback (most recent call last) in () 1 df = sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4", 'spark_test') 2 print(df.collect()) > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema) /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio) 402 403 if isinstance(data, RDD): --> 404 rdd, schema = self._createFromRDD(data, schema, samplingRatio) 405 else: 406 rdd, schema = self._createFromLocal(data, schema) /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, schema, samplingRatio) 296 rows = rdd.take(10) 297 for row in rows: --> 298 _verify_type(row, schema) 299 300 else: /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) 1152 "length of fields (%d)" % (len(obj), len(dataType.fields))) 1153 for v, f in zip(obj, dataType.fields): -> 1154 _verify_type(v, f.dataType) 1155 1156 /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) 1136 # subclass of them can not be fromInternald in JVM 1137 if type(obj) not in _acceptable_types[_type]: -> 1138 raise TypeError("%s can not accept object in type %s" % (dataType, type(obj))) 1139 1140 if isinstance(dataType, ArrayType): TypeError: DateType can not accept object in type {code} > Pyspark - Wrong DateType support > > > Key: SPARK-10392 > URL: https://issues.apache.org/jira/browse/SPARK-10392 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions
[jira] [Updated] (SPARK-10392) Pyspark - Wrong DateType support
[ https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-10392: --- Description: I have following problem. I created table. {code} CREATE TABLE `spark_test` ( `id` INT(11) NULL, `date` DATE NULL ) COLLATE='utf8_general_ci' ENGINE=InnoDB ; INSERT INTO `sandbox`.`spark_test` (`id`, `date`) VALUES (1, '1970-01-01'); {code} Then I'm trying to read data - date '1970-01-01' is converted to int. This makes rdd incompatible with its own schema. {code} df = sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 'spark_test') print(df.collect()) df = sqlCtx.createDataFrame(df.rdd, df.schema) [Row(id=1, date=0)] --- TypeError Traceback (most recent call last) in () 1 df = sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4", 'spark_test') 2 print(df.collect()) > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema) /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio) 402 403 if isinstance(data, RDD): --> 404 rdd, schema = self._createFromRDD(data, schema, samplingRatio) 405 else: 406 rdd, schema = self._createFromLocal(data, schema) /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, schema, samplingRatio) 296 rows = rdd.take(10) 297 for row in rows: --> 298 _verify_type(row, schema) 299 300 else: /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) 1152 "length of fields (%d)" % (len(obj), len(dataType.fields))) 1153 for v, f in zip(obj, dataType.fields): -> 1154 _verify_type(v, f.dataType) 1155 1156 /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) 1136 # subclass of them can not be fromInternald in JVM 1137 if type(obj) not in _acceptable_types[_type]: -> 1138 raise TypeError("%s can not accept object in type %s" % (dataType, type(obj))) 1139 1140 if isinstance(dataType, ArrayType): TypeError: DateType can not accept object in type {code} was: I have following problem. I created table. {code} CREATE TABLE `spark_test` ( `id` INT(11) NULL, `date` DATE NULL ) COLLATE='utf8_general_ci' ENGINE=InnoDB ; INSERT INTO `sandbox`.`spark_test` (`id`, `date`) VALUES (1, '1970-01-01'); {code} Then I'm trying to read data and date '1970-01-01' is converted to int. This makes rdd incompatible with its own schema. 
{code} df = sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 'spark_test') print(df.collect()) df = sqlCtx.createDataFrame(df.rdd, df.schema) [Row(id=1, date=0)] --- TypeError Traceback (most recent call last) in () 1 df = sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4", 'spark_test') 2 print(df.collect()) > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema) /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio) 402 403 if isinstance(data, RDD): --> 404 rdd, schema = self._createFromRDD(data, schema, samplingRatio) 405 else: 406 rdd, schema = self._createFromLocal(data, schema) /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, schema, samplingRatio) 296 rows = rdd.take(10) 297 for row in rows: --> 298 _verify_type(row, schema) 299 300 else: /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) 1152 "length of fields (%d)" % (len(obj), len(dataType.fields))) 1153 for v, f in zip(obj, dataType.fields): -> 1154 _verify_type(v, f.dataType) 1155 1156 /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) 1136 # subclass of them can not be fromInternald in JVM 1137 if type(obj) not in _acceptable_types[_type]: -> 1138 raise TypeError("%s can not accept object in type %s" % (dataType, type(obj))) 1139 1140 if isinstance(dataType, ArrayType): TypeError: DateType can not accept object in type {code} > Pyspark - Wrong DateType support > > > Key: SPARK-10392 > URL: https://issues.apache.org/jira/browse/SPARK-10392 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4
[jira] [Commented] (SPARK-7544) pyspark.sql.types.Row should implement __getitem__
[ https://issues.apache.org/jira/browse/SPARK-7544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725066#comment-14725066 ] Maciej Bryński commented on SPARK-7544: --- Will this PR be added to spark ? > pyspark.sql.types.Row should implement __getitem__ > -- > > Key: SPARK-7544 > URL: https://issues.apache.org/jira/browse/SPARK-7544 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Reporter: Nicholas Chammas >Priority: Minor > > Following from the related discussions in [SPARK-7505] and [SPARK-7133], the > {{Row}} type should implement {{\_\_getitem\_\_}} so that people can do this > {code} > row['field'] > {code} > instead of this: > {code} > row.field > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
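For context, the sketch below contrasts the access patterns; it is illustrative only and assumes a plain {{Row}} plus the existing {{asDict()}} helper, with the bracket form being the behaviour this ticket requests.
{code}
# Illustrative sketch of the requested __getitem__ behaviour.
from pyspark.sql import Row

row = Row(name=u'Alice', age=1)

print(row.name)               # attribute access, works today
print(row.asDict()['name'])   # dict-style access via the asDict() helper

try:
    print(row['name'])        # the bracket access this ticket asks Row to support
except TypeError:
    # Without __getitem__, Row falls back to tuple indexing and rejects a string key.
    pass
{code}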
[jira] [Updated] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
[ https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zoltán Zvara updated SPARK-10390: - Description: While running PySpark through iPython. {code} Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) at org.apache.spark.rdd.RDD.collect(RDD.scala:904) at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373) at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} {{spark-env.sh}} {code} export IPYTHON=1 export PYSPARK_PYTHON=/usr/bin/python3 export PYSPARK_DRIVER_PYTHON=ipython3 export PYSPARK_DRIVER_PYTHON_OPTS="notebook" {code} Spark built with: {{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}} Not a problem, when built against {{Hadoop 2.4}}! was: While running PySpark through iPython. {code} Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. 
: java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910) at org.apach
[jira] [Resolved] (SPARK-10391) Spark 1.4.1 released news under news/spark-1-3-1-released.html
[ https://issues.apache.org/jira/browse/SPARK-10391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10391. --- Resolution: Fixed Assignee: Sean Owen Fix Version/s: 1.5.0 Fixed and pushed a revision to the site. Make sure to refresh in your browser to get the new HTML with the fixed link for the "Spark 1.4.1 released" news item. > Spark 1.4.1 released news under news/spark-1-3-1-released.html > -- > > Key: SPARK-10391 > URL: https://issues.apache.org/jira/browse/SPARK-10391 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.4.1 >Reporter: Jacek Laskowski >Assignee: Sean Owen >Priority: Minor > Fix For: 1.5.0 > > > The link to the news "Spark 1.4.1 released" is under > http://spark.apache.org/news/spark-1-3-1-released.html. It's certainly > inconsistent with the other news. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10393) use ML pipeline in LDA example
yuhao yang created SPARK-10393: -- Summary: use ML pipeline in LDA example Key: SPARK-10393 URL: https://issues.apache.org/jira/browse/SPARK-10393 Project: Spark Issue Type: Improvement Components: MLlib Reporter: yuhao yang Priority: Minor Since the logic of the text processing part has been moved to ML estimators/transformers, replace the related code in LDA Example with the ML pipeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
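A rough shape of the proposed change is sketched below; it assumes {{RegexTokenizer}}, {{StopWordsRemover}} and {{CountVectorizer}} are available in {{pyspark.ml.feature}} and that {{docs}} is a DataFrame with a string column "text", so it only illustrates the pipeline wiring, not the final example code.
{code}
# Hedged sketch: ML pipeline for text preprocessing feeding MLlib LDA.
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
from pyspark.mllib.clustering import LDA

tokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+")
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
vectorizer = CountVectorizer(inputCol="filtered", outputCol="features",
                             vocabSize=10000)

pipeline = Pipeline(stages=[tokenizer, remover, vectorizer])
vectorized = pipeline.fit(docs).transform(docs)

# MLlib LDA still expects an RDD of [documentId, termCountVector] pairs.
corpus = (vectorized.select("features").rdd
          .zipWithIndex()
          .map(lambda pair: [pair[1], pair[0].features]))
ldaModel = LDA.train(corpus, k=10)
{code}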
[jira] [Commented] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
[ https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725073#comment-14725073 ] Sean Owen commented on SPARK-10390: --- This means you've pulled in a later version of Guava. Make sure you didn't package anything >14 with your app, perhaps by accidentally bringing in Hadoop deps. I don't think this is a Spark problem (at least, not given the history of why Guava can't be entirely shaded etc) > Py4JJavaError java.lang.NoSuchMethodError: > com.google.common.base.Stopwatch.elapsedMillis()J > > > Key: SPARK-10390 > URL: https://issues.apache.org/jira/browse/SPARK-10390 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Zoltán Zvara > > While running PySpark through iPython. > {code} > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.collectAndServe. > : java.lang.NoSuchMethodError: > com.google.common.base.Stopwatch.elapsedMillis()J > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.collect(RDD.scala:904) > at > org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373) > at > org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > {code} > {{spark-env.sh}} > {code} > export IPYTHON=1 > export PYSPARK_PYTHON=/usr/bin/python3 > export 
PYSPARK_DRIVER_PYTHON=ipython3 > export PYSPARK_DRIVER_PYTHON_OPTS="notebook" > {code} > Spark built with: > {{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}} > Not a problem when built against {{Hadoop 2.4}}! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
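One way to confirm which Guava ends up on the driver classpath is plain Java reflection through the Py4J gateway; the snippet below is a diagnostic sketch only and assumes a running PySpark shell with {{sc}} (it relies on the internal {{sc._jvm}} handle).
{code}
# Diagnostic sketch: print which jar the Stopwatch class is loaded from,
# to spot a stray Guava version pulled in by Hadoop or the application.
jvm = sc._jvm
stopwatch_cls = jvm.java.lang.Class.forName("com.google.common.base.Stopwatch")
code_source = stopwatch_cls.getProtectionDomain().getCodeSource()
print(code_source.getLocation() if code_source is not None else "bootstrap classpath")
{code}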
[jira] [Updated] (SPARK-10314) [CORE]RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception when parallelism is big than data split size
[ https://issues.apache.org/jira/browse/SPARK-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10314: -- Fix Version/s: (was: 1.6.0) [~wangxiaoyu] Don't set Fix version; it's not resolved. > [CORE]RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception > when parallelism is big than data split size > > > Key: SPARK-10314 > URL: https://issues.apache.org/jira/browse/SPARK-10314 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.4.1 > Environment: Spark 1.4.1,Hadoop 2.6.0,Tachyon 0.6.4 >Reporter: Xiaoyu Wang >Priority: Minor > > RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception when > parallelism is big than data split size > {code} > val rdd = sc.parallelize(List(1, 2),2) > rdd.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP) > rdd.count() > {code} > is ok. > {code} > val rdd = sc.parallelize(List(1, 2),3) > rdd.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP) > rdd.count() > {code} > got exceptoin: > {noformat} > 15/08/27 17:53:07 INFO SparkContext: Starting job: count at :24 > 15/08/27 17:53:07 INFO DAGScheduler: Got job 0 (count at :24) with 3 > output partitions (allowLocal=false) > 15/08/27 17:53:07 INFO DAGScheduler: Final stage: ResultStage 0(count at > :24) > 15/08/27 17:53:07 INFO DAGScheduler: Parents of final stage: List() > 15/08/27 17:53:07 INFO DAGScheduler: Missing parents: List() > 15/08/27 17:53:07 INFO DAGScheduler: Submitting ResultStage 0 > (ParallelCollectionRDD[0] at parallelize at :21), which has no > missing parents > 15/08/27 17:53:07 INFO MemoryStore: ensureFreeSpace(1096) called with > curMem=0, maxMem=741196431 > 15/08/27 17:53:07 INFO MemoryStore: Block broadcast_0 stored as values in > memory (estimated size 1096.0 B, free 706.9 MB) > 15/08/27 17:53:07 INFO MemoryStore: ensureFreeSpace(788) called with > curMem=1096, maxMem=741196431 > 15/08/27 17:53:07 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes > in memory (estimated size 788.0 B, free 706.9 MB) > 15/08/27 17:53:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on localhost:43776 (size: 788.0 B, free: 706.9 MB) > 15/08/27 17:53:07 INFO SparkContext: Created broadcast 0 from broadcast at > DAGScheduler.scala:874 > 15/08/27 17:53:07 INFO DAGScheduler: Submitting 3 missing tasks from > ResultStage 0 (ParallelCollectionRDD[0] at parallelize at :21) > 15/08/27 17:53:07 INFO TaskSchedulerImpl: Adding task set 0.0 with 3 tasks > 15/08/27 17:53:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, > localhost, PROCESS_LOCAL, 1269 bytes) > 15/08/27 17:53:07 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, > localhost, PROCESS_LOCAL, 1270 bytes) > 15/08/27 17:53:07 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, > localhost, PROCESS_LOCAL, 1270 bytes) > 15/08/27 17:53:07 INFO Executor: Running task 2.0 in stage 0.0 (TID 2) > 15/08/27 17:53:07 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) > 15/08/27 17:53:07 INFO Executor: Running task 1.0 in stage 0.0 (TID 1) > 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_2 not found, computing it > 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_1 not found, computing it > 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_0 not found, computing it > 15/08/27 17:53:07 INFO ExternalBlockStore: ExternalBlockStore started > 15/08/27 17:53:08 WARN : tachyon.home is not set. Using > /mnt/tachyon_default_home as the default value. 
> 15/08/27 17:53:08 INFO : Tachyon client (version 0.6.4) is trying to connect > master @ localhost/127.0.0.1:19998 > 15/08/27 17:53:08 INFO : User registered at the master > localhost/127.0.0.1:19998 got UserId 109 > 15/08/27 17:53:08 INFO TachyonBlockManager: Created tachyon directory at > /spark/spark-c6ec419f-7c7d-48a6-8448-c2431e761ea5/driver/spark-tachyon-20150827175308-6aa5 > 15/08/27 17:53:08 INFO : Trying to get local worker host : localhost > 15/08/27 17:53:08 INFO : Connecting local worker @ localhost/127.0.0.1:29998 > 15/08/27 17:53:08 INFO : Folder /mnt/ramdisk/tachyonworker/users/109 was > created! > 15/08/27 17:53:08 INFO : /mnt/ramdisk/tachyonworker/users/109/4386235351040 > was created! > 15/08/27 17:53:08 INFO : /mnt/ramdisk/tachyonworker/users/109/4388382834688 > was created! > 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_0 on ExternalBlockStore > on localhost:43776 (size: 0.0 B) > 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_1 on ExternalBlockStore > on localhost:43776 (size: 2.0 B) > 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_2 on ExternalBlockStore > on localhost:43776 (size: 2.0 B) > 15/08/27 17:53:08 INFO BlockManager: Found block rdd_0_
[jira] [Commented] (SPARK-10393) use ML pipeline in LDA example
[ https://issues.apache.org/jira/browse/SPARK-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725078#comment-14725078 ] Apache Spark commented on SPARK-10393: -- User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/8551 > use ML pipeline in LDA example > -- > > Key: SPARK-10393 > URL: https://issues.apache.org/jira/browse/SPARK-10393 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: yuhao yang >Priority: Minor > > Since the logic of the text processing part has been moved to ML > estimators/transformers, replace the related code in LDA Example with the ML > pipeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10393) use ML pipeline in LDA example
[ https://issues.apache.org/jira/browse/SPARK-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10393: Assignee: (was: Apache Spark) > use ML pipeline in LDA example > -- > > Key: SPARK-10393 > URL: https://issues.apache.org/jira/browse/SPARK-10393 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: yuhao yang >Priority: Minor > > Since the logic of the text processing part has been moved to ML > estimators/transformers, replace the related code in LDA Example with the ML > pipeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10393) use ML pipeline in LDA example
[ https://issues.apache.org/jira/browse/SPARK-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10393: Assignee: Apache Spark > use ML pipeline in LDA example > -- > > Key: SPARK-10393 > URL: https://issues.apache.org/jira/browse/SPARK-10393 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: yuhao yang >Assignee: Apache Spark >Priority: Minor > > Since the logic of the text processing part has been moved to ML > estimators/transformers, replace the related code in LDA Example with the ML > pipeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9089) Failing to run simple job on Spark Standalone Cluster
[ https://issues.apache.org/jira/browse/SPARK-9089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9089. -- Resolution: Cannot Reproduce > Failing to run simple job on Spark Standalone Cluster > - > > Key: SPARK-9089 > URL: https://issues.apache.org/jira/browse/SPARK-9089 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 > Environment: Staging >Reporter: Amar Goradia >Priority: Critical > > We are trying out Spark and as part of that, we have setup Standalone Spark > Cluster. As part of testing things out, we simple open PySpark shell and ran > this simple job: a=sc.parallelize([1,2,3]).count() > As a result, we are getting errors. We tried googling around this error but > haven't been able to find exact reasoning behind why we are running into this > state. Can somebody please help us further look into this issue and advise us > on what we are missing here? > Here is full error stack: > >>> a=sc.parallelize([1,2,3]).count() > 15/07/16 00:52:15 INFO SparkContext: Starting job: count at :1 > 15/07/16 00:52:15 INFO DAGScheduler: Got job 5 (count at :1) with 2 > output partitions (allowLocal=false) > 15/07/16 00:52:15 INFO DAGScheduler: Final stage: ResultStage 5(count at > :1) > 15/07/16 00:52:15 INFO DAGScheduler: Parents of final stage: List() > 15/07/16 00:52:15 INFO DAGScheduler: Missing parents: List() > 15/07/16 00:52:15 INFO DAGScheduler: Submitting ResultStage 5 (PythonRDD[12] > at count at :1), which has no missing parents > 15/07/16 00:52:15 INFO TaskSchedulerImpl: Cancelling stage 5 > 15/07/16 00:52:15 INFO DAGScheduler: ResultStage 5 (count at :1) > failed in Unknown s > 15/07/16 00:52:15 INFO DAGScheduler: Job 5 failed: count at :1, took > 0.004963 s > Traceback (most recent call last): > File "", line 1, in > File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line > 972, in count > return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() > File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line > 963, in sum > return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) > File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line > 771, in reduce > vals = self.mapPartitions(func).collect() > File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line > 745, in collect > port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) > File > "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 538, in __call__ > File > "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", > line 300, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.collectAndServe. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task > serialization failed: java.lang.reflect.InvocationTargetException > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > java.lang.reflect.Constructor.newInstance(Constructor.java:526) > org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:68) > org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:60) > org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73) > org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:80) > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) > org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62) > org.apache.spark.SparkContext.broadcast(SparkContext.scala:1289) > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:874) > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:815) > org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:799) > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1419) > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411) > org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257) > at > o
[jira] [Updated] (SPARK-10392) Pyspark - Wrong DateType support
[ https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-10392: --- Description: I have following problem. I created table. {code} CREATE TABLE `spark_test` ( `id` INT(11) NULL, `date` DATE NULL ) COLLATE='utf8_general_ci' ENGINE=InnoDB ; INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01'); {code} Then I'm trying to read data - date '1970-01-01' is converted to int. This makes data frame incompatible with its own schema. {code} df = sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 'spark_test') print(df.collect()) df = sqlCtx.createDataFrame(df.rdd, df.schema) [Row(id=1, date=0)] --- TypeError Traceback (most recent call last) in () 1 df = sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4", 'spark_test') 2 print(df.collect()) > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema) /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio) 402 403 if isinstance(data, RDD): --> 404 rdd, schema = self._createFromRDD(data, schema, samplingRatio) 405 else: 406 rdd, schema = self._createFromLocal(data, schema) /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, schema, samplingRatio) 296 rows = rdd.take(10) 297 for row in rows: --> 298 _verify_type(row, schema) 299 300 else: /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) 1152 "length of fields (%d)" % (len(obj), len(dataType.fields))) 1153 for v, f in zip(obj, dataType.fields): -> 1154 _verify_type(v, f.dataType) 1155 1156 /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) 1136 # subclass of them can not be fromInternald in JVM 1137 if type(obj) not in _acceptable_types[_type]: -> 1138 raise TypeError("%s can not accept object in type %s" % (dataType, type(obj))) 1139 1140 if isinstance(dataType, ArrayType): TypeError: DateType can not accept object in type {code} was: I have following problem. I created table. {code} CREATE TABLE `spark_test` ( `id` INT(11) NULL, `date` DATE NULL ) COLLATE='utf8_general_ci' ENGINE=InnoDB ; INSERT INTO `sandbox`.`spark_test` (`id`, `date`) VALUES (1, '1970-01-01'); {code} Then I'm trying to read data - date '1970-01-01' is converted to int. This makes data frame incompatible with its own schema. 
{code} df = sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 'spark_test') print(df.collect()) df = sqlCtx.createDataFrame(df.rdd, df.schema) [Row(id=1, date=0)] --- TypeError Traceback (most recent call last) in () 1 df = sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4", 'spark_test') 2 print(df.collect()) > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema) /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio) 402 403 if isinstance(data, RDD): --> 404 rdd, schema = self._createFromRDD(data, schema, samplingRatio) 405 else: 406 rdd, schema = self._createFromLocal(data, schema) /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, schema, samplingRatio) 296 rows = rdd.take(10) 297 for row in rows: --> 298 _verify_type(row, schema) 299 300 else: /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) 1152 "length of fields (%d)" % (len(obj), len(dataType.fields))) 1153 for v, f in zip(obj, dataType.fields): -> 1154 _verify_type(v, f.dataType) 1155 1156 /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) 1136 # subclass of them can not be fromInternald in JVM 1137 if type(obj) not in _acceptable_types[_type]: -> 1138 raise TypeError("%s can not accept object in type %s" % (dataType, type(obj))) 1139 1140 if isinstance(dataType, ArrayType): TypeError: DateType can not accept object in type {code} > Pyspark - Wrong DateType support > > > Key: SPARK-10392 > URL: https://issues.apache.org/jira/browse/SPARK-10392 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1
[jira] [Commented] (SPARK-9878) ReduceByKey + FullOuterJoin return 0 element if using an empty RDD
[ https://issues.apache.org/jira/browse/SPARK-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725117#comment-14725117 ] Alexey Grishchenko commented on SPARK-9878: --- Not reproduced on master: {code} scala> println("ok :"+sc.parallelize(List((3,4),(4,5))).fullOuterJoin(sc.emptyRDD[(Int,Seq[Int])]).count) ok :2 scala> println("ko: "+sc.parallelize(List((3,4),(4,5))).fullOuterJoin(sc.emptyRDD[(Int,Seq[Int])].reduceByKey((e1, e2) => e1 ++ e2)).count) ko: 2 {code} > ReduceByKey + FullOuterJoin return 0 element if using an empty RDD > --- > > Key: SPARK-9878 > URL: https://issues.apache.org/jira/browse/SPARK-9878 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 > Environment: linux ubuntu 64b spark-hadoop > launched with Local[2] >Reporter: durand remi >Priority: Minor > > code to reproduce: > println("ok > :"+sc.parallelize(List((3,4),(4,5))).fullOuterJoin(sc.emptyRDD[(Int,Seq[Int])]).count) > println("ko: > "+sc.parallelize(List((3,4),(4,5))).fullOuterJoin(sc.emptyRDD[(Int,Seq[Int])].reduceByKey((e1, > e2) => e1 ++ e2)).count) > what i expect: > ok: 2 > ko: 2 > but what i have: > ok: 2 > ko: 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9878) ReduceByKey + FullOuterJoin return 0 element if using an empty RDD
[ https://issues.apache.org/jira/browse/SPARK-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9878. -- Resolution: Cannot Reproduce Agree, I also can't reproduce this. > ReduceByKey + FullOuterJoin return 0 element if using an empty RDD > --- > > Key: SPARK-9878 > URL: https://issues.apache.org/jira/browse/SPARK-9878 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 > Environment: linux ubuntu 64b spark-hadoop > launched with Local[2] >Reporter: durand remi >Priority: Minor > > code to reproduce: > println("ok > :"+sc.parallelize(List((3,4),(4,5))).fullOuterJoin(sc.emptyRDD[(Int,Seq[Int])]).count) > println("ko: > "+sc.parallelize(List((3,4),(4,5))).fullOuterJoin(sc.emptyRDD[(Int,Seq[Int])].reduceByKey((e1, > e2) => e1 ++ e2)).count) > what i expect: > ok: 2 > ko: 2 > but what i have: > ok: 2 > ko: 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8730) Deser primitive class with Java serialization
[ https://issues.apache.org/jira/browse/SPARK-8730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8730: - Assignee: Eugen Cepoi > Deser primitive class with Java serialization > - > > Key: SPARK-8730 > URL: https://issues.apache.org/jira/browse/SPARK-8730 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Eugen Cepoi >Assignee: Eugen Cepoi >Priority: Critical > Fix For: 1.6.0 > > > Objects that have a primitive Class as a property cannot be deserialized > using Java serialization. Class.forName does not work for primitives. > Example of such an object: > class Foo extends Serializable { > val intClass = classOf[Int] > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10374) Spark-core 1.5.0-RC2 can create version conflicts with apps depending on protobuf-2.4
[ https://issues.apache.org/jira/browse/SPARK-10374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10374: -- Component/s: Build > Spark-core 1.5.0-RC2 can create version conflicts with apps depending on > protobuf-2.4 > - > > Key: SPARK-10374 > URL: https://issues.apache.org/jira/browse/SPARK-10374 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.0 >Reporter: Matt Cheah > > My Hadoop cluster is running 2.0.0-CDH4.7.0, and I have an application that > depends on the Spark 1.5.0 libraries via Gradle, and Hadoop 2.0.0 libraries. > When I run the driver application, I can hit the following error: > {code} > … java.lang.UnsupportedOperationException: This is > supposed to be overridden by subclasses. > at > com.google.protobuf.GeneratedMessage.getUnknownFields(GeneratedMessage.java:180) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto.getSerializedSize(ClientNamenodeProtocolProtos.java:30108) > at > com.google.protobuf.AbstractMessageLite.toByteString(AbstractMessageLite.java:49) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.constructRpcRequest(ProtobufRpcEngine.java:149) > {code} > This application used to work when pulling in Spark 1.4.1 dependencies, and > thus this is a regression. > I used Gradle’s dependencyInsight task to dig a bit deeper. Against our Spark > 1.4.1-backed project, it shows that dependency resolution pulls in Protobuf > 2.4.0a from the Hadoop CDH4 modules and Protobuf 2.5.0-spark from the Spark > modules. It appears that Spark used to shade its protobuf dependencies and > hence Spark’s and Hadoop’s protobuf dependencies wouldn’t collide. However > when I ran dependencyInsight again against Spark 1.5 and it looks like > protobuf is no longer shaded from the Spark module. 
> 1.4.1 dependencyInsight: > {code} > com.google.protobuf:protobuf-java:2.4.0a > +--- org.apache.hadoop:hadoop-common:2.0.0-cdh4.6.0 > |\--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0 > | +--- compile > | \--- org.apache.spark:spark-core_2.10:1.4.1 > | +--- compile > | +--- org.apache.spark:spark-sql_2.10:1.4.1 > | |\--- compile > | \--- org.apache.spark:spark-catalyst_2.10:1.4.1 > | \--- org.apache.spark:spark-sql_2.10:1.4.1 (*) > \--- org.apache.hadoop:hadoop-hdfs:2.0.0-cdh4.6.0 > \--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0 (*) > org.spark-project.protobuf:protobuf-java:2.5.0-spark > \--- org.spark-project.akka:akka-remote_2.10:2.3.4-spark > \--- org.apache.spark:spark-core_2.10:1.4.1 > +--- compile > +--- org.apache.spark:spark-sql_2.10:1.4.1 > |\--- compile > \--- org.apache.spark:spark-catalyst_2.10:1.4.1 >\--- org.apache.spark:spark-sql_2.10:1.4.1 (*) > {code} > 1.5.0-rc2 dependencyInsight: > {code} > com.google.protobuf:protobuf-java:2.5.0 (conflict resolution) > \--- com.typesafe.akka:akka-remote_2.10:2.3.11 > \--- org.apache.spark:spark-core_2.10:1.5.0-rc2 > +--- compile > +--- org.apache.spark:spark-sql_2.10:1.5.0-rc2 > |\--- compile > \--- org.apache.spark:spark-catalyst_2.10:1.5.0-rc2 >\--- org.apache.spark:spark-sql_2.10:1.5.0-rc2 (*) > com.google.protobuf:protobuf-java:2.4.0a -> 2.5.0 > +--- org.apache.hadoop:hadoop-common:2.0.0-cdh4.6.0 > |\--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0 > | +--- compile > | \--- org.apache.spark:spark-core_2.10:1.5.0-rc2 > | +--- compile > | +--- org.apache.spark:spark-sql_2.10:1.5.0-rc2 > | |\--- compile > | \--- org.apache.spark:spark-catalyst_2.10:1.5.0-rc2 > | \--- org.apache.spark:spark-sql_2.10:1.5.0-rc2 (*) > \--- org.apache.hadoop:hadoop-hdfs:2.0.0-cdh4.6.0 > \--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0 (*) > {code} > Clearly we can't force the version to be one way or the other. If I force > protobuf to use 2.5.0, then invoking Hadoop code from my application will > break as Hadoop 2.0.0 jars are compiled against protobuf-2.4. On the other > hand, forcing protobuf to use version 2.4 breaks spark-core code that is > compiled against protobuf-2.5. Note that protobuf-2.4 and protobuf-2.5 are > not binary compatible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10394) Make GBTParams use shared "stepSize"
Yanbo Liang created SPARK-10394: --- Summary: Make GBTParams use shared "stepSize" Key: SPARK-10394 URL: https://issues.apache.org/jira/browse/SPARK-10394 Project: Spark Issue Type: Improvement Components: ML Reporter: Yanbo Liang Priority: Minor GBTParams currently declares "stepSize" as its learning rate. ML has the shared param class "HasStepSize"; GBTParams can extend it rather than duplicating the implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
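The shared-param pattern being referred to is roughly the following. This is a simplified stand-in for illustration: the real {{HasStepSize}} trait lives in {{org.apache.spark.ml.param.shared}} and is code-generated, and the default value shown here is an assumption.

{code}
import org.apache.spark.ml.param.{DoubleParam, ParamValidators, Params}

// Stand-in for the shared trait: declares stepSize once, with its doc and validator.
trait HasStepSize extends Params {
  final val stepSize: DoubleParam = new DoubleParam(this, "stepSize",
    "step size (a.k.a. learning rate) for shrinking the contribution of each estimator",
    ParamValidators.gt(0))
  final def getStepSize: Double = $(stepSize)
}

// GBTParams would then mix in the shared trait and only set its own default,
// instead of re-declaring the parameter.
trait GBTParams extends HasStepSize {
  setDefault(stepSize -> 0.1)
}
{code}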
[jira] [Commented] (SPARK-10394) Make GBTParams use shared "stepSize"
[ https://issues.apache.org/jira/browse/SPARK-10394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725133#comment-14725133 ] Apache Spark commented on SPARK-10394: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/8552 > Make GBTParams use shared "stepSize" > > > Key: SPARK-10394 > URL: https://issues.apache.org/jira/browse/SPARK-10394 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Priority: Minor > > GBTParams has "stepSize" as learning rate currently. > ML has shared param class "HasStepSize", GBTParams can extend from it rather > than duplicated implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10394) Make GBTParams use shared "stepSize"
[ https://issues.apache.org/jira/browse/SPARK-10394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10394: Assignee: (was: Apache Spark) > Make GBTParams use shared "stepSize" > > > Key: SPARK-10394 > URL: https://issues.apache.org/jira/browse/SPARK-10394 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Priority: Minor > > GBTParams has "stepSize" as learning rate currently. > ML has shared param class "HasStepSize", GBTParams can extend from it rather > than duplicated implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10394) Make GBTParams use shared "stepSize"
[ https://issues.apache.org/jira/browse/SPARK-10394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10394: Assignee: Apache Spark > Make GBTParams use shared "stepSize" > > > Key: SPARK-10394 > URL: https://issues.apache.org/jira/browse/SPARK-10394 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Assignee: Apache Spark >Priority: Minor > > GBTParams has "stepSize" as learning rate currently. > ML has shared param class "HasStepSize", GBTParams can extend from it rather > than duplicated implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9622) DecisionTreeRegressor: provide variance of prediction
[ https://issues.apache.org/jira/browse/SPARK-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725137#comment-14725137 ] Yanbo Liang commented on SPARK-9622: I agree to return a Double column of variances at present. I will try to submit PR. > DecisionTreeRegressor: provide variance of prediction > - > > Key: SPARK-9622 > URL: https://issues.apache.org/jira/browse/SPARK-9622 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > Variance of predicted value, as estimated from training data. > Analogous to class probabilities for classification. > See [SPARK-3727] for discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
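For intuition only, the quantity under discussion can be approximated today by grouping the training data by the tree's prediction (each leaf emits one constant value) and computing the per-group label variance. This is a rough sketch of the statistic, not the API being proposed; it assumes a DataFrame with the usual "features" and "label" columns, and it conflates leaves that happen to share a prediction value.

{code}
import org.apache.spark.ml.regression.DecisionTreeRegressor
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col, pow}

def approxLeafVariances(train: DataFrame): DataFrame = {
  val model = new DecisionTreeRegressor().fit(train)
  model.transform(train)
    .groupBy("prediction")
    // Population variance of the training labels per predicted value: E[y^2] - E[y]^2.
    .agg((avg(col("label") * col("label")) - pow(avg(col("label")), 2)).as("variance"))
}
{code}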
[jira] [Created] (SPARK-10395) Simplify CatalystReadSupport
Cheng Lian created SPARK-10395: -- Summary: Simplify CatalystReadSupport Key: SPARK-10395 URL: https://issues.apache.org/jira/browse/SPARK-10395 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor The Parquet {{ReadSupport}} API is a little overcomplicated for historical reasons. In older versions of parquet-mr (say 1.6.0rc3 and prior), {{ReadSupport}} needed to be instantiated and initialized twice, on both the driver side and the executor side: the {{init()}} method was for driver-side initialization, while {{prepareForRead()}} was for the executor side. Starting from parquet-mr 1.6.0 this is no longer the case, and {{ReadSupport}} is only instantiated and initialized on the executor side, so it is now theoretically fine to combine these two methods into a single initialization method. The only reason (that I can think of) to still keep both is parquet-mr API backwards compatibility. Because of this, we no longer need to rely on {{ReadContext}} to pass the requested schema from {{init()}} to {{prepareForRead()}}; a private `var` for the requested schema in {{CatalystReadSupport}} would be enough. Also, after removing the old Parquet support code, we now always set the Catalyst requested schema properly when reading Parquet files, so all the "fallback" logic in {{CatalystReadSupport}} is now redundant. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
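A rough sketch of the shape being proposed, with types simplified and the record materializer stubbed out; the real class targets {{InternalRow}} and also does Catalyst schema clipping, so treat this only as an illustration of holding the requested schema in a private var instead of round-tripping it through {{ReadContext}} metadata.

{code}
import java.util.{Map => JMap}

import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.api.{InitContext, ReadSupport}
import org.apache.parquet.hadoop.api.ReadSupport.ReadContext
import org.apache.parquet.io.api.RecordMaterializer
import org.apache.parquet.schema.MessageType

class SimplifiedReadSupport[T] extends ReadSupport[T] {
  // Requested schema captured once in init(); no need to smuggle it through
  // the ReadContext key/value metadata.
  private var requestedSchema: MessageType = _

  override def init(context: InitContext): ReadContext = {
    // Placeholder: the real code derives this from the Catalyst requested schema.
    requestedSchema = context.getFileSchema
    new ReadContext(requestedSchema)
  }

  override def prepareForRead(
      conf: Configuration,
      keyValueMetaData: JMap[String, String],
      fileSchema: MessageType,
      readContext: ReadContext): RecordMaterializer[T] = {
    // Build the materializer from the privately held requestedSchema (stubbed here).
    ???
  }
}
{code}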
[jira] [Commented] (SPARK-10395) Simplify CatalystReadSupport
[ https://issues.apache.org/jira/browse/SPARK-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725175#comment-14725175 ] Apache Spark commented on SPARK-10395: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/8553 > Simplify CatalystReadSupport > > > Key: SPARK-10395 > URL: https://issues.apache.org/jira/browse/SPARK-10395 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > > The API interface of Parquet {{ReadSupport}} is a little bit over complicated > because of historical reasons. In older versions of parquet-mr (say 1.6.0rc3 > and prior), {{ReadSupport}} need to be instantiated and initialized twice on > both driver side and executor side. The {{init()}} method is for driver side > initialization, while {{prepareForRead()}} is for executor side. However, > starting from parquet-mr 1.6.0, it's no longer the case, and {{ReadSupport}} > is only instantiated and initialized on executor side. So, theoretically, > now it's totally fine to combine these two methods into a single > initialization method. The only reason (I could think of) to still have them > here is for parquet-mr API backwards-compatibility. > Due to this reason, we no longer need to rely on {{ReadContext}} to pass > requested schema from {{init()}} to {{prepareForRead()}}, using a private > `var` for requested schema in {{CatalystReadSupport}} would be enough. > Another thing is that, after removing the old Parquet support code, now we > always set Catalyst requested schema properly when reading Parquet files. So > all those "fallback" logic in {{CatalystReadSupport}} is now redundant. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10395) Simplify CatalystReadSupport
[ https://issues.apache.org/jira/browse/SPARK-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10395: Assignee: Cheng Lian (was: Apache Spark) > Simplify CatalystReadSupport > > > Key: SPARK-10395 > URL: https://issues.apache.org/jira/browse/SPARK-10395 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > > The API interface of Parquet {{ReadSupport}} is a little bit over complicated > because of historical reasons. In older versions of parquet-mr (say 1.6.0rc3 > and prior), {{ReadSupport}} need to be instantiated and initialized twice on > both driver side and executor side. The {{init()}} method is for driver side > initialization, while {{prepareForRead()}} is for executor side. However, > starting from parquet-mr 1.6.0, it's no longer the case, and {{ReadSupport}} > is only instantiated and initialized on executor side. So, theoretically, > now it's totally fine to combine these two methods into a single > initialization method. The only reason (I could think of) to still have them > here is for parquet-mr API backwards-compatibility. > Due to this reason, we no longer need to rely on {{ReadContext}} to pass > requested schema from {{init()}} to {{prepareForRead()}}, using a private > `var` for requested schema in {{CatalystReadSupport}} would be enough. > Another thing is that, after removing the old Parquet support code, now we > always set Catalyst requested schema properly when reading Parquet files. So > all those "fallback" logic in {{CatalystReadSupport}} is now redundant. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10395) Simplify CatalystReadSupport
[ https://issues.apache.org/jira/browse/SPARK-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10395: Assignee: Apache Spark (was: Cheng Lian) > Simplify CatalystReadSupport > > > Key: SPARK-10395 > URL: https://issues.apache.org/jira/browse/SPARK-10395 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Apache Spark >Priority: Minor > > The API interface of Parquet {{ReadSupport}} is a little bit over complicated > because of historical reasons. In older versions of parquet-mr (say 1.6.0rc3 > and prior), {{ReadSupport}} need to be instantiated and initialized twice on > both driver side and executor side. The {{init()}} method is for driver side > initialization, while {{prepareForRead()}} is for executor side. However, > starting from parquet-mr 1.6.0, it's no longer the case, and {{ReadSupport}} > is only instantiated and initialized on executor side. So, theoretically, > now it's totally fine to combine these two methods into a single > initialization method. The only reason (I could think of) to still have them > here is for parquet-mr API backwards-compatibility. > Due to this reason, we no longer need to rely on {{ReadContext}} to pass > requested schema from {{init()}} to {{prepareForRead()}}, using a private > `var` for requested schema in {{CatalystReadSupport}} would be enough. > Another thing is that, after removing the old Parquet support code, now we > always set Catalyst requested schema properly when reading Parquet files. So > all those "fallback" logic in {{CatalystReadSupport}} is now redundant. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
[ https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725189#comment-14725189 ] Zoltán Zvara commented on SPARK-10390: -- I did not packed Guava with my app, this is a clean Spark build in terms of dependencies, built with: {{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}} > Py4JJavaError java.lang.NoSuchMethodError: > com.google.common.base.Stopwatch.elapsedMillis()J > > > Key: SPARK-10390 > URL: https://issues.apache.org/jira/browse/SPARK-10390 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Zoltán Zvara > > While running PySpark through iPython. > {code} > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.collectAndServe. > : java.lang.NoSuchMethodError: > com.google.common.base.Stopwatch.elapsedMillis()J > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.collect(RDD.scala:904) > at > org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373) > at > org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > {code} > {{spark-env.sh}} > {code} > export IPYTHON=1 > export PYSPARK_PYTHON=/usr/bin/python3 > export PYSPARK_DRIVER_PYTHON=ipython3 > export PYSPARK_DRIVER_PYTHON_OPTS="notebook" > {code} > Spark built with: > 
{{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}} > Not a problem, when built against {{Hadoop 2.4}}! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
[ https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725189#comment-14725189 ] Zoltán Zvara edited comment on SPARK-10390 at 9/1/15 11:03 AM: --- I did not pack Guava with my app, this is a clean Spark build in terms of dependencies, built with: {{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}} was (Author: ehnalis): I did not packed Guava with my app, this is a clean Spark build in terms of dependencies, built with: {{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}} > Py4JJavaError java.lang.NoSuchMethodError: > com.google.common.base.Stopwatch.elapsedMillis()J > > > Key: SPARK-10390 > URL: https://issues.apache.org/jira/browse/SPARK-10390 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Zoltán Zvara > > While running PySpark through iPython. > {code} > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.collectAndServe. > : java.lang.NoSuchMethodError: > com.google.common.base.Stopwatch.elapsedMillis()J > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.collect(RDD.scala:904) > at > org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373) > at > org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > 
{code} > {{spark-env.sh}} > {code} > export IPYTHON=1 > export PYSPARK_PYTHON=/usr/bin/python3 > export PYSPARK_DRIVER_PYTHON=ipython3 > export PYSPARK_DRIVER_PYTHON_OPTS="notebook" > {code} > Spark built with: > {{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}} > Not a problem, when built against {{Hadoop 2.4}}! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
[ https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725199#comment-14725199 ] Sean Owen commented on SPARK-10390: --- It definitely means you have a later version of Guava in your deployment somehow, than Spark or Hadoop expects. The version you have packaged doesn't contain a method that the older one does. Try the Maven build, to narrow it down? it's the built of reference, not SBT. > Py4JJavaError java.lang.NoSuchMethodError: > com.google.common.base.Stopwatch.elapsedMillis()J > > > Key: SPARK-10390 > URL: https://issues.apache.org/jira/browse/SPARK-10390 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Zoltán Zvara > > While running PySpark through iPython. > {code} > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.collectAndServe. > : java.lang.NoSuchMethodError: > com.google.common.base.Stopwatch.elapsedMillis()J > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.collect(RDD.scala:904) > at > org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373) > at > org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > {code} > {{spark-env.sh}} > {code} > export IPYTHON=1 > export PYSPARK_PYTHON=/usr/bin/python3 > export 
PYSPARK_DRIVER_PYTHON=ipython3 > export PYSPARK_DRIVER_PYTHON_OPTS="notebook" > {code} > Spark built with: > {{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}} > Not a problem, when built against {{Hadoop 2.4}}! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save
[ https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725211#comment-14725211 ] Vinod KC commented on SPARK-10199: -- [~mengxr] I've measured the overhead of reflexion in save/load operation, please refer the results in this link https://github.com/vinodkc/xtique/blob/master/overhead_duetoReflection.csv Also I've measured the performance gain in save/load methods without reflexion after taking average of 5 times test executions Please refer the performance gain % in this two links https://github.com/vinodkc/xtique/blob/master/performance_Benchmark_save.csv https://github.com/vinodkc/xtique/blob/master/performance_Benchmark_load.csv > Avoid using reflections for parquet model save > -- > > Key: SPARK-10199 > URL: https://issues.apache.org/jira/browse/SPARK-10199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Feynman Liang >Priority: Minor > > These items are not high priority since the overhead writing to Parquest is > much greater than for runtime reflections. > Multiple model save/load in MLlib use case classes to infer a schema for the > data frame saved to Parquet. However, inferring a schema from case classes or > tuples uses [runtime > reflection|https://github.com/apache/spark/blob/d7b4c095271c36fcc7f9ded267ecf5ec66fac803/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L361] > which is unnecessary since the types are already known at the time `save` is > called. > It would be better to just specify the schema for the data frame directly > using {{sqlContext.createDataFrame(dataRDD, schema)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
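The change suggested in the issue is easy to sketch. Assuming a hypothetical node record (the field names below are made up; the real save code uses its own case classes), the schema is written out once instead of being inferred from the case class via runtime reflection:

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types._

// Hypothetical record type standing in for the model's node data.
case class NodeData(treeId: Int, nodeId: Int, prediction: Double)

def saveNodes(sqlContext: SQLContext, nodes: RDD[NodeData], path: String): Unit = {
  // Explicit schema: no runtime reflection over NodeData is needed.
  val schema = StructType(Seq(
    StructField("treeId", IntegerType, nullable = false),
    StructField("nodeId", IntegerType, nullable = false),
    StructField("prediction", DoubleType, nullable = false)))
  val rows = nodes.map(n => Row(n.treeId, n.nodeId, n.prediction))
  sqlContext.createDataFrame(rows, schema).write.parquet(path)
}
{code}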
[jira] [Created] (SPARK-10396) spark-sql ctrl+c does not exit
linbao111 created SPARK-10396: - Summary: spark-sql ctrl+c does not exit Key: SPARK-10396 URL: https://issues.apache.org/jira/browse/SPARK-10396 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Reporter: linbao111 If you type "ctrl+c", the spark-sql process exits (yarn-client mode), but you can still see the Spark job in the cluster job browser, which redirects to the Spark UI service on port 4040 of the driver host. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10396) spark-sql ctrl+c does not exit
[ https://issues.apache.org/jira/browse/SPARK-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10396. --- Resolution: Duplicate It's helpful if you can please search JIRA first; it's easy to find several existing issues on the same topic. Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a JIRA. > spark-sql ctrl+c does not exit > -- > > Key: SPARK-10396 > URL: https://issues.apache.org/jira/browse/SPARK-10396 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: linbao111 > > If you type "ctrl+c", the spark-sql process exits (yarn-client mode), but you can > still see the Spark job in the cluster job browser, which redirects to the Spark UI > service on port 4040 of the driver host. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725284#comment-14725284 ] Rajeev Reddy commented on SPARK-5226: - Hello Aliaksei Litouka, I have looked into your implementation: it takes coordinate points (i.e., doubles) as input for clustering. Can you please tell me how I can extend this to cluster a set of text documents? > Add DBSCAN Clustering Algorithm to MLlib > > > Key: SPARK-5226 > URL: https://issues.apache.org/jira/browse/SPARK-5226 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Muhammad-Ali A'rabi >Priority: Minor > Labels: DBSCAN, clustering > > MLlib is all k-means now, and I think we should add some new clustering > algorithms to it. The first candidate, I think, is DBSCAN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
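One common way to bridge that gap (an assumption on my part, not something taken from the linked implementation) is to featurize the documents first, for example as TF-IDF vectors, and then hand the resulting {{RDD[Vector]}} to any vector-based clusterer such as DBSCAN or k-means:

{code}
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Turn tokenized documents into TF-IDF vectors that a clustering algorithm
// can consume in place of raw coordinate points.
def toTfIdfVectors(docs: RDD[Seq[String]]): RDD[Vector] = {
  val tf = new HashingTF(1 << 18).transform(docs)
  tf.cache()  // IDF makes two passes over the data: fit, then transform
  new IDF(minDocFreq = 2).fit(tf).transform(tf)
}
{code}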
[jira] [Updated] (SPARK-10392) Pyspark - Wrong DateType support on JDBC connection
[ https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-10392: --- Summary: Pyspark - Wrong DateType support on JDBC connection (was: Pyspark - Wrong DateType support) > Pyspark - Wrong DateType support on JDBC connection > --- > > Key: SPARK-10392 > URL: https://issues.apache.org/jira/browse/SPARK-10392 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.1 >Reporter: Maciej Bryński > > I have following problem. > I created table. > {code} > CREATE TABLE `spark_test` ( > `id` INT(11) NULL, > `date` DATE NULL > ) > COLLATE='utf8_general_ci' > ENGINE=InnoDB > ; > INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01'); > {code} > Then I'm trying to read data - date '1970-01-01' is converted to int. This > makes data frame incompatible with its own schema. > {code} > df = > sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", > 'spark_test') > print(df.collect()) > df = sqlCtx.createDataFrame(df.rdd, df.schema) > [Row(id=1, date=0)] > --- > TypeError Traceback (most recent call last) > in () > 1 df = > sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4", > 'spark_test') > 2 print(df.collect()) > > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema) > /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, > schema, samplingRatio) > 402 > 403 if isinstance(data, RDD): > --> 404 rdd, schema = self._createFromRDD(data, schema, > samplingRatio) > 405 else: > 406 rdd, schema = self._createFromLocal(data, schema) > /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, > schema, samplingRatio) > 296 rows = rdd.take(10) > 297 for row in rows: > --> 298 _verify_type(row, schema) > 299 > 300 else: > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1152 "length of fields (%d)" % (len(obj), > len(dataType.fields))) >1153 for v, f in zip(obj, dataType.fields): > -> 1154 _verify_type(v, f.dataType) >1155 >1156 > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1136 # subclass of them can not be fromInternald in JVM >1137 if type(obj) not in _acceptable_types[_type]: > -> 1138 raise TypeError("%s can not accept object in type %s" % > (dataType, type(obj))) >1139 >1140 if isinstance(dataType, ArrayType): > TypeError: DateType can not accept object in type > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10261) Add @Since annotation to ml.evaluation
[ https://issues.apache.org/jira/browse/SPARK-10261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725300#comment-14725300 ] Apache Spark commented on SPARK-10261: -- User 'tijoparacka' has created a pull request for this issue: https://github.com/apache/spark/pull/8554 > Add @Since annotation to ml.evaluation > -- > > Key: SPARK-10261 > URL: https://issues.apache.org/jira/browse/SPARK-10261 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Xiangrui Meng >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10261) Add @Since annotation to ml.evaluation
[ https://issues.apache.org/jira/browse/SPARK-10261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10261: Assignee: (was: Apache Spark) > Add @Since annotation to ml.evaluation > -- > > Key: SPARK-10261 > URL: https://issues.apache.org/jira/browse/SPARK-10261 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Xiangrui Meng >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10261) Add @Since annotation to ml.evaluation
[ https://issues.apache.org/jira/browse/SPARK-10261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10261: Assignee: Apache Spark > Add @Since annotation to ml.evaluation > -- > > Key: SPARK-10261 > URL: https://issues.apache.org/jira/browse/SPARK-10261 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Xiangrui Meng >Assignee: Apache Spark >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10162) PySpark filters with datetimes mess up when datetimes have timezones.
[ https://issues.apache.org/jira/browse/SPARK-10162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725329#comment-14725329 ] Apache Spark commented on SPARK-10162: -- User '0x0FFF' has created a pull request for this issue: https://github.com/apache/spark/pull/8555 > PySpark filters with datetimes mess up when datetimes have timezones. > - > > Key: SPARK-10162 > URL: https://issues.apache.org/jira/browse/SPARK-10162 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Kevin Cox > > PySpark appears to ignore timezone information when filtering on (and working > in general with) datetimes. > Please see the example below. The generated filter in the query plan is 5 > hours off (my computer is EST). > {code} > In [1]: df = sc.sql.createDataFrame([], StructType([StructField("dt", > TimestampType())])) > In [2]: df.filter(df.dt > datetime(2000, 01, 01, tzinfo=UTC)).explain() > Filter (dt#9 > 9467028) > Scan PhysicalRDD[dt#9] > {code} > Note that 9467028 == Sat 1 Jan 2000 05:00:00 UTC -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10397) Make Python's SparkContext self-descriptive on "print sc"
Sergey Tryuber created SPARK-10397: -- Summary: Make Python's SparkContext self-descriptive on "print sc" Key: SPARK-10397 URL: https://issues.apache.org/jira/browse/SPARK-10397 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.4.0 Reporter: Sergey Tryuber Priority: Trivial When I execute in Python shell: {code} print sc {code} I receive something like: {noformat} {noformat} But this is very inconvenient, especially if a user wants to create a good-looking and self-descriptive IPython Notebook. He would like to see some information about his Spark cluster. In contrast, H2O context does have this feature and it is very helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10262) Add @Since annotation to ml.attribute
[ https://issues.apache.org/jira/browse/SPARK-10262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725388#comment-14725388 ] Tijo Thomas commented on SPARK-10262: - I am working on this. > Add @Since annotation to ml.attribute > - > > Key: SPARK-10262 > URL: https://issues.apache.org/jira/browse/SPARK-10262 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Xiangrui Meng >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10392) Pyspark - Wrong DateType support on JDBC connection
[ https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725390#comment-14725390 ] Alexey Grishchenko commented on SPARK-10392: This is a corner case for DateType.fromInternal implementation: {code} >>> from pyspark.sql.types import * >>> a = DateType() >>> a.fromInternal(0) 0 >>> a.fromInternal(1) datetime.date(1970, 1, 2) {code} > Pyspark - Wrong DateType support on JDBC connection > --- > > Key: SPARK-10392 > URL: https://issues.apache.org/jira/browse/SPARK-10392 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.1 >Reporter: Maciej Bryński > > I have following problem. > I created table. > {code} > CREATE TABLE `spark_test` ( > `id` INT(11) NULL, > `date` DATE NULL > ) > COLLATE='utf8_general_ci' > ENGINE=InnoDB > ; > INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01'); > {code} > Then I'm trying to read data - date '1970-01-01' is converted to int. This > makes data frame incompatible with its own schema. > {code} > df = > sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", > 'spark_test') > print(df.collect()) > df = sqlCtx.createDataFrame(df.rdd, df.schema) > [Row(id=1, date=0)] > --- > TypeError Traceback (most recent call last) > in () > 1 df = > sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4", > 'spark_test') > 2 print(df.collect()) > > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema) > /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, > schema, samplingRatio) > 402 > 403 if isinstance(data, RDD): > --> 404 rdd, schema = self._createFromRDD(data, schema, > samplingRatio) > 405 else: > 406 rdd, schema = self._createFromLocal(data, schema) > /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, > schema, samplingRatio) > 296 rows = rdd.take(10) > 297 for row in rows: > --> 298 _verify_type(row, schema) > 299 > 300 else: > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1152 "length of fields (%d)" % (len(obj), > len(dataType.fields))) >1153 for v, f in zip(obj, dataType.fields): > -> 1154 _verify_type(v, f.dataType) >1155 >1156 > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1136 # subclass of them can not be fromInternald in JVM >1137 if type(obj) not in _acceptable_types[_type]: > -> 1138 raise TypeError("%s can not accept object in type %s" % > (dataType, type(obj))) >1139 >1140 if isinstance(dataType, ArrayType): > TypeError: DateType can not accept object in type > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10392) Pyspark - Wrong DateType support on JDBC connection
[ https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10392: Assignee: Apache Spark > Pyspark - Wrong DateType support on JDBC connection > --- > > Key: SPARK-10392 > URL: https://issues.apache.org/jira/browse/SPARK-10392 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.1 >Reporter: Maciej Bryński >Assignee: Apache Spark > > I have following problem. > I created table. > {code} > CREATE TABLE `spark_test` ( > `id` INT(11) NULL, > `date` DATE NULL > ) > COLLATE='utf8_general_ci' > ENGINE=InnoDB > ; > INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01'); > {code} > Then I'm trying to read data - date '1970-01-01' is converted to int. This > makes data frame incompatible with its own schema. > {code} > df = > sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", > 'spark_test') > print(df.collect()) > df = sqlCtx.createDataFrame(df.rdd, df.schema) > [Row(id=1, date=0)] > --- > TypeError Traceback (most recent call last) > in () > 1 df = > sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4", > 'spark_test') > 2 print(df.collect()) > > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema) > /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, > schema, samplingRatio) > 402 > 403 if isinstance(data, RDD): > --> 404 rdd, schema = self._createFromRDD(data, schema, > samplingRatio) > 405 else: > 406 rdd, schema = self._createFromLocal(data, schema) > /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, > schema, samplingRatio) > 296 rows = rdd.take(10) > 297 for row in rows: > --> 298 _verify_type(row, schema) > 299 > 300 else: > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1152 "length of fields (%d)" % (len(obj), > len(dataType.fields))) >1153 for v, f in zip(obj, dataType.fields): > -> 1154 _verify_type(v, f.dataType) >1155 >1156 > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1136 # subclass of them can not be fromInternald in JVM >1137 if type(obj) not in _acceptable_types[_type]: > -> 1138 raise TypeError("%s can not accept object in type %s" % > (dataType, type(obj))) >1139 >1140 if isinstance(dataType, ArrayType): > TypeError: DateType can not accept object in type > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10392) Pyspark - Wrong DateType support on JDBC connection
[ https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725404#comment-14725404 ] Maciej Bryński commented on SPARK-10392: {code} class DateType(AtomicType): """Date (datetime.date) data type. """ def fromInternal(self, v): *return v* and datetime.date.fromordinal(v + self.EPOCH_ORDINAL) {code} Yep, With v = 0 there is no conversion to date. > Pyspark - Wrong DateType support on JDBC connection > --- > > Key: SPARK-10392 > URL: https://issues.apache.org/jira/browse/SPARK-10392 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.1 >Reporter: Maciej Bryński > > I have following problem. > I created table. > {code} > CREATE TABLE `spark_test` ( > `id` INT(11) NULL, > `date` DATE NULL > ) > COLLATE='utf8_general_ci' > ENGINE=InnoDB > ; > INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01'); > {code} > Then I'm trying to read data - date '1970-01-01' is converted to int. This > makes data frame incompatible with its own schema. > {code} > df = > sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", > 'spark_test') > print(df.collect()) > df = sqlCtx.createDataFrame(df.rdd, df.schema) > [Row(id=1, date=0)] > --- > TypeError Traceback (most recent call last) > in () > 1 df = > sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4", > 'spark_test') > 2 print(df.collect()) > > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema) > /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, > schema, samplingRatio) > 402 > 403 if isinstance(data, RDD): > --> 404 rdd, schema = self._createFromRDD(data, schema, > samplingRatio) > 405 else: > 406 rdd, schema = self._createFromLocal(data, schema) > /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, > schema, samplingRatio) > 296 rows = rdd.take(10) > 297 for row in rows: > --> 298 _verify_type(row, schema) > 299 > 300 else: > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1152 "length of fields (%d)" % (len(obj), > len(dataType.fields))) >1153 for v, f in zip(obj, dataType.fields): > -> 1154 _verify_type(v, f.dataType) >1155 >1156 > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1136 # subclass of them can not be fromInternald in JVM >1137 if type(obj) not in _acceptable_types[_type]: > -> 1138 raise TypeError("%s can not accept object in type %s" % > (dataType, type(obj))) >1139 >1140 if isinstance(dataType, ArrayType): > TypeError: DateType can not accept object in type > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10392) Pyspark - Wrong DateType support on JDBC connection
[ https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10392: Assignee: (was: Apache Spark) > Pyspark - Wrong DateType support on JDBC connection > --- > > Key: SPARK-10392 > URL: https://issues.apache.org/jira/browse/SPARK-10392 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.1 >Reporter: Maciej Bryński > > I have following problem. > I created table. > {code} > CREATE TABLE `spark_test` ( > `id` INT(11) NULL, > `date` DATE NULL > ) > COLLATE='utf8_general_ci' > ENGINE=InnoDB > ; > INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01'); > {code} > Then I'm trying to read data - date '1970-01-01' is converted to int. This > makes data frame incompatible with its own schema. > {code} > df = > sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", > 'spark_test') > print(df.collect()) > df = sqlCtx.createDataFrame(df.rdd, df.schema) > [Row(id=1, date=0)] > --- > TypeError Traceback (most recent call last) > in () > 1 df = > sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4", > 'spark_test') > 2 print(df.collect()) > > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema) > /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, > schema, samplingRatio) > 402 > 403 if isinstance(data, RDD): > --> 404 rdd, schema = self._createFromRDD(data, schema, > samplingRatio) > 405 else: > 406 rdd, schema = self._createFromLocal(data, schema) > /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, > schema, samplingRatio) > 296 rows = rdd.take(10) > 297 for row in rows: > --> 298 _verify_type(row, schema) > 299 > 300 else: > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1152 "length of fields (%d)" % (len(obj), > len(dataType.fields))) >1153 for v, f in zip(obj, dataType.fields): > -> 1154 _verify_type(v, f.dataType) >1155 >1156 > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1136 # subclass of them can not be fromInternald in JVM >1137 if type(obj) not in _acceptable_types[_type]: > -> 1138 raise TypeError("%s can not accept object in type %s" % > (dataType, type(obj))) >1139 >1140 if isinstance(dataType, ArrayType): > TypeError: DateType can not accept object in type > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10392) Pyspark - Wrong DateType support on JDBC connection
[ https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725404#comment-14725404 ] Maciej Bryński edited comment on SPARK-10392 at 9/1/15 1:38 PM: {code} class DateType(AtomicType): """Date (datetime.date) data type. """ def fromInternal(self, v): return v and datetime.date.fromordinal(v + self.EPOCH_ORDINAL) {code} Yep, With v = 0 there is no conversion to date. was (Author: maver1ck): {code} class DateType(AtomicType): """Date (datetime.date) data type. """ def fromInternal(self, v): *return v* and datetime.date.fromordinal(v + self.EPOCH_ORDINAL) {code} Yep, With v = 0 there is no conversion to date. > Pyspark - Wrong DateType support on JDBC connection > --- > > Key: SPARK-10392 > URL: https://issues.apache.org/jira/browse/SPARK-10392 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.1 >Reporter: Maciej Bryński > > I have following problem. > I created table. > {code} > CREATE TABLE `spark_test` ( > `id` INT(11) NULL, > `date` DATE NULL > ) > COLLATE='utf8_general_ci' > ENGINE=InnoDB > ; > INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01'); > {code} > Then I'm trying to read data - date '1970-01-01' is converted to int. This > makes data frame incompatible with its own schema. > {code} > df = > sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", > 'spark_test') > print(df.collect()) > df = sqlCtx.createDataFrame(df.rdd, df.schema) > [Row(id=1, date=0)] > --- > TypeError Traceback (most recent call last) > in () > 1 df = > sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4", > 'spark_test') > 2 print(df.collect()) > > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema) > /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, > schema, samplingRatio) > 402 > 403 if isinstance(data, RDD): > --> 404 rdd, schema = self._createFromRDD(data, schema, > samplingRatio) > 405 else: > 406 rdd, schema = self._createFromLocal(data, schema) > /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, > schema, samplingRatio) > 296 rows = rdd.take(10) > 297 for row in rows: > --> 298 _verify_type(row, schema) > 299 > 300 else: > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1152 "length of fields (%d)" % (len(obj), > len(dataType.fields))) >1153 for v, f in zip(obj, dataType.fields): > -> 1154 _verify_type(v, f.dataType) >1155 >1156 > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1136 # subclass of them can not be fromInternald in JVM >1137 if type(obj) not in _acceptable_types[_type]: > -> 1138 raise TypeError("%s can not accept object in type %s" % > (dataType, type(obj))) >1139 >1140 if isinstance(dataType, ArrayType): > TypeError: DateType can not accept object in type > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10392) Pyspark - Wrong DateType support on JDBC connection
[ https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725403#comment-14725403 ] Apache Spark commented on SPARK-10392: -- User '0x0FFF' has created a pull request for this issue: https://github.com/apache/spark/pull/8556 > Pyspark - Wrong DateType support on JDBC connection > --- > > Key: SPARK-10392 > URL: https://issues.apache.org/jira/browse/SPARK-10392 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.1 >Reporter: Maciej Bryński > > I have following problem. > I created table. > {code} > CREATE TABLE `spark_test` ( > `id` INT(11) NULL, > `date` DATE NULL > ) > COLLATE='utf8_general_ci' > ENGINE=InnoDB > ; > INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01'); > {code} > Then I'm trying to read data - date '1970-01-01' is converted to int. This > makes data frame incompatible with its own schema. > {code} > df = > sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", > 'spark_test') > print(df.collect()) > df = sqlCtx.createDataFrame(df.rdd, df.schema) > [Row(id=1, date=0)] > --- > TypeError Traceback (most recent call last) > in () > 1 df = > sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4", > 'spark_test') > 2 print(df.collect()) > > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema) > /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, > schema, samplingRatio) > 402 > 403 if isinstance(data, RDD): > --> 404 rdd, schema = self._createFromRDD(data, schema, > samplingRatio) > 405 else: > 406 rdd, schema = self._createFromLocal(data, schema) > /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, > schema, samplingRatio) > 296 rows = rdd.take(10) > 297 for row in rows: > --> 298 _verify_type(row, schema) > 299 > 300 else: > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1152 "length of fields (%d)" % (len(obj), > len(dataType.fields))) >1153 for v, f in zip(obj, dataType.fields): > -> 1154 _verify_type(v, f.dataType) >1155 >1156 > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1136 # subclass of them can not be fromInternald in JVM >1137 if type(obj) not in _acceptable_types[_type]: > -> 1138 raise TypeError("%s can not accept object in type %s" % > (dataType, type(obj))) >1139 >1140 if isinstance(dataType, ArrayType): > TypeError: DateType can not accept object in type > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-10392) Pyspark - Wrong DateType support on JDBC connection
[ https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-10392: --- Comment: was deleted (was: {code} class DateType(AtomicType): """Date (datetime.date) data type. """ def fromInternal(self, v): return v and datetime.date.fromordinal(v + self.EPOCH_ORDINAL) {code} Yep, With v = 0 there is no conversion to date.) > Pyspark - Wrong DateType support on JDBC connection > --- > > Key: SPARK-10392 > URL: https://issues.apache.org/jira/browse/SPARK-10392 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.1 >Reporter: Maciej Bryński > > I have following problem. > I created table. > {code} > CREATE TABLE `spark_test` ( > `id` INT(11) NULL, > `date` DATE NULL > ) > COLLATE='utf8_general_ci' > ENGINE=InnoDB > ; > INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01'); > {code} > Then I'm trying to read data - date '1970-01-01' is converted to int. This > makes data frame incompatible with its own schema. > {code} > df = > sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", > 'spark_test') > print(df.collect()) > df = sqlCtx.createDataFrame(df.rdd, df.schema) > [Row(id=1, date=0)] > --- > TypeError Traceback (most recent call last) > in () > 1 df = > sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4", > 'spark_test') > 2 print(df.collect()) > > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema) > /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, > schema, samplingRatio) > 402 > 403 if isinstance(data, RDD): > --> 404 rdd, schema = self._createFromRDD(data, schema, > samplingRatio) > 405 else: > 406 rdd, schema = self._createFromLocal(data, schema) > /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, > schema, samplingRatio) > 296 rows = rdd.take(10) > 297 for row in rows: > --> 298 _verify_type(row, schema) > 299 > 300 else: > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1152 "length of fields (%d)" % (len(obj), > len(dataType.fields))) >1153 for v, f in zip(obj, dataType.fields): > -> 1154 _verify_type(v, f.dataType) >1155 >1156 > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1136 # subclass of them can not be fromInternald in JVM >1137 if type(obj) not in _acceptable_types[_type]: > -> 1138 raise TypeError("%s can not accept object in type %s" % > (dataType, type(obj))) >1139 >1140 if isinstance(dataType, ArrayType): > TypeError: DateType can not accept object in type > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10398) Migrate Spark download page to use new lua mirroring scripts
Luciano Resende created SPARK-10398: --- Summary: Migrate Spark download page to use new lua mirroring scripts Key: SPARK-10398 URL: https://issues.apache.org/jira/browse/SPARK-10398 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Luciano Resende >From infra team : If you refer to www.apache.org/dyn/closer.cgi, please refer to www.apache.org/dyn/closer.lua instead from now on. Any non-conforming CGI scripts are no longer enabled, and are all rewritten to go to our new mirror system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
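For concreteness, the migration only changes the mirror-resolver script name in download URLs; a hedged sketch of the rewrite (the example artifact path is hypothetical):
{code}
# Hypothetical example link; only the closer script name changes.
old = "https://www.apache.org/dyn/closer.cgi/spark/spark-1.5.0/spark-1.5.0.tgz"
new = old.replace("/dyn/closer.cgi/", "/dyn/closer.lua/")
print(new)  # https://www.apache.org/dyn/closer.lua/spark/spark-1.5.0/spark-1.5.0.tgz
{code}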
[jira] [Updated] (SPARK-10398) Migrate Spark download page to use new lua mirroring scripts
[ https://issues.apache.org/jira/browse/SPARK-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10398: -- Assignee: Sean Owen Priority: Minor (was: Major) Issue Type: Task (was: Bug) No problem, pushing the change now. > Migrate Spark download page to use new lua mirroring scripts > > > Key: SPARK-10398 > URL: https://issues.apache.org/jira/browse/SPARK-10398 > Project: Spark > Issue Type: Task > Components: Project Infra >Reporter: Luciano Resende >Assignee: Sean Owen >Priority: Minor > Fix For: 1.5.0 > > > From infra team : > If you refer to www.apache.org/dyn/closer.cgi, please refer to > www.apache.org/dyn/closer.lua instead from now on. > Any non-conforming CGI scripts are no longer enabled, and are all > rewritten to go to our new mirror system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10398) Migrate Spark download page to use new lua mirroring scripts
[ https://issues.apache.org/jira/browse/SPARK-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10398. --- Resolution: Fixed Fix Version/s: 1.5.0 > Migrate Spark download page to use new lua mirroring scripts > > > Key: SPARK-10398 > URL: https://issues.apache.org/jira/browse/SPARK-10398 > Project: Spark > Issue Type: Task > Components: Project Infra >Reporter: Luciano Resende >Assignee: Sean Owen >Priority: Minor > Fix For: 1.5.0 > > > From infra team : > If you refer to www.apache.org/dyn/closer.cgi, please refer to > www.apache.org/dyn/closer.lua instead from now on. > Any non-conforming CGI scripts are no longer enabled, and are all > rewritten to go to our new mirror system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10199) Avoid using reflections for parquet model save
[ https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725211#comment-14725211 ] Vinod KC edited comment on SPARK-10199 at 9/1/15 2:15 PM: -- [~mengxr] I've measured the overhead of reflection in the save/load operations; please refer to the results at this link: https://github.com/vinodkc/xtique/blob/master/overhead_duetoReflection.csv I've also measured the performance gain in the save/load methods without reflection, averaged over 5 test executions. Please refer to the performance gain % at these two links: https://github.com/vinodkc/xtique/blob/master/performance_Benchmark_save.csv https://github.com/vinodkc/xtique/blob/master/performance_Benchmark_load.csv was (Author: vinodkc): [~mengxr] I've measured the overhead of reflexion in save/load operation, please refer the results in this link https://github.com/vinodkc/xtique/blob/master/overhead_duetoReflection.csv Also I've measured the performance gain in save/load methods without reflexion after taking average of 5 times test executions Please refer the performance gain % in this two links https://github.com/vinodkc/xtique/blob/master/performance_Benchmark_save.csv https://github.com/vinodkc/xtique/blob/master/performance_Benchmark_load.csv > Avoid using reflections for parquet model save > -- > > Key: SPARK-10199 > URL: https://issues.apache.org/jira/browse/SPARK-10199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Feynman Liang >Priority: Minor > > These items are not high priority since the overhead of writing to Parquet is > much greater than that of runtime reflection. > Multiple model save/load in MLlib use case classes to infer a schema for the > data frame saved to Parquet. However, inferring a schema from case classes or > tuples uses [runtime > reflection|https://github.com/apache/spark/blob/d7b4c095271c36fcc7f9ded267ecf5ec66fac803/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L361] > which is unnecessary since the types are already known at the time `save` is > called. > It would be better to just specify the schema for the data frame directly > using {{sqlContext.createDataFrame(dataRDD, schema)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
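To make the suggested approach concrete, here is a hedged PySpark sketch of the difference between an inferred and an explicitly specified schema (the RDD contents and column names are made up for illustration; the actual MLlib save code is Scala, but the same createDataFrame(rdd, schema) pattern applies):
{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

sc = SparkContext(appName="schema-example")
sqlContext = SQLContext(sc)

# Stand-in for model data whose types are already known at save time.
data_rdd = sc.parallelize([(0, 1.5), (1, -0.25)])

# Schema inference: samples rows and reflects on their runtime types.
df_inferred = sqlContext.createDataFrame(data_rdd)

# Explicit schema, as the issue suggests: no inference needed.
schema = StructType([
    StructField("featureIndex", IntegerType(), False),
    StructField("value", DoubleType(), False),
])
df_explicit = sqlContext.createDataFrame(data_rdd, schema)
print(df_explicit.schema)
{code}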
[jira] [Created] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)
Paul Weiss created SPARK-10399: -- Summary: Off Heap Memory Access for non-JVM libraries (C++) Key: SPARK-10399 URL: https://issues.apache.org/jira/browse/SPARK-10399 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Paul Weiss *Summary* Provide direct off-heap memory access to an external non-JVM program, such as a C++ library, within the running Spark JVM/executor. As Spark moves to storing all data in off-heap memory, it makes sense to provide access points to that memory for non-JVM programs. *Assumptions* * Zero copies will be made during the call into the non-JVM library * Access into non-JVM libraries will be accomplished via JNI * A generic JNI interface will be created so that developers will not need to deal with the raw JNI call * C++ will be the initial target non-JVM use case * Memory management will remain on the JVM/Spark side * The API from C++ will be similar to DataFrames as much as feasible and NOT require expert knowledge of JNI * Data organization and layout will support complex (multi-type, nested, etc.) types *Design* * Initially Spark JVM -> non-JVM will be supported * Creating an embedded JVM with Spark running from a non-JVM program is initially out of scope *Technical* * GetDirectBufferAddress is the JNI call used to access a byte buffer without a copy -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10375) Setting the driver memory with SparkConf().set("spark.driver.memory","1g") does not work
[ https://issues.apache.org/jira/browse/SPARK-10375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725473#comment-14725473 ] Alex Rovner commented on SPARK-10375: - May I suggest throwing an exception when certain properties are set that will not take effect? (spark.driver.*) > Setting the driver memory with SparkConf().set("spark.driver.memory","1g") > does not work > > > Key: SPARK-10375 > URL: https://issues.apache.org/jira/browse/SPARK-10375 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.0 > Environment: Running with yarn >Reporter: Thomas >Priority: Minor > > When running pyspark 1.3.0 with yarn, the following code has no effect: > pyspark.SparkConf().set("spark.driver.memory","1g") > The Environment tab in yarn shows that the driver has 1g, however, the > Executors tab only shows 512 M (the default value) for the driver memory. > This issue goes away when the driver memory is specified via the command line > (i.e. --driver-memory 1g) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
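For readers hitting this, a brief hedged sketch of why the property is ignored in this scenario: the driver JVM is already running by the time the Python SparkConf is applied, so JVM-sizing properties such as spark.driver.memory have to be supplied at launch time (illustration only, not the proposed fix):
{code}
from pyspark import SparkConf, SparkContext

# Too late for the driver's own heap in this scenario: the JVM hosting the driver
# has already been launched, so only the reported config value (Environment tab)
# changes, not the actual driver memory (Executors tab), per the report above.
conf = SparkConf().set("spark.driver.memory", "1g")
sc = SparkContext(conf=conf)

# Effective alternatives size the driver before launch, e.g.:
#   spark-submit --driver-memory 1g my_app.py
# or set spark.driver.memory in conf/spark-defaults.conf
{code}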
[jira] [Updated] (SPARK-10375) Setting the driver memory with SparkConf().set("spark.driver.memory","1g") does not work
[ https://issues.apache.org/jira/browse/SPARK-10375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10375: -- Issue Type: Improvement (was: Bug) > Setting the driver memory with SparkConf().set("spark.driver.memory","1g") > does not work > > > Key: SPARK-10375 > URL: https://issues.apache.org/jira/browse/SPARK-10375 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.3.0 > Environment: Running with yarn >Reporter: Thomas >Priority: Minor > > When running pyspark 1.3.0 with yarn, the following code has no effect: > pyspark.SparkConf().set("spark.driver.memory","1g") > The Environment tab in yarn shows that the driver has 1g, however, the > Executors tab only shows 512 M (the default value) for the driver memory. > This issue goes away when the driver memory is specified via the command line > (i.e. --driver-memory 1g) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10314) [CORE] RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception when parallelism is bigger than data split size
[ https://issues.apache.org/jira/browse/SPARK-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725530#comment-14725530 ] Xiaoyu Wang commented on SPARK-10314: - Yes,Any questions with the pull request? Do you need me to resubmit a pull request for master branch? The previous pull request is submit to branch-1.4! > [CORE]RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception > when parallelism is big than data split size > > > Key: SPARK-10314 > URL: https://issues.apache.org/jira/browse/SPARK-10314 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.4.1 > Environment: Spark 1.4.1,Hadoop 2.6.0,Tachyon 0.6.4 >Reporter: Xiaoyu Wang >Priority: Minor > > RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception when > parallelism is big than data split size > {code} > val rdd = sc.parallelize(List(1, 2),2) > rdd.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP) > rdd.count() > {code} > is ok. > {code} > val rdd = sc.parallelize(List(1, 2),3) > rdd.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP) > rdd.count() > {code} > got exceptoin: > {noformat} > 15/08/27 17:53:07 INFO SparkContext: Starting job: count at :24 > 15/08/27 17:53:07 INFO DAGScheduler: Got job 0 (count at :24) with 3 > output partitions (allowLocal=false) > 15/08/27 17:53:07 INFO DAGScheduler: Final stage: ResultStage 0(count at > :24) > 15/08/27 17:53:07 INFO DAGScheduler: Parents of final stage: List() > 15/08/27 17:53:07 INFO DAGScheduler: Missing parents: List() > 15/08/27 17:53:07 INFO DAGScheduler: Submitting ResultStage 0 > (ParallelCollectionRDD[0] at parallelize at :21), which has no > missing parents > 15/08/27 17:53:07 INFO MemoryStore: ensureFreeSpace(1096) called with > curMem=0, maxMem=741196431 > 15/08/27 17:53:07 INFO MemoryStore: Block broadcast_0 stored as values in > memory (estimated size 1096.0 B, free 706.9 MB) > 15/08/27 17:53:07 INFO MemoryStore: ensureFreeSpace(788) called with > curMem=1096, maxMem=741196431 > 15/08/27 17:53:07 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes > in memory (estimated size 788.0 B, free 706.9 MB) > 15/08/27 17:53:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on localhost:43776 (size: 788.0 B, free: 706.9 MB) > 15/08/27 17:53:07 INFO SparkContext: Created broadcast 0 from broadcast at > DAGScheduler.scala:874 > 15/08/27 17:53:07 INFO DAGScheduler: Submitting 3 missing tasks from > ResultStage 0 (ParallelCollectionRDD[0] at parallelize at :21) > 15/08/27 17:53:07 INFO TaskSchedulerImpl: Adding task set 0.0 with 3 tasks > 15/08/27 17:53:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, > localhost, PROCESS_LOCAL, 1269 bytes) > 15/08/27 17:53:07 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, > localhost, PROCESS_LOCAL, 1270 bytes) > 15/08/27 17:53:07 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, > localhost, PROCESS_LOCAL, 1270 bytes) > 15/08/27 17:53:07 INFO Executor: Running task 2.0 in stage 0.0 (TID 2) > 15/08/27 17:53:07 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) > 15/08/27 17:53:07 INFO Executor: Running task 1.0 in stage 0.0 (TID 1) > 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_2 not found, computing it > 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_1 not found, computing it > 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_0 not found, computing it > 15/08/27 17:53:07 INFO ExternalBlockStore: ExternalBlockStore started > 15/08/27 17:53:08 WARN : 
tachyon.home is not set. Using > /mnt/tachyon_default_home as the default value. > 15/08/27 17:53:08 INFO : Tachyon client (version 0.6.4) is trying to connect > master @ localhost/127.0.0.1:19998 > 15/08/27 17:53:08 INFO : User registered at the master > localhost/127.0.0.1:19998 got UserId 109 > 15/08/27 17:53:08 INFO TachyonBlockManager: Created tachyon directory at > /spark/spark-c6ec419f-7c7d-48a6-8448-c2431e761ea5/driver/spark-tachyon-20150827175308-6aa5 > 15/08/27 17:53:08 INFO : Trying to get local worker host : localhost > 15/08/27 17:53:08 INFO : Connecting local worker @ localhost/127.0.0.1:29998 > 15/08/27 17:53:08 INFO : Folder /mnt/ramdisk/tachyonworker/users/109 was > created! > 15/08/27 17:53:08 INFO : /mnt/ramdisk/tachyonworker/users/109/4386235351040 > was created! > 15/08/27 17:53:08 INFO : /mnt/ramdisk/tachyonworker/users/109/4388382834688 > was created! > 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_0 on ExternalBlockStore > on localhost:43776 (size: 0.0 B) > 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_1 on ExternalBlockStore > on localhost:43776 (size: 2.0 B) > 15/08/27 17:53:08 INFO BlockManagerInfo: Added r
[jira] [Commented] (SPARK-2666) when task is FetchFailed cancel running tasks of failedStage
[ https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725552#comment-14725552 ] Imran Rashid commented on SPARK-2666: - I'm copying [~kayousterhout]'s comment from the PR here for discussion: bq. My understanding is that it can help to let the remaining tasks run -- because they may hit Fetch failures from different map outputs than the original fetch failure, which will lead to the DAGScheduler to more quickly reschedule all of the failed tasks. For example, if an executor failed and had multiple map outputs on it, the first Fetch failure will only tell us about one of the map outputs being missing, and it's helpful to learn about all of them before we resubmit the earlier stage. Did you already think about this / am I misunderstanding the issue? Things may have changed in the meantime, but I'm pretty sure that now, when there is a fetch failure, Spark assumes it has lost *all* of the map output for that host. It's a bit confusing -- it seems we first only remove [the one map output with the failure|https://github.com/apache/spark/blob/391e6be0ae883f3ea0fab79463eb8b618af79afb/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1134] but then we remove all map outputs in [{{handleExecutorLost}} | https://github.com/apache/spark/blob/391e6be0ae883f3ea0fab79463eb8b618af79afb/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1184]. I suppose it could still be useful to run the remaining tasks, as they may discover *another* executor that has died, but I don't think it's worth it just for that, right? Elsewhere we've also discussed always killing all tasks as soon as the {{TaskSetManager}} is marked as a zombie; see https://github.com/squito/spark/pull/4. I'm particularly interested because this is relevant to SPARK-10370. In that case, there wouldn't be any benefit to leaving tasks running after marking the stage as a zombie. If we do want to cancel all tasks as soon as we mark a stage as a zombie, then I'd prefer we go the route of making {{isZombie}} private and make task cancellation part of {{markAsZombie}}, to make the code easier to follow and make sure we always cancel tasks. Is my understanding correct? Other opinions on the right approach here? > when task is FetchFailed cancel running tasks of failedStage > > > Key: SPARK-2666 > URL: https://issues.apache.org/jira/browse/SPARK-2666 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Lianhui Wang > > in DAGScheduler's handleTaskCompletion,when reason of failed task is > FetchFailed, cancel running tasks of failedStage before add failedStage to > failedStages queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10370) After a stage's map outputs are registered, all running attempts should be marked as zombies
[ https://issues.apache.org/jira/browse/SPARK-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-10370: - Description: Follow up to SPARK-5259. During stage retry, its possible for a stage to "complete" by registering all its map output and starting the downstream stages, before the latest task set has completed. This will result in the earlier task set continuing to submit tasks, that are both unnecessary and increase the chance of hitting SPARK-8029. Spark should mark all tasks sets for a stage as zombie as soon as its map output is registered. Note that this involves coordination between the various scheduler components ({{DAGScheduler}} and {{TaskSetManager}} at least) which isn't easily testable with the current setup. To be clear, this is *not* just referring to canceling running tasks (which may be taken care of by SPARK-2666). This is to make sure that the taskset is marked as a zombie, to prevent submitting *new* tasks from this task set. was: Follow up to SPARK-5259. During stage retry, its possible for a stage to "complete" by registering all its map output and starting the downstream stages, before the latest task set has completed. This will result in the earlier task set continuing to submit tasks, that are both unnecessary and increase the chance of hitting SPARK-8029. Spark should mark all tasks sets for a stage as zombie as soon as its map output is registered. Note that this involves coordination between the various scheduler components ({{DAGScheduler}} and {{TaskSetManager}} at least) which isn't easily testable with the current setup. > After a stages map outputs are registered, all running attempts should be > marked as zombies > --- > > Key: SPARK-10370 > URL: https://issues.apache.org/jira/browse/SPARK-10370 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: Imran Rashid > > Follow up to SPARK-5259. During stage retry, its possible for a stage to > "complete" by registering all its map output and starting the downstream > stages, before the latest task set has completed. This will result in the > earlier task set continuing to submit tasks, that are both unnecessary and > increase the chance of hitting SPARK-8029. > Spark should mark all tasks sets for a stage as zombie as soon as its map > output is registered. Note that this involves coordination between the > various scheduler components ({{DAGScheduler}} and {{TaskSetManager}} at > least) which isn't easily testable with the current setup. > To be clear, this is *not* just referring to canceling running tasks (which > may be taken care of by SPARK-2666). This is to make sure that the taskset > is marked as a zombie, to prevent submitting *new* tasks from this task set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-10398) Migrate Spark download page to use new lua mirroring scripts
[ https://issues.apache.org/jira/browse/SPARK-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luciano Resende reopened SPARK-10398: - There are few other places where the closer.cgi is referenced. > Migrate Spark download page to use new lua mirroring scripts > > > Key: SPARK-10398 > URL: https://issues.apache.org/jira/browse/SPARK-10398 > Project: Spark > Issue Type: Task > Components: Project Infra >Reporter: Luciano Resende >Assignee: Sean Owen >Priority: Minor > Fix For: 1.5.0 > > > From infra team : > If you refer to www.apache.org/dyn/closer.cgi, please refer to > www.apache.org/dyn/closer.lua instead from now on. > Any non-conforming CGI scripts are no longer enabled, and are all > rewritten to go to our new mirror system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10398) Migrate Spark download page to use new lua mirroring scripts
[ https://issues.apache.org/jira/browse/SPARK-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luciano Resende updated SPARK-10398: Attachment: SPARK-10398 This patch handles other download links referenced on the Spark docs as well. > Migrate Spark download page to use new lua mirroring scripts > > > Key: SPARK-10398 > URL: https://issues.apache.org/jira/browse/SPARK-10398 > Project: Spark > Issue Type: Task > Components: Project Infra >Reporter: Luciano Resende >Assignee: Sean Owen >Priority: Minor > Fix For: 1.5.0 > > Attachments: SPARK-10398 > > > From infra team : > If you refer to www.apache.org/dyn/closer.cgi, please refer to > www.apache.org/dyn/closer.lua instead from now on. > Any non-conforming CGI scripts are no longer enabled, and are all > rewritten to go to our new mirror system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9988) Create local (external) sort operator
[ https://issues.apache.org/jira/browse/SPARK-9988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725589#comment-14725589 ] Shixiong Zhu commented on SPARK-9988: - {{ExternalSorter}} is coupled with {{SparkEnv}}, {{ShuffleMemoryManager}} and {{DiskBlockManager}}, and finally depends on {{SparkContext}}. [~rxin] any thoughts to avoid depending on {{SparkContext}}? I'm thinking that at least we need something like {{ShuffleMemoryManager}} and {{DiskBlockManager}}. > Create local (external) sort operator > - > > Key: SPARK-9988 > URL: https://issues.apache.org/jira/browse/SPARK-9988 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > Similar to the TungstenSort. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10398) Migrate Spark download page to use new lua mirroring scripts
[ https://issues.apache.org/jira/browse/SPARK-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725592#comment-14725592 ] Sean Owen commented on SPARK-10398: --- Good catch, there's another use in the project docs themselves, not just the Apache site's download link. We use PRs rather than patches (https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) but I can easily do that. Editing the old doc releases gives me pause since they'd then not be the same docs you'd get by generating docs from the old release tag. However I suspect it matters little either way and so should just be fixed. > Migrate Spark download page to use new lua mirroring scripts > > > Key: SPARK-10398 > URL: https://issues.apache.org/jira/browse/SPARK-10398 > Project: Spark > Issue Type: Task > Components: Project Infra >Reporter: Luciano Resende >Assignee: Sean Owen >Priority: Minor > Fix For: 1.5.0 > > Attachments: SPARK-10398 > > > From infra team : > If you refer to www.apache.org/dyn/closer.cgi, please refer to > www.apache.org/dyn/closer.lua instead from now on. > Any non-conforming CGI scripts are no longer enabled, and are all > rewritten to go to our new mirror system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10400) Rename or deprecate SQL option "spark.sql.parquet.followParquetFormatSpec"
Cheng Lian created SPARK-10400: -- Summary: Rename or deprecate SQL option "spark.sql.parquet.followParquetFormatSpec" Key: SPARK-10400 URL: https://issues.apache.org/jira/browse/SPARK-10400 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor We introduced SQL option "spark.sql.parquet.followParquetFormatSpec" while working on implementing Parquet backwards-compatibility rules in SPARK-6777. It indicates whether we should use legacy Parquet format adopted by Spark 1.4 and prior versions or the standard format defined in parquet-format spec. However, the name of this option is somewhat confusing, because it's not super intuitive why we shouldn't follow the spec. Would be nice to rename it to "spark.sql.parquet.writeLegacyFormat" and invert its default value (they have opposite meanings). Note that this option is not "public" ({{isPublic}} is false). At the moment of writing, 1.5 RC3 has already been cut. If we can't make this one into 1.5, we can deprecate the old option with the new one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
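As a hedged illustration of the proposal (the new key is only the proposed name, not an existing option; the polarity of the two settings is inverted as described above):
{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-option-example")
sqlContext = SQLContext(sc)

# Current (non-public) option: "true" means write Parquet files per the parquet-format spec.
sqlContext.setConf("spark.sql.parquet.followParquetFormatSpec", "true")

# Proposed replacement with the opposite meaning (hypothetical until the rename lands):
# sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "false")
{code}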
[jira] [Commented] (SPARK-2666) when task is FetchFailed cancel running tasks of failedStage
[ https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725615#comment-14725615 ] Imran Rashid commented on SPARK-2666: - I realized I didn't very clearly spell out one of my main points: I am proposing widening this issue to not be about only {{FetchFailed}}. I think instead we should consider changing this issue to refactor the code to unify "zombification" and cancelling tasks. In general I know that smaller changes are better, especially related to the scheduler, but in this case I think we'll be able to improve the code by tackling them together. > when task is FetchFailed cancel running tasks of failedStage > > > Key: SPARK-2666 > URL: https://issues.apache.org/jira/browse/SPARK-2666 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Lianhui Wang > > in DAGScheduler's handleTaskCompletion,when reason of failed task is > FetchFailed, cancel running tasks of failedStage before add failedStage to > failedStages queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10398) Migrate Spark download page to use new lua mirroring scripts
[ https://issues.apache.org/jira/browse/SPARK-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725619#comment-14725619 ] Luciano Resende commented on SPARK-10398: - I can submit a PR for the docs as well, let me look into those. > Migrate Spark download page to use new lua mirroring scripts > > > Key: SPARK-10398 > URL: https://issues.apache.org/jira/browse/SPARK-10398 > Project: Spark > Issue Type: Task > Components: Project Infra >Reporter: Luciano Resende >Assignee: Sean Owen >Priority: Minor > Fix For: 1.5.0 > > Attachments: SPARK-10398 > > > From infra team : > If you refer to www.apache.org/dyn/closer.cgi, please refer to > www.apache.org/dyn/closer.lua instead from now on. > Any non-conforming CGI scripts are no longer enabled, and are all > rewritten to go to our new mirror system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10398) Migrate Spark download page to use new lua mirroring scripts
[ https://issues.apache.org/jira/browse/SPARK-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725622#comment-14725622 ] Apache Spark commented on SPARK-10398: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/8557 > Migrate Spark download page to use new lua mirroring scripts > > > Key: SPARK-10398 > URL: https://issues.apache.org/jira/browse/SPARK-10398 > Project: Spark > Issue Type: Task > Components: Project Infra >Reporter: Luciano Resende >Assignee: Sean Owen >Priority: Minor > Fix For: 1.5.0 > > Attachments: SPARK-10398 > > > From infra team : > If you refer to www.apache.org/dyn/closer.cgi, please refer to > www.apache.org/dyn/closer.lua instead from now on. > Any non-conforming CGI scripts are no longer enabled, and are all > rewritten to go to our new mirror system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10398) Migrate Spark download page to use new lua mirroring scripts
[ https://issues.apache.org/jira/browse/SPARK-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725632#comment-14725632 ] Apache Spark commented on SPARK-10398: -- User 'lresende' has created a pull request for this issue: https://github.com/apache/spark/pull/8558 > Migrate Spark download page to use new lua mirroring scripts > > > Key: SPARK-10398 > URL: https://issues.apache.org/jira/browse/SPARK-10398 > Project: Spark > Issue Type: Task > Components: Project Infra >Reporter: Luciano Resende >Assignee: Sean Owen >Priority: Minor > Fix For: 1.5.0 > > Attachments: SPARK-10398 > > > From infra team : > If you refer to www.apache.org/dyn/closer.cgi, please refer to > www.apache.org/dyn/closer.lua instead from now on. > Any non-conforming CGI scripts are no longer enabled, and are all > rewritten to go to our new mirror system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10398) Migrate Spark download page to use new lua mirroring scripts
[ https://issues.apache.org/jira/browse/SPARK-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10398: -- Assignee: Luciano Resende (was: Sean Owen) > Migrate Spark download page to use new lua mirroring scripts > > > Key: SPARK-10398 > URL: https://issues.apache.org/jira/browse/SPARK-10398 > Project: Spark > Issue Type: Task > Components: Project Infra >Reporter: Luciano Resende >Assignee: Luciano Resende >Priority: Minor > Fix For: 1.5.0 > > Attachments: SPARK-10398 > > > From infra team : > If you refer to www.apache.org/dyn/closer.cgi, please refer to > www.apache.org/dyn/closer.lua instead from now on. > Any non-conforming CGI scripts are no longer enabled, and are all > rewritten to go to our new mirror system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9008) Stop and remove driver from supervised mode in spark-master interface
[ https://issues.apache.org/jira/browse/SPARK-9008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725637#comment-14725637 ] Alberto Miorin commented on SPARK-9008: --- I have the same problem, but with Spark Mesos cluster mode. I tried spark-submit --kill, but the driver is always restarted by the dispatcher. I think there should be a spark-submit --unsupervise subcommand. > Stop and remove driver from supervised mode in spark-master interface > - > > Key: SPARK-9008 > URL: https://issues.apache.org/jira/browse/SPARK-9008 > Project: Spark > Issue Type: New Feature > Components: Deploy >Reporter: Jesper Lundgren >Priority: Minor > > The cluster will automatically restart failing drivers when launched in > supervised cluster mode. However, there is no official way for an operations > team to stop and remove a driver from restarting in case it is > malfunctioning. > I know there is "bin/spark-class org.apache.spark.deploy.Client kill" but > this is undocumented and does not always work so well. > It would be great if there were a way to remove supervised mode so that kill > -9 works on a driver program. > The documentation surrounding this could also see some improvements. It would > be nice to have some best-practice examples of how to work with supervised > mode, how to manage graceful shutdown, and how to catch TERM signals. (A TERM signal > will end with an exit code that triggers a restart in supervised mode unless > you change the exit code in the application logic.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10401) spark-submit --unsupervise
Alberto Miorin created SPARK-10401: -- Summary: spark-submit --unsupervise Key: SPARK-10401 URL: https://issues.apache.org/jira/browse/SPARK-10401 Project: Spark Issue Type: New Feature Components: Deploy, Mesos Affects Versions: 1.5.0 Reporter: Alberto Miorin When I submit a streaming job with the --supervise option to the new Mesos Spark dispatcher, I cannot decommission the job. I tried spark-submit --kill, but the dispatcher always restarts it. The driver and executors are both Docker containers. I think there should be a spark-submit --unsupervise subcommand. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org