[jira] [Comment Edited] (SPARK-23109) ML 2.3 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332698#comment-16332698 ]

Bryan Cutler edited comment on SPARK-23109 at 1/20/18 3:17 AM:
---
I did the following: generated the HTML doc and checked it for consistency with Scala, checked for API breaking changes (saw none), checked for missing items (see list below), and checked that default param values match. No blocking or major issues found.

Items requiring follow-up; I will create (related) JIRAs to fix:

classification:
- GBTClassifier - missing featureSubsetStrategy; should be moved to TreeEnsembleParams
- GBTClassificationModel - missing numClasses; should inherit from JavaClassificationModel
- for both of the above: https://issues.apache.org/jira/browse/SPARK-23161

clustering:
- GaussianMixtureModel - missing gaussians; need to serialize Array[MultivariateGaussian]?
- LDAModel - missing topicsMatrix; can a Matrix be sent through Py4J?

evaluation:
- ClusteringEvaluator - DOC: describe silhouette as the scaladoc does

feature:
- Bucketizer - multiple input/output cols, splitsArray - https://issues.apache.org/jira/browse/SPARK-22797
- ChiSqSelector - DOC: selectorType description missing the new types
- QuantileDiscretizer - multiple input/output cols - https://issues.apache.org/jira/browse/SPARK-22796

fpm:
- DOC: associationRules should say it returns a "DataFrame"

image:
- missing columnSchema, get*; Scala missing toNDArray

regression:
- LinearRegressionSummary - missing r2adj - https://issues.apache.org/jira/browse/SPARK-23162

stat:
- missing Summarizer class - https://issues.apache.org/jira/browse/SPARK-21741

tuning:
- missing subModels, hasSubModels - https://issues.apache.org/jira/browse/SPARK-22005

for the above DOC issues: https://issues.apache.org/jira/browse/SPARK-23163

> ML 2.3 QA: API: Python API coverage
> ---
>
> Key: SPARK-23109
> URL: https://issues.apache.org/jira/browse/SPARK-23109
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, ML, PySpark
> Reporter: Joseph K. Bradley
> Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub? We want the Python doc to
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either
> necessary (intentional) or accidental. These must be recorded and added in
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for
> functionality missing from Python, to be added in the next release cycle.
> *Please use a _separate_ JIRA (linked below as "requires") for this list of
> to-do items.*

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
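The audit described above boils down to diffing the Scala and Python API surfaces and flagging members present on one side only. A minimal sketch of such a comparison helper (a toy stand-in with hypothetical member lists for illustration, not the actual QA tooling):

```python
def diff_api(scala_members, python_members):
    """Return (missing_in_python, extra_in_python): members present in the
    Scala API but absent from Python, and vice versa."""
    scala, python = set(scala_members), set(python_members)
    return sorted(scala - python), sorted(python - scala)

# Hypothetical member lists mirroring one finding from the audit above:
missing_in_python, extra_in_python = diff_api(
    ["numClasses", "predict", "featureSubsetStrategy"],
    ["predict"],
)
```

In practice the real audit also compares default param values and doc completeness, which a simple name diff cannot catch.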
[jira] [Commented] (SPARK-23163) Sync Python ML API docs with Scala
[ https://issues.apache.org/jira/browse/SPARK-23163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333154#comment-16333154 ]

Bryan Cutler commented on SPARK-23163:
--
I'll do this; it's just a few minor things.

> Sync Python ML API docs with Scala
> --
>
> Key: SPARK-23163
> URL: https://issues.apache.org/jira/browse/SPARK-23163
> Project: Spark
> Issue Type: Documentation
> Components: ML, PySpark
> Affects Versions: 2.3.0
> Reporter: Bryan Cutler
> Priority: Trivial
>
> Fix a few doc issues as reported in the 2.3 ML QA, SPARK-23109
[jira] [Created] (SPARK-23163) Sync Python ML API docs with Scala
Bryan Cutler created SPARK-23163:

Summary: Sync Python ML API docs with Scala
Key: SPARK-23163
URL: https://issues.apache.org/jira/browse/SPARK-23163
Project: Spark
Issue Type: Documentation
Components: ML, PySpark
Affects Versions: 2.3.0
Reporter: Bryan Cutler

Fix a few doc issues as reported in the 2.3 ML QA, SPARK-23109
[jira] [Updated] (SPARK-23161) Add missing APIs to Python GBTClassifier
[ https://issues.apache.org/jira/browse/SPARK-23161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Cutler updated SPARK-23161:
-
Labels: starter (was: )

> Add missing APIs to Python GBTClassifier
>
> Key: SPARK-23161
> URL: https://issues.apache.org/jira/browse/SPARK-23161
> Project: Spark
> Issue Type: Improvement
> Components: ML, PySpark
> Affects Versions: 2.3.0
> Reporter: Bryan Cutler
> Priority: Minor
> Labels: starter
>
> GBTClassifier is missing {{featureSubsetStrategy}}. This should be moved to
> {{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs.
> GBTClassificationModel is missing {{numClasses}}. It should inherit from
> {{JavaClassificationModel}} instead of the prediction model, which will give
> it this param.
[jira] [Created] (SPARK-23162) PySpark ML LinearRegressionSummary missing r2adj
Bryan Cutler created SPARK-23162:

Summary: PySpark ML LinearRegressionSummary missing r2adj
Key: SPARK-23162
URL: https://issues.apache.org/jira/browse/SPARK-23162
Project: Spark
Issue Type: Improvement
Components: ML, PySpark
Affects Versions: 2.3.0
Reporter: Bryan Cutler

Missing the Python API for {{r2adj}} in {{LinearRegressionSummary}}
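For reference, adjusted R-squared penalizes R-squared by the number of predictors. A small sketch of the standard textbook formula (this is the generic definition, not Spark's implementation, which derives the terms from its own degrees-of-freedom accounting):

```python
def r2_adjusted(r2, n, p):
    """Standard adjusted R^2 for n observations and p predictors:
    1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# e.g. an R^2 of 0.9 with 100 observations and 4 predictors
value = r2_adjusted(0.9, 100, 4)
```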
[jira] [Updated] (SPARK-23161) Add missing APIs to Python GBTClassifier
[ https://issues.apache.org/jira/browse/SPARK-23161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Cutler updated SPARK-23161:
-
Priority: Minor (was: Major)

> Add missing APIs to Python GBTClassifier
>
> Key: SPARK-23161
> URL: https://issues.apache.org/jira/browse/SPARK-23161
> Project: Spark
> Issue Type: Improvement
> Components: ML, PySpark
> Affects Versions: 2.3.0
> Reporter: Bryan Cutler
> Priority: Minor
> Labels: starter
>
> GBTClassifier is missing {{featureSubsetStrategy}}. This should be moved to
> {{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs.
> GBTClassificationModel is missing {{numClasses}}. It should inherit from
> {{JavaClassificationModel}} instead of the prediction model, which will give
> it this param.
[jira] [Created] (SPARK-23161) Add missing APIs to Python GBTClassifier
Bryan Cutler created SPARK-23161:

Summary: Add missing APIs to Python GBTClassifier
Key: SPARK-23161
URL: https://issues.apache.org/jira/browse/SPARK-23161
Project: Spark
Issue Type: Improvement
Components: ML, PySpark
Affects Versions: 2.3.0
Reporter: Bryan Cutler

GBTClassifier is missing {{featureSubsetStrategy}}. This should be moved to {{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs. GBTClassificationModel is missing {{numClasses}}. It should inherit from {{JavaClassificationModel}} instead of the prediction model, which will give it this param.
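The proposed inheritance change can be sketched with simplified stand-in classes (hypothetical names and bodies; the real PySpark model classes wrap Java objects via Py4J rather than storing state locally):

```python
class JavaPredictionModel:
    """Stand-in for the prediction-model base class (hypothetical)."""
    def predict(self, features):
        raise NotImplementedError

class JavaClassificationModel(JavaPredictionModel):
    """Stand-in mixin: any model that inherits from it gains numClasses,
    which is the point of the proposed change."""
    @property
    def numClasses(self):
        return self._num_classes

class GBTClassificationModel(JavaClassificationModel):
    """By inheriting the classification mixin instead of the plain
    prediction model, numClasses becomes available."""
    def __init__(self, num_classes=2):
        self._num_classes = num_classes
    def predict(self, features):
        return 0  # dummy prediction for the sketch

model = GBTClassificationModel()
```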
[jira] [Updated] (SPARK-23121) When the Spark Streaming app is running for a period of time, the page is incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' and ui can not be a
[ https://issues.apache.org/jira/browse/SPARK-23121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcelo Vanzin updated SPARK-23121:
---
Affects Version/s: (was: 2.4.0) 2.3.0

> After the Spark Streaming app has been running for a period of time,
> accessing '/jobs/' or '/jobs/job/?id=13' returns an error page and the UI
> can not be accessed.
> -
>
> Key: SPARK-23121
> URL: https://issues.apache.org/jira/browse/SPARK-23121
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.3.0
> Reporter: guoxiaolongzte
> Priority: Major
> Attachments: 1.png, 2.png
>
> After the Spark Streaming app has been running for a period of time,
> accessing '/jobs/' or '/jobs/job/?id=13' returns an error page and the UI
> can not be accessed.
>
> Test command:
> ./bin/spark-submit --class org.apache.spark.examples.streaming.HdfsWordCount
> ./examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar /spark
>
> Once the app has been running for a while, the UI can not be accessed;
> please see the attachments.
[jira] [Updated] (SPARK-23121) When the Spark Streaming app is running for a period of time, the page is incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' and ui can not be a
[ https://issues.apache.org/jira/browse/SPARK-23121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcelo Vanzin updated SPARK-23121:
---
Target Version/s: 2.3.0

> After the Spark Streaming app has been running for a period of time,
> accessing '/jobs/' or '/jobs/job/?id=13' returns an error page and the UI
> can not be accessed.
> -
>
> Key: SPARK-23121
> URL: https://issues.apache.org/jira/browse/SPARK-23121
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.3.0
> Reporter: guoxiaolongzte
> Priority: Major
> Attachments: 1.png, 2.png
>
> After the Spark Streaming app has been running for a period of time,
> accessing '/jobs/' or '/jobs/job/?id=13' returns an error page and the UI
> can not be accessed.
>
> Test command:
> ./bin/spark-submit --class org.apache.spark.examples.streaming.HdfsWordCount
> ./examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar /spark
>
> Once the app has been running for a while, the UI can not be accessed;
> please see the attachments.
[jira] [Updated] (SPARK-23091) Incorrect unit test for approxQuantile
[ https://issues.apache.org/jira/browse/SPARK-23091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-23091:
Component/s: SQL

> Incorrect unit test for approxQuantile
> --
>
> Key: SPARK-23091
> URL: https://issues.apache.org/jira/browse/SPARK-23091
> Project: Spark
> Issue Type: Improvement
> Components: ML, SQL, Tests
> Affects Versions: 2.2.1
> Reporter: Kuang Chen
> Priority: Minor
>
> Currently, the test for `approxQuantile` (a quantile estimation algorithm)
> checks whether the estimated quantile is within +- 2*`relativeError` of the
> true quantile. See the code below:
> [https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala#L157]
> However, based on the original paper by Greenwald and Khanna, the estimated
> quantile is guaranteed to be within +- `relativeError` of the true
> quantile. Using double the "tolerance" is misleading and incorrect, and we
> should fix it.
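The Greenwald-Khanna guarantee can be stated as a rank bound: for a target quantile q over n elements, the estimate's rank in the sorted data must lie within eps*n of q*n. A sketch of a check in that spirit (illustrative Python, not the Scala test in DataFrameStatSuite):

```python
def rank_error_within(data, estimate, q, eps):
    """Check the Greenwald-Khanna guarantee: the estimate's rank in the
    sorted data is within eps*n of the target rank q*n."""
    xs = sorted(data)
    n = len(xs)
    # rank of the estimate = number of elements <= estimate
    rank = sum(1 for x in xs if x <= estimate)
    return abs(rank - q * n) <= eps * n

data = list(range(1, 101))  # 1..100
# 50 is an exact median of 1..100, so it passes even with a tight eps
ok = rank_error_within(data, 50, 0.5, 0.05)
```

The issue's point is that the test should use `eps`, not `2 * eps`, as the allowed rank error.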
[jira] [Resolved] (SPARK-21771) SparkSQLEnv creates a useless meta hive client
[ https://issues.apache.org/jira/browse/SPARK-21771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-21771.
-
Resolution: Fixed
Assignee: Kent Yao
Fix Version/s: 2.3.0

> SparkSQLEnv creates a useless meta hive client
> --
>
> Key: SPARK-21771
> URL: https://issues.apache.org/jira/browse/SPARK-21771
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Minor
> Fix For: 2.3.0
>
> Once a meta Hive client is created, it generates its SessionState, which
> creates a lot of session-related directories, some marked deleteOnExit and
> some not. If a Hive client is not going to be used, we should not create it
> at the very start.
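The fix amounts to not constructing the expensive client eagerly. A generic lazy-initialization sketch (hypothetical class and names, not the actual SparkSQLEnv code):

```python
class SparkSQLEnvSketch:
    """Hypothetical sketch: defer creating an expensive client until
    it is first used, so an unused client never allocates resources."""
    def __init__(self, client_factory):
        self._factory = client_factory
        self._client = None

    @property
    def client(self):
        # Construct only on first access; subsequent accesses reuse it.
        if self._client is None:
            self._client = self._factory()
        return self._client

calls = []
# the factory records each invocation so we can see it runs at most once
env = SparkSQLEnvSketch(lambda: calls.append(1) or "client")
```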
[jira] [Commented] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces
[ https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332999#comment-16332999 ]

Henry Robinson commented on SPARK-23148:

It seems like the problem is that {{CodecStreams.createInputStreamWithCloseResource}} can't properly handle a {{path}} argument that's URL-encoded. We could add an overload for {{createInputStreamWithCloseResource(Configuration, Path)}} and then pass {{new Path(new URI(path))}} from {{CSVDataSource.readFile()}}. This has the benefit of being a more localised change (and doesn't change the 'contract' that comes from {{FileScanRDD}} currently having URL-encoded pathnames everywhere). A strawman commit is [here|https://github.com/henryr/spark/commit/b8c51418ee7d4bca18179fd863f7f4885c98c0ef].

> spark.read.csv with multiline=true gives FileNotFoundException if path
> contains spaces
> --
>
> Key: SPARK-23148
> URL: https://issues.apache.org/jira/browse/SPARK-23148
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Bogdan Raducanu
> Priority: Major
>
> Repro code:
> {code:java}
> spark.range(10).write.csv("/tmp/a b c/a.csv")
> spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count
> 10
> spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count
> java.io.FileNotFoundException: File
> file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv
> does not exist
> {code}
> Trying to manually escape fails in a different place:
> {code}
> spark.read.option("multiLine", true).csv("/tmp/a%20b%20c/a.csv").count
> org.apache.spark.sql.AnalysisException: Path does not exist:
> file:/tmp/a%20b%20c/a.csv;
> at
> org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:683)
> at
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
> at
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at scala.collection.immutable.List.foreach(List.scala:381)
> {code}
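In Python terms, the proposed `new Path(new URI(path))` fix is analogous to percent-decoding the path string before opening the file; a small illustrative sketch (not Spark code):

```python
from urllib.parse import unquote

def decode_spark_path(path):
    """Percent-decode a file path the way FileScanRDD reports it,
    e.g. '/tmp/a%20b%20c/a.csv' -> '/tmp/a b c/a.csv'."""
    return unquote(path)

decoded = decode_spark_path("/tmp/a%20b%20c/a.csv")
```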
[jira] [Comment Edited] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces
[ https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332999#comment-16332999 ] Henry Robinson edited comment on SPARK-23148 at 1/19/18 11:25 PM: -- It seems like the problem is that {{CodecStreams.createInputStreamWithCloseResource}} can't properly handle a {{path}} argument that's URL-encoded. We could add an overload for {{createInputStreamWithCloseResource(Configuration, Path)}} and then pass {{new Path(new URI(path))}} from {{CSVDataSource.readFile()}}. This has the benefit of being a more localised change (and doesn't change the 'contract' that comes from {{FileScanRDD}} currently having URL-encoded pathnames everywhere). A strawman commit is [here|https://github.com/henryr/spark/commit/b8c51418ee7d4bca18179fd863f7f4885c98c0ef]. was (Author: henryr): It seems like the problem is that {{CodecStreams.createInputStreamWithCloseResource}} can't properly handle a {{path}} argument that's URL-encoded. We could add an overload for {{createInputStreamWithCloseResource(Configuration, Path)}} and then pass {{new Path(new URI(path))}} from {{CSVDataSource.readFile()}}. This has the benefit of being a more localised change (and doesn't change the 'contract' that comes from {{FileScanRDD}} currently having URL-encoded pathnames everywhere. A strawman commit is [here|https://github.com/henryr/spark/commit/b8c51418ee7d4bca18179fd863f7f4885c98c0ef]. 
> spark.read.csv with multiline=true gives FileNotFoundException if path > contains spaces > -- > > Key: SPARK-23148 > URL: https://issues.apache.org/jira/browse/SPARK-23148 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Bogdan Raducanu >Priority: Major > > Repro code: > {code:java} > spark.range(10).write.csv("/tmp/a b c/a.csv") > spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count > 10 > spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count > java.io.FileNotFoundException: File > file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv > does not exist > {code} > Trying to manually escape fails in a different place: > {code} > spark.read.option("multiLine", true).csv("/tmp/a%20b%20c/a.csv").count > org.apache.spark.sql.AnalysisException: Path does not exist: > file:/tmp/a%20b%20c/a.csv; > at > org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:683) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
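The mechanism behind the FileNotFoundException above is plain percent-encoding: {{FileScanRDD}} hands {{readFile()}} a URL-encoded path string, which the filesystem then treats literally, so {{/tmp/a b c}} is looked up as a directory literally named {{a%20b%20c}}. A minimal sketch of the decoding step that the proposed {{new Path(new URI(path))}} overload relies on (illustration only, not the actual Spark source):

```scala
import java.net.URI

// "/tmp/a b c/a.csv" arrives from FileScanRDD as a URL-encoded string.
val encoded = "file:/tmp/a%20b%20c/a.csv"

// Routing the string through java.net.URI decodes the percent-escapes,
// so a Hadoop Path built as new Path(new URI(encoded)) points at the
// real on-disk file instead of a literal "a%20b%20c" directory.
val decodedPath = new URI(encoded).getPath
// decodedPath == "/tmp/a b c/a.csv"
```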
[jira] [Assigned] (SPARK-23087) CheckCartesianProduct too restrictive when condition is constant folded to false/null
[ https://issues.apache.org/jira/browse/SPARK-23087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23087: Assignee: (was: Apache Spark) > CheckCartesianProduct too restrictive when condition is constant folded to > false/null > - > > Key: SPARK-23087 > URL: https://issues.apache.org/jira/browse/SPARK-23087 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Juliusz Sompolski >Priority: Minor > > Running > {code} > sql("SELECT id as a FROM RANGE(10)").createOrReplaceTempView("A") > sql("SELECT NULL as a FROM RANGE(10)").createOrReplaceTempView("NULLTAB") > sql("SELECT 1 as goo FROM A LEFT OUTER JOIN NULLTAB ON A.a = > NULLTAB.a").collect() > {code} > results in: > {code} > org.apache.spark.sql.AnalysisException: Detected cartesian product for LEFT > OUTER join between logical plans > Project > +- Range (0, 10, step=1, splits=None) > and > Project > +- Range (0, 10, step=1, splits=None) > Join condition is missing or trivial. > Use the CROSS JOIN syntax to allow cartesian products between these > relations.; > at > > org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121) > {code} > This is because NULLTAB.a is constant folded to null, and then the condition > is constant folded altogether: > {code} > === Applying Rule org.apache.spark.sql.catalyst.optimizer.NullPropagation === > GlobalLimit 21 > +- LocalLimit 21 > +- Project [1 AS goo#28] > ! 
+- Join LeftOuter, (a#0L = null) > :- Project [id#1L AS a#0L] > : +- Range (0, 10, step=1, splits=None) > +- Project > +- Range (0, 10, step=1, splits=None) > GlobalLimit 21 > +- LocalLimit 21 >+- Project [1 AS goo#28] > +- Join LeftOuter, null > :- Project [id#1L AS a#0L] > : +- Range (0, 10, step=1, splits=None) > +- Project > +- Range (0, 10, step=1, splits=None) > {code} > And then CheckCartesianProduct doesn't like it, even though the condition > does not produce a cartesian product, but evaluates to null. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
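A plausible shape for the fix (a hedged sketch against Catalyst's {{Literal}}, not necessarily the eventual patch): let {{CheckCartesianProducts}} treat a condition that was constant-folded to a {{false}} or {{null}} literal as trivially selective rather than as a missing condition, since such a join matches no rows and cannot explode into a cartesian product.

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}
import org.apache.spark.sql.types.BooleanType

// Sketch: a join condition folded to literal false/null produces no
// matched rows, so it should not be flagged as a cartesian product.
def isTriviallyEmpty(condition: Option[Expression]): Boolean = condition match {
  case Some(Literal(false, BooleanType)) => true
  case Some(Literal(null, BooleanType))  => true
  case _                                 => false
}
```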
[jira] [Assigned] (SPARK-23087) CheckCartesianProduct too restrictive when condition is constant folded to false/null
[ https://issues.apache.org/jira/browse/SPARK-23087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23087: Assignee: Apache Spark > CheckCartesianProduct too restrictive when condition is constant folded to > false/null > - > > Key: SPARK-23087 > URL: https://issues.apache.org/jira/browse/SPARK-23087 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Juliusz Sompolski >Assignee: Apache Spark >Priority: Minor > > Running > {code} > sql("SELECT id as a FROM RANGE(10)").createOrReplaceTempView("A") > sql("SELECT NULL as a FROM RANGE(10)").createOrReplaceTempView("NULLTAB") > sql("SELECT 1 as goo FROM A LEFT OUTER JOIN NULLTAB ON A.a = > NULLTAB.a").collect() > {code} > results in: > {code} > org.apache.spark.sql.AnalysisException: Detected cartesian product for LEFT > OUTER join between logical plans > Project > +- Range (0, 10, step=1, splits=None) > and > Project > +- Range (0, 10, step=1, splits=None) > Join condition is missing or trivial. > Use the CROSS JOIN syntax to allow cartesian products between these > relations.; > at > > org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121) > {code} > This is because NULLTAB.a is constant folded to null, and then the condition > is constant folded altogether: > {code} > === Applying Rule org.apache.spark.sql.catalyst.optimizer.NullPropagation === > GlobalLimit 21 > +- LocalLimit 21 > +- Project [1 AS goo#28] > ! 
+- Join LeftOuter, (a#0L = null) > :- Project [id#1L AS a#0L] > : +- Range (0, 10, step=1, splits=None) > +- Project > +- Range (0, 10, step=1, splits=None) > GlobalLimit 21 > +- LocalLimit 21 >+- Project [1 AS goo#28] > +- Join LeftOuter, null > :- Project [id#1L AS a#0L] > : +- Range (0, 10, step=1, splits=None) > +- Project > +- Range (0, 10, step=1, splits=None) > {code} > And then CheckCartesianProduct doesn't like it, even though the condition > does not produce a cartesian product, but evaluates to null. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23087) CheckCartesianProduct too restrictive when condition is constant folded to false/null
[ https://issues.apache.org/jira/browse/SPARK-23087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332907#comment-16332907 ] Apache Spark commented on SPARK-23087: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/20333 > CheckCartesianProduct too restrictive when condition is constant folded to > false/null > - > > Key: SPARK-23087 > URL: https://issues.apache.org/jira/browse/SPARK-23087 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Juliusz Sompolski >Priority: Minor > > Running > {code} > sql("SELECT id as a FROM RANGE(10)").createOrReplaceTempView("A") > sql("SELECT NULL as a FROM RANGE(10)").createOrReplaceTempView("NULLTAB") > sql("SELECT 1 as goo FROM A LEFT OUTER JOIN NULLTAB ON A.a = > NULLTAB.a").collect() > {code} > results in: > {code} > org.apache.spark.sql.AnalysisException: Detected cartesian product for LEFT > OUTER join between logical plans > Project > +- Range (0, 10, step=1, splits=None) > and > Project > +- Range (0, 10, step=1, splits=None) > Join condition is missing or trivial. > Use the CROSS JOIN syntax to allow cartesian products between these > relations.; > at > > org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121) > {code} > This is because NULLTAB.a is constant folded to null, and then the condition > is constant folded altogether: > {code} > === Applying Rule org.apache.spark.sql.catalyst.optimizer.NullPropagation === > GlobalLimit 21 > +- LocalLimit 21 > +- Project [1 AS goo#28] > ! 
+- Join LeftOuter, (a#0L = null) > :- Project [id#1L AS a#0L] > : +- Range (0, 10, step=1, splits=None) > +- Project > +- Range (0, 10, step=1, splits=None) > GlobalLimit 21 > +- LocalLimit 21 >+- Project [1 AS goo#28] > +- Join LeftOuter, null > :- Project [id#1L AS a#0L] > : +- Range (0, 10, step=1, splits=None) > +- Project > +- Range (0, 10, step=1, splits=None) > {code} > And then CheckCartesianProduct doesn't like it, even though the condition > does not produce a cartesian product, but evaluates to null. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23135) Accumulators don't show up properly in the Stages page anymore
[ https://issues.apache.org/jira/browse/SPARK-23135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sameer Agarwal resolved SPARK-23135. Resolution: Fixed > Accumulators don't show up properly in the Stages page anymore > -- > > Key: SPARK-23135 > URL: https://issues.apache.org/jira/browse/SPARK-23135 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 > Environment: > > >Reporter: Burak Yavuz >Assignee: Marcelo Vanzin >Priority: Blocker > Fix For: 2.3.0 > > Attachments: webUIAccumulatorRegression.png > > > Didn't do a lot of digging but may be caused by: > [https://github.com/apache/spark/commit/1c70da3bfbb4016e394de2c73eb0db7cdd9a6968#diff-0d37752c6ec3d902aeff701771b4e932] > > !webUIAccumulatorRegression.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23087) CheckCartesianProduct too restrictive when condition is constant folded to false/null
[ https://issues.apache.org/jira/browse/SPARK-23087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido updated SPARK-23087: Priority: Minor (was: Major) > CheckCartesianProduct too restrictive when condition is constant folded to > false/null > - > > Key: SPARK-23087 > URL: https://issues.apache.org/jira/browse/SPARK-23087 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Juliusz Sompolski >Priority: Minor > > Running > {code} > sql("SELECT id as a FROM RANGE(10)").createOrReplaceTempView("A") > sql("SELECT NULL as a FROM RANGE(10)").createOrReplaceTempView("NULLTAB") > sql("SELECT 1 as goo FROM A LEFT OUTER JOIN NULLTAB ON A.a = > NULLTAB.a").collect() > {code} > results in: > {code} > org.apache.spark.sql.AnalysisException: Detected cartesian product for LEFT > OUTER join between logical plans > Project > +- Range (0, 10, step=1, splits=None) > and > Project > +- Range (0, 10, step=1, splits=None) > Join condition is missing or trivial. > Use the CROSS JOIN syntax to allow cartesian products between these > relations.; > at > > org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121) > {code} > This is because NULLTAB.a is constant folded to null, and then the condition > is constant folded altogether: > {code} > === Applying Rule org.apache.spark.sql.catalyst.optimizer.NullPropagation === > GlobalLimit 21 > +- LocalLimit 21 > +- Project [1 AS goo#28] > ! 
+- Join LeftOuter, (a#0L = null) > :- Project [id#1L AS a#0L] > : +- Range (0, 10, step=1, splits=None) > +- Project > +- Range (0, 10, step=1, splits=None) > GlobalLimit 21 > +- LocalLimit 21 >+- Project [1 AS goo#28] > +- Join LeftOuter, null > :- Project [id#1L AS a#0L] > : +- Range (0, 10, step=1, splits=None) > +- Project > +- Range (0, 10, step=1, splits=None) > {code} > And then CheckCartesianProduct doesn't like it, even though the condition > does not produce a cartesian product, but evaluates to null. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23135) Accumulators don't show up properly in the Stages page anymore
[ https://issues.apache.org/jira/browse/SPARK-23135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sameer Agarwal reassigned SPARK-23135: -- Assignee: Marcelo Vanzin Fix Version/s: 2.3.0 Issue resolved by pull request 20299 https://github.com/apache/spark/pull/20299 > Accumulators don't show up properly in the Stages page anymore > -- > > Key: SPARK-23135 > URL: https://issues.apache.org/jira/browse/SPARK-23135 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 > Environment: > > >Reporter: Burak Yavuz >Assignee: Marcelo Vanzin >Priority: Blocker > Fix For: 2.3.0 > > Attachments: webUIAccumulatorRegression.png > > > Didn't do a lot of digging but may be caused by: > [https://github.com/apache/spark/commit/1c70da3bfbb4016e394de2c73eb0db7cdd9a6968#diff-0d37752c6ec3d902aeff701771b4e932] > > !webUIAccumulatorRegression.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18085) SPIP: Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-18085. Resolution: Fixed Fix Version/s: 2.3.0 All of the sub-tasks of the SPIP are committed, so I'm closing this out. There are still a whole bunch of enhancements that can be done on top of the new stuff, but those can be added later. Thanks to all who helped with reviews and testing! > SPIP: Better History Server scalability for many / large applications > - > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Priority: Major > Labels: SPIP > Fix For: 2.3.0 > > Attachments: screenshot-1.png, screenshot-2.png, spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12963) In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' failed after 16 retries!
[ https://issues.apache.org/jira/browse/SPARK-12963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated SPARK-12963: -- Shepherd: Sean Owen > In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' > failed after 16 retries! > - > > Key: SPARK-12963 > URL: https://issues.apache.org/jira/browse/SPARK-12963 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.6.0 >Reporter: lichenglin >Priority: Critical > > I have a 3-node cluster: namenode, second and data1; > I use this shell to submit a job on namenode: > bin/spark-submit --deploy-mode cluster --class com.bjdv.spark.job.Abc > --total-executor-cores 5 --master spark://namenode:6066 > hdfs://namenode:9000/sparkjars/spark.jar > The driver may be started on another node such as data1. > The problem is: > when I set SPARK_LOCAL_IP in conf/spark-env.sh on namenode, > the driver will be started with this param, such as > SPARK_LOCAL_IP=namenode > but the driver will start at data1, > so the driver will try to bind the IP 'namenode' on data1, > and the driver will throw an exception like this: > Service 'Driver' failed after 16 retries! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself
[ https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-21994: --- Attachment: (was: Srinivasa Reddy Vundela.url) > Spark 2.2 can not read Parquet table created by itself > -- > > Key: SPARK-21994 > URL: https://issues.apache.org/jira/browse/SPARK-21994 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Spark 2.2 on Cloudera CDH 5.10.1, Hive 1.1 >Reporter: Jurgis Pods >Priority: Major > > This seems to be a new bug introduced in Spark 2.2, since it did not occur > under Spark 2.1. > When writing a dataframe to a table in Parquet format, Spark SQL does not > write the 'path' of the table to the Hive metastore, unlike in previous > versions. > As a consequence, Spark 2.2 is not able to read the table it just created. It > just outputs the table header without any row content. > A parallel installation of Spark 1.6 at least produces an appropriate error > trace: > {code:java} > 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found > in metastore. hive.metastore.schema.verification is not enabled so recording > the schema version 1.1.0 > 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default, > returning NoSuchObjectException > org.spark-project.guava.util.concurrent.UncheckedExecutionException: > java.util.NoSuchElementException: key not found: path > [...] > {code} > h3. 
Steps to reproduce: > Run the following in spark2-shell: > {code:java} > scala> val df = spark.sql("show databases") > scala> df.show() > ++ > |databaseName| > ++ > | mydb1| > | mydb2| > | default| > |test| > ++ > scala> df.write.format("parquet").saveAsTable("test.spark22_test") > scala> spark.sql("select * from test.spark22_test").show() > ++ > |databaseName| > ++ > ++{code} > When manually setting the path (causing the data to be saved as external > table), it works: > {code:java} > scala> df.write.option("path", > "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path") > scala> spark.sql("select * from test.spark22_parquet_with_path").show() > ++ > |databaseName| > ++ > | mydb1| > | mydb2| > | default| > |test| > ++ > {code} > A second workaround is to update the metadata of the managed table created by > Spark 2.2: > {code} > spark.sql("alter table test.spark22_test set SERDEPROPERTIES > ('path'='hdfs://my-cluster-name:8020/hadoop/eco/hive/warehouse/test.db/spark22_test')") > spark.catalog.refreshTable("test.spark22_test") > spark.sql("select * from test.spark22_test").show() > ++ > |databaseName| > ++ > | mydb1| > | mydb2| > | default| > |test| > ++ > {code} > It is kind of a disaster that we are not able to read tables created by the > very same Spark version and have to manually specify the path as an explicit > option. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself
[ https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zikun updated SPARK-21994: -- Attachment: Srinivasa Reddy Vundela.url > Spark 2.2 can not read Parquet table created by itself > -- > > Key: SPARK-21994 > URL: https://issues.apache.org/jira/browse/SPARK-21994 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Spark 2.2 on Cloudera CDH 5.10.1, Hive 1.1 >Reporter: Jurgis Pods >Priority: Major > Attachments: Srinivasa Reddy Vundela.url > > > This seems to be a new bug introduced in Spark 2.2, since it did not occur > under Spark 2.1. > When writing a dataframe to a table in Parquet format, Spark SQL does not > write the 'path' of the table to the Hive metastore, unlike in previous > versions. > As a consequence, Spark 2.2 is not able to read the table it just created. It > just outputs the table header without any row content. > A parallel installation of Spark 1.6 at least produces an appropriate error > trace: > {code:java} > 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found > in metastore. hive.metastore.schema.verification is not enabled so recording > the schema version 1.1.0 > 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default, > returning NoSuchObjectException > org.spark-project.guava.util.concurrent.UncheckedExecutionException: > java.util.NoSuchElementException: key not found: path > [...] > {code} > h3. 
Steps to reproduce: > Run the following in spark2-shell: > {code:java} > scala> val df = spark.sql("show databases") > scala> df.show() > ++ > |databaseName| > ++ > | mydb1| > | mydb2| > | default| > |test| > ++ > scala> df.write.format("parquet").saveAsTable("test.spark22_test") > scala> spark.sql("select * from test.spark22_test").show() > ++ > |databaseName| > ++ > ++{code} > When manually setting the path (causing the data to be saved as external > table), it works: > {code:java} > scala> df.write.option("path", > "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path") > scala> spark.sql("select * from test.spark22_parquet_with_path").show() > ++ > |databaseName| > ++ > | mydb1| > | mydb2| > | default| > |test| > ++ > {code} > A second workaround is to update the metadata of the managed table created by > Spark 2.2: > {code} > spark.sql("alter table test.spark22_test set SERDEPROPERTIES > ('path'='hdfs://my-cluster-name:8020/hadoop/eco/hive/warehouse/test.db/spark22_test')") > spark.catalog.refreshTable("test.spark22_test") > spark.sql("select * from test.spark22_test").show() > ++ > |databaseName| > ++ > | mydb1| > | mydb2| > | default| > |test| > ++ > {code} > It is kind of a disaster that we are not able to read tables created by the > very same Spark version and have to manually specify the path as an explicit > option. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11499) Spark History Server UI should respect protocol when doing redirection
[ https://issues.apache.org/jira/browse/SPARK-11499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332811#comment-16332811 ] paul mackles commented on SPARK-11499: -- We ran into this issue running the spark-history server as a Marathon app on a Mesos cluster. As is typical for this kind of setup, there is a reverse-proxy that users go through to access the app. In our case, we are also offloading SSL to the reverse-proxy so communications between the reverse-proxy and spark-history are plain-old HTTP. I experimented with 2 different fixes: # Making sure that the SparkUI and History components look at APPLICATION_WEB_PROXY_BASE when generating redirect URLs. In order for it to honor the protocol, APPLICATION_WEB_PROXY_BASE must include the desired protocol (i.e. APPLICATION_WEB_PROXY_BASE=https://example.com) # Using Jetty's built-in ForwardRequestCustomizer class to process "X-Forwarded-*" headers defined in rfc7239. Both changes worked in our environment and both changes are fairly simple. Looking for feedback on whether one solution is preferable to the other. For our environment, #2 is preferable because: * The reverse proxy we use is already sending these headers. * Allows for the spark-history server to see the actual client info as opposed to that of the proxy If no strong feelings one way or another, I'll submit a PR for solution #2. 
References: * [https://tools.ietf.org/html/rfc7239] * [http://download.eclipse.org/jetty/stable-9/apidocs/org/eclipse/jetty/server/ForwardedRequestCustomizer.html] > Spark History Server UI should respect protocol when doing redirection > -- > > Key: SPARK-11499 > URL: https://issues.apache.org/jira/browse/SPARK-11499 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: Lukasz Jastrzebski >Priority: Major > > Use case: > Spark history server is behind load balancer secured with ssl certificate, > unfortunately clicking on the application link redirects it to http protocol, > which may be not expose by load balancer, example flow: > * Trying 52.22.220.1... > * Connected to xxx.yyy.com (52.22.220.1) port 8775 (#0) > * WARNING: SSL: Certificate type not set, assuming PKCS#12 format. > * Client certificate: u...@yyy.com > * TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 > * Server certificate: *.yyy.com > * Server certificate: Entrust Certification Authority - L1K > * Server certificate: Entrust Root Certification Authority - G2 > > GET /history/20151030-160604-3039174572-5951-22401-0004 HTTP/1.1 > > Host: xxx.yyy.com:8775 > > User-Agent: curl/7.43.0 > > Accept: */* > > > < HTTP/1.1 302 Found > < Location: > http://xxx.yyy.com:8775/history/20151030-160604-3039174572-5951-22401-0004 > < Connection: close > < Server: Jetty(8.y.z-SNAPSHOT) > < > * Closing connection 0 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
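Solution #2 above maps onto Jetty's standard configuration hook; a minimal sketch (assuming Jetty 9, outside of Spark's actual server wiring in {{JettyUtils}}) looks like:

```scala
import org.eclipse.jetty.server.{ForwardedRequestCustomizer, HttpConfiguration,
  HttpConnectionFactory, Server, ServerConnector}

val httpConfig = new HttpConfiguration()
// Process X-Forwarded-Proto / X-Forwarded-For / X-Forwarded-Host (RFC 7239),
// so redirects generated behind the reverse proxy keep the client's original
// scheme (https) and host instead of the proxy-to-server plain-HTTP values.
httpConfig.addCustomizer(new ForwardedRequestCustomizer())

val server = new Server()
val connector = new ServerConnector(server, new HttpConnectionFactory(httpConfig))
server.addConnector(connector)
```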
[jira] [Resolved] (SPARK-23103) LevelDB store not iterating correctly when indexed value has negative value
[ https://issues.apache.org/jira/browse/SPARK-23103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-23103. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20284 [https://github.com/apache/spark/pull/20284] > LevelDB store not iterating correctly when indexed value has negative value > --- > > Key: SPARK-23103 > URL: https://issues.apache.org/jira/browse/SPARK-23103 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > Marking as minor since I don't believe we currently have anything that needs > to store negative values in indexed fields. But I wrote a unit test and got: > > {noformat} > [error] Test > org.apache.spark.util.kvstore.LevelDBSuite.testNegativeIndexValues failed: > java.lang.AssertionError: expected:<[-50, 0, 50]> but was:<[[0, -50, 50]]>, > took 0.025 sec > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
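The misordering in the test output ({{[0, -50, 50]}}) is what unsigned lexicographic key comparison, LevelDB's default, does to two's-complement encodings: negative longs start with a set sign bit and therefore sort after all non-negative ones. A self-contained sketch of the usual remedy (flip the sign bit before encoding; an illustration, not Spark's actual key format):

```scala
import java.nio.ByteBuffer

// Encode a long so that unsigned byte-wise comparison matches numeric order:
// XOR-ing with Long.MinValue flips the sign bit, mapping
// Long.MinValue..Long.MaxValue monotonically onto 0x00.. .. 0xFF..
def orderedKey(v: Long): Array[Byte] =
  ByteBuffer.allocate(8).putLong(v ^ Long.MinValue).array()

// Unsigned lexicographic comparison, as LevelDB's default comparator does.
def unsignedCompare(a: Array[Byte], b: Array[Byte]): Int =
  a.zip(b).map { case (x, y) => (x & 0xff) - (y & 0xff) }
    .find(_ != 0).getOrElse(0)

val sorted = Seq(50L, -50L, 0L)
  .sortWith((x, y) => unsignedCompare(orderedKey(x), orderedKey(y)) < 0)
// sorted == Seq(-50L, 0L, 50L), the order the unit test expected
```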
[jira] [Assigned] (SPARK-23103) LevelDB store not iterating correctly when indexed value has negative value
[ https://issues.apache.org/jira/browse/SPARK-23103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-23103: Assignee: Marcelo Vanzin > LevelDB store not iterating correctly when indexed value has negative value > --- > > Key: SPARK-23103 > URL: https://issues.apache.org/jira/browse/SPARK-23103 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > > Marking as minor since I don't believe we currently have anything that needs > to store negative values in indexed fields. But I wrote a unit test and got: > > {noformat} > [error] Test > org.apache.spark.util.kvstore.LevelDBSuite.testNegativeIndexValues failed: > java.lang.AssertionError: expected:<[-50, 0, 50]> but was:<[[0, -50, 50]]>, > took 0.025 sec > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20664) Remove stale applications from SHS listing
[ https://issues.apache.org/jira/browse/SPARK-20664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-20664. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20138 [https://github.com/apache/spark/pull/20138] > Remove stale applications from SHS listing > -- > > Key: SPARK-20664 > URL: https://issues.apache.org/jira/browse/SPARK-20664 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Major > Fix For: 2.3.0 > > > See spec in parent issue (SPARK-18085) for more details. > This task is actually not explicit in the spec, and it's also an issue with > the current SHS. But having the SHS persist listing data makes it worse. > Basically, the SHS currently does not detect when files are deleted from the > event log directory manually; so those applications are still listed, and > trying to see their UI will either show the UI (if it's loaded) or an error > (if it's not). > With the new SHS, that also means that data is leaked in the disk stores used > to persist listing and UI data, making the problem worse. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20664) Remove stale applications from SHS listing
[ https://issues.apache.org/jira/browse/SPARK-20664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-20664: Assignee: Marcelo Vanzin > Remove stale applications from SHS listing > -- > > Key: SPARK-20664 > URL: https://issues.apache.org/jira/browse/SPARK-20664 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Major > Fix For: 2.3.0 > > > See spec in parent issue (SPARK-18085) for more details. > This task is actually not explicit in the spec, and it's also an issue with > the current SHS. But having the SHS persist listing data makes it worse. > Basically, the SHS currently does not detect when files are deleted from the > event log directory manually; so those applications are still listed, and > trying to see their UI will either show the UI (if it's loaded) or an error > (if it's not). > With the new SHS, that also means that data is leaked in the disk stores used > to persist listing and UI data, making the problem worse. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22360) Add unit test for Window Specifications
[ https://issues.apache.org/jira/browse/SPARK-22360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332773#comment-16332773 ] Jiang Xingbo commented on SPARK-22360: -- Created https://issues.apache.org/jira/browse/SPARK-23160 > Add unit test for Window Specifications > --- > > Key: SPARK-22360 > URL: https://issues.apache.org/jira/browse/SPARK-22360 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo >Priority: Major > > * different partition clauses (none, one, multiple) > * different order clauses (none, one, multiple, asc/desc, nulls first/last) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23160) Add more window sql tests
Jiang Xingbo created SPARK-23160: Summary: Add more window sql tests Key: SPARK-23160 URL: https://issues.apache.org/jira/browse/SPARK-23160 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Jiang Xingbo We should also cover the window SQL interface, for example in `sql/core/src/test/resources/sql-tests/inputs/window.sql`; it would also be interesting to see whether we can generate consistent results for window tests in other major databases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22360) Add unit test for Window Specifications
[ https://issues.apache.org/jira/browse/SPARK-22360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332762#comment-16332762 ] Jiang Xingbo commented on SPARK-22360: -- Sorry for the late response. It's great that we can cover the DataFrame test cases; I really think we should have them soon. Besides, we should also cover the window SQL interface, for example in `sql/core/src/test/resources/sql-tests/inputs/window.sql`; it would also be interesting to see whether we can generate results consistent with window tests in other major databases. [~smilegator] WDYT? > Add unit test for Window Specifications > --- > > Key: SPARK-22360 > URL: https://issues.apache.org/jira/browse/SPARK-22360 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo >Priority: Major > > * different partition clauses (none, one, multiple) > * different order clauses (none, one, multiple, asc/desc, nulls first/last) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
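The partition/order clause matrix listed in SPARK-22360 can be enumerated mechanically. A minimal sketch in plain Python (illustrative only; the column names and the generator itself are not from Spark's test suite):

```python
from itertools import product

# Hypothetical enumeration of window-spec test cases: every combination of
# partition clause (none / one / multiple columns) with order clause
# (none / one / multiple, mixing direction and null ordering).
PARTITIONS = ["", "PARTITION BY a", "PARTITION BY a, b"]
ORDERS = ["", "ORDER BY c", "ORDER BY c ASC NULLS FIRST, d DESC NULLS LAST"]

def window_specs():
    """Yield OVER (...) clauses covering all partition/order combinations."""
    for part, order in product(PARTITIONS, ORDERS):
        spec = " ".join(s for s in (part, order) if s)
        yield f"OVER ({spec})"

specs = list(window_specs())
# 3 partition variants x 3 order variants = 9 window specifications
```

Each generated spec could then be spliced into a query template in a golden file like `window.sql` and the results compared against other databases.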
[jira] [Commented] (SPARK-23138) Add user guide example for multiclass logistic regression summary
[ https://issues.apache.org/jira/browse/SPARK-23138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332754#comment-16332754 ] Apache Spark commented on SPARK-23138: -- User 'sethah' has created a pull request for this issue: https://github.com/apache/spark/pull/20332 > Add user guide example for multiclass logistic regression summary > - > > Key: SPARK-23138 > URL: https://issues.apache.org/jira/browse/SPARK-23138 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 2.3.0 >Reporter: Seth Hendrickson >Priority: Minor > > We haven't updated the user guide to reflect the multiclass logistic > regression summary added in SPARK-17139. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23138) Add user guide example for multiclass logistic regression summary
[ https://issues.apache.org/jira/browse/SPARK-23138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23138: Assignee: (was: Apache Spark) > Add user guide example for multiclass logistic regression summary > - > > Key: SPARK-23138 > URL: https://issues.apache.org/jira/browse/SPARK-23138 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 2.3.0 >Reporter: Seth Hendrickson >Priority: Minor > > We haven't updated the user guide to reflect the multiclass logistic > regression summary added in SPARK-17139. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23138) Add user guide example for multiclass logistic regression summary
[ https://issues.apache.org/jira/browse/SPARK-23138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23138: Assignee: Apache Spark > Add user guide example for multiclass logistic regression summary > - > > Key: SPARK-23138 > URL: https://issues.apache.org/jira/browse/SPARK-23138 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 2.3.0 >Reporter: Seth Hendrickson >Assignee: Apache Spark >Priority: Minor > > We haven't updated the user guide to reflect the multiclass logistic > regression summary added in SPARK-17139. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23159) Update Cloudpickle to match version 0.4.2
[ https://issues.apache.org/jira/browse/SPARK-23159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332752#comment-16332752 ] Bryan Cutler commented on SPARK-23159: -- I can work on this > Update Cloudpickle to match version 0.4.2 > - > > Key: SPARK-23159 > URL: https://issues.apache.org/jira/browse/SPARK-23159 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Priority: Major > > Update PySpark's version of Cloudpickle to match version 0.4.2. The reasons > for doing this are: > * Pick up bug fixes, improvements with newer version > * Match a specific version as close as possible (Spark has additional > changes that might be necessary) to make future upgrades easier > There are newer versions of Cloudpickle that can fix bugs with NamedTuple > pickling (that Spark currently has workarounds for), but these include other > changes that need some verification before bringing into Spark. Upgrading > first to 0.4.2 will help make this verification easier. > Discussion on the mailing list: > http://apache-spark-developers-list.1001551.n3.nabble.com/Thoughts-on-Cloudpickle-Update-td23188.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23159) Update Cloudpickle to match version 0.4.2
Bryan Cutler created SPARK-23159: Summary: Update Cloudpickle to match version 0.4.2 Key: SPARK-23159 URL: https://issues.apache.org/jira/browse/SPARK-23159 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.3.0 Reporter: Bryan Cutler Update PySpark's version of Cloudpickle to match version 0.4.2. The reasons for doing this are: * Pick up bug fixes, improvements with newer version * Match a specific version as close as possible (Spark has additional changes that might be necessary) to make future upgrades easier There are newer versions of Cloudpickle that can fix bugs with NamedTuple pickling (that Spark currently has workarounds for), but these include other changes that need some verification before bringing into Spark. Upgrading first to 0.4.2 will help make this verification easier. Discussion on the mailing list: http://apache-spark-developers-list.1001551.n3.nabble.com/Thoughts-on-Cloudpickle-Update-td23188.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
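The NamedTuple-pickling limitation mentioned above stems from how the stdlib serializes functions and classes: by reference (module plus attribute name), not by value. A short sketch of the difference cloudpickle papers over, using only the standard `pickle` module:

```python
import pickle

# Stdlib pickle serializes functions by reference (module name + attribute
# lookup), so a top-level def round-trips fine, but a lambda -- which has no
# importable name -- cannot be pickled. cloudpickle serializes the code
# object by value instead, which is why PySpark bundles its own copy.
def double(x):
    return x * 2

square = lambda x: x * x

# Round-trip a named, module-level function: works by reference.
by_reference_ok = pickle.loads(pickle.dumps(double))(21) == 42

# A lambda fails the attribute lookup and cannot be pickled by the stdlib.
try:
    pickle.dumps(square)
    lambda_pickles = True
except Exception:
    lambda_pickles = False
```

(The actual NamedTuple corner cases are subtler; this only illustrates the by-reference vs. by-value distinction that motivates the upgrade.)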
[jira] [Commented] (SPARK-23109) ML 2.3 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332703#comment-16332703 ] Bryan Cutler commented on SPARK-23109: -- [~josephkb] the image module is missing many of the get* methods that are in Scala - is it meant to have an equivalent API or is the usage a little different? > ML 2.3 QA: API: Python API coverage > --- > > Key: SPARK-23109 > URL: https://issues.apache.org/jira/browse/SPARK-23109 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Priority: Blocker > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23109) ML 2.3 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332698#comment-16332698 ] Bryan Cutler commented on SPARK-23109: -- I did the following: generated HTML doc and checked for consistency with Scala, did not see any API breaking changes, checked for missing items (see list below), checked default param values match. No blocking or major issues found. Items requiring follow-up, I will create (related) JIRAs to fix: classification: GBTClassifier - missing featureSubsetStrategy, should be moved to TreeEnsembleParams GBTClassificationModel - missing numClasses, should inherit from JavaClassificationModel clustering: GaussianMixtureModel - missing gaussians, need to serialize Array[MultivariateGaussian]? LDAModel - missing topicsMatrix - can send Matrix through Py4J? evaluation: ClusteringEvaluator - DOC describe silhouette like scaladoc feature: Bucketizer - multiple input/output cols, splitsArray - https://issues.apache.org/jira/browse/SPARK-22797 ChiSqSelector - DOC selectorType desc missing new types QuantileDiscretizer - multiple input/output cols - https://issues.apache.org/jira/browse/SPARK-22796 fpm: DOC associationRules should say it returns a "DataFrame" image: missing columnSchema, get*, scala missing toNDArray regression: LinearRegressionSummary - missing r2adj stat: missing Summarizer class - https://issues.apache.org/jira/browse/SPARK-21741 tuning: missing subModels, hasSubModels - https://issues.apache.org/jira/browse/SPARK-22005 > ML 2.3 QA: API: Python API coverage > --- > > Key: SPARK-23109 > URL: https://issues.apache.org/jira/browse/SPARK-23109 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Priority: Blocker > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. 
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
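The kind of parity audit described above can be partly mechanized by diffing public member names against the Scala API docs. A toy sketch (the stub class and member lists are hypothetical, not Spark's actual classes):

```python
# Minimal sketch of an API-parity check: compare the public members of a
# Python class against an expected member list taken from the Scala docs.
def missing_members(py_class, scala_members):
    """Return Scala members with no same-named public Python counterpart."""
    py_members = {name for name in dir(py_class) if not name.startswith("_")}
    return sorted(set(scala_members) - py_members)

class GBTModelStub:  # hypothetical stand-in for a PySpark model class
    def predict(self):
        ...
    def transform(self):
        ...

# featureSubsetStrategy / numClasses mirror the gaps noted in the audit
scala_api = ["predict", "transform", "numClasses", "featureSubsetStrategy"]
gaps = missing_members(GBTModelStub, scala_api)
# gaps == ['featureSubsetStrategy', 'numClasses']
```

A real audit would still need a human pass for doc completeness and default param values, which name-diffing cannot catch.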
[jira] [Resolved] (SPARK-23137) spark.kubernetes.executor.podNamePrefix is ignored
[ https://issues.apache.org/jira/browse/SPARK-23137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23137. Resolution: Fixed Assignee: Anirudh Ramanathan Fix Version/s: 2.3.0 > spark.kubernetes.executor.podNamePrefix is ignored > -- > > Key: SPARK-23137 > URL: https://issues.apache.org/jira/browse/SPARK-23137 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Assignee: Anirudh Ramanathan >Priority: Major > Fix For: 2.3.0 > > > [~liyinan926] is fixing this as we speak. Should be a very minor change. > It's also a non-critical option, so, if we decide that the safer thing is to > just remove it, we can do that as well. Will leave that decision to the > release czar and reviewers. > > [~vanzin] [~felixcheung] [~sameerag] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23104) Document that kubernetes is still "experimental"
[ https://issues.apache.org/jira/browse/SPARK-23104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23104. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20314 [https://github.com/apache/spark/pull/20314] > Document that kubernetes is still "experimental" > > > Key: SPARK-23104 > URL: https://issues.apache.org/jira/browse/SPARK-23104 > Project: Spark > Issue Type: Task > Components: Documentation, Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Anirudh Ramanathan >Priority: Critical > Fix For: 2.3.0 > > > As discussed in the mailing list, we should document that the kubernetes > backend is still experimental. > That does not need to include any code changes. This is just meant to tell > users that they can expect changes in how the backend behaves in future > versions, and that things like configuration, the container image's layout > and entry points might change. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16617) Upgrade to Avro 1.8.x
[ https://issues.apache.org/jira/browse/SPARK-16617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-16617: -- Assignee: (was: Marcelo Vanzin) > Upgrade to Avro 1.8.x > - > > Key: SPARK-16617 > URL: https://issues.apache.org/jira/browse/SPARK-16617 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 2.1.0 >Reporter: Ben McCann >Priority: Major > > Avro 1.8 makes Avro objects serializable so that you can easily have an RDD > containing Avro objects. > See https://issues.apache.org/jira/browse/AVRO-1502 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16617) Upgrade to Avro 1.8.x
[ https://issues.apache.org/jira/browse/SPARK-16617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-16617: -- Assignee: Marcelo Vanzin > Upgrade to Avro 1.8.x > - > > Key: SPARK-16617 > URL: https://issues.apache.org/jira/browse/SPARK-16617 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 2.1.0 >Reporter: Ben McCann >Assignee: Marcelo Vanzin >Priority: Major > > Avro 1.8 makes Avro objects serializable so that you can easily have an RDD > containing Avro objects. > See https://issues.apache.org/jira/browse/AVRO-1502 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23158) Move HadoopFsRelationTest test suites to from sql/hive to sql/core
[ https://issues.apache.org/jira/browse/SPARK-23158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23158: Assignee: Xiao Li (was: Apache Spark) > Move HadoopFsRelationTest test suites to from sql/hive to sql/core > -- > > Key: SPARK-23158 > URL: https://issues.apache.org/jira/browse/SPARK-23158 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23158) Move HadoopFsRelationTest test suites to from sql/hive to sql/core
[ https://issues.apache.org/jira/browse/SPARK-23158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332570#comment-16332570 ] Apache Spark commented on SPARK-23158: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/20331 > Move HadoopFsRelationTest test suites to from sql/hive to sql/core > -- > > Key: SPARK-23158 > URL: https://issues.apache.org/jira/browse/SPARK-23158 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23158) Move HadoopFsRelationTest test suites to from sql/hive to sql/core
[ https://issues.apache.org/jira/browse/SPARK-23158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23158: Assignee: Apache Spark (was: Xiao Li) > Move HadoopFsRelationTest test suites to from sql/hive to sql/core > -- > > Key: SPARK-23158 > URL: https://issues.apache.org/jira/browse/SPARK-23158 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23158) Move HadoopFsRelationTest test suites to from sql/hive to sql/core
Xiao Li created SPARK-23158: --- Summary: Move HadoopFsRelationTest test suites to from sql/hive to sql/core Key: SPARK-23158 URL: https://issues.apache.org/jira/browse/SPARK-23158 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Xiao Li Assignee: Xiao Li -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23149) polish ColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-23149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23149. - Resolution: Fixed Fix Version/s: 2.3.0 > polish ColumnarBatch > > > Key: SPARK-23149 > URL: https://issues.apache.org/jira/browse/SPARK-23149 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet
Tomasz Bartczak created SPARK-23157: --- Summary: withColumn fails for a column that is a result of mapped DataSet Key: SPARK-23157 URL: https://issues.apache.org/jira/browse/SPARK-23157 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.1 Reporter: Tomasz Bartczak Having {code:java} case class R(id: String) val ds = spark.createDataset(Seq(R("1"))) {code} This works: {code} scala> ds.withColumn("n", ds.col("id")) res16: org.apache.spark.sql.DataFrame = [id: string, n: string] {code} but when we map over ds it fails: {code} scala> ds.withColumn("n", ds.map(a => a).col("id")) org.apache.spark.sql.AnalysisException: resolved attribute(s) id#55 missing from id#4 in operator !Project [id#4, id#55 AS n#57];; !Project [id#4, id#55 AS n#57] +- LocalRelation [id#4] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2884) at org.apache.spark.sql.Dataset.select(Dataset.scala:1150) at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1905) ... 
48 elided {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23156) Code of method "initialize(I)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-23156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332434#comment-16332434 ] Marco Gaido commented on SPARK-23156: - [~kzawisto] a lot of work has been done on this; fixes are in the 2.2 maintenance releases and more will land in 2.3 (too many tickets to list). Please try to reproduce on current master, but I am quite sure this is a duplicate of several similar tickets and it will work. Thanks. > Code of method "initialize(I)V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" > grows beyond 64 KB > > > Key: SPARK-23156 > URL: https://issues.apache.org/jira/browse/SPARK-23156 > Project: Spark > Issue Type: Bug > Components: Spark Submit, SQL >Affects Versions: 2.1.1, 2.1.2 > Environment: Ubuntu 16.04, Scala 2.11, Java 8, 8-node YARN cluster. >Reporter: Krystian Zawistowski >Priority: Major > > I am getting this trying to generate a random DataFrame (300 columns, 5000 > rows, Ints, Floats and Timestamps in equal ratios). This is similar (but not > identical) to SPARK-18492 and few tickets more that should be done in 2.1.1. > Part of the logs below. They contain hundreds of millions of lines of > generated code, apparently for each of the 1500 000 fields of the dataframe > which is very suspicious. 
> {code:java} > 18/01/19 06:33:15 INFO CodeGenerator: Code generated in 246.168393 ms$ > 18/01/19 06:33:21 ERROR CodeGenerator: failed to compile: > org.codehaus.janino.JaninoRuntimeException: Code of method "initialize(I)V" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" > grows beyond 64 KB$ > /* 001 */ public java.lang.Object generate(Object[] references) {$ > /* 002 */ return new SpecificUnsafeProjection(references);$ > /* 003 */ }$ > /* 004 */$ > /* 005 */ class SpecificUnsafeProjection extends > org.apache.spark.sql.catalyst.expressions.UnsafeProjection {$ > /* 006 */$ > /* 007 */ private Object[] references;$ > /* 008 */ private org.apache.spark.util.random.XORShiftRandom rng;$ > /* 009 */ private org.apache.spark.util.random.XORShiftRandom rng1;$ > /* 010 */ private org.apache.spark.util.random.XORShiftRandom rng2;$ > /* 011 */ private org.apache.spark.util.random.XORShiftRandom rng3;$ > /* 012 */ private org.apache.spark.util.random.XORShiftRandom rng4;$ > {code} > Reproduction: > {code:java} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.{Column, DataFrame, SparkSession} > class RandomData(val numberOfColumns: Int, val numberOfRows: Int) extends > Serializable { > private val minEpoch = Timestamp.valueOf("1800-01-01 00:00:00").getTime > private val maxEpoch = Timestamp.valueOf("2200-01-01 00:00:00").getTime > val idColumn = "id" > import org.apache.spark.sql.functions._ > def generateData(path: String): Unit = { > val spark: SparkSession = SparkSession.builder().getOrCreate() > materializeTable(spark).write.parquet(path + "/source") > } > private def materializeTable(spark: SparkSession): DataFrame = { > var sourceDF = spark.sqlContext.range(0, > numberOfRows).withColumnRenamed("id", > idColumn) > val columns = sourceDF(idColumn) +: (0 until numberOfColumns) > .flatMap(x => Seq(getTimeColumn(x), getNumberColumn(x), > getCategoryColumn(x))) > sourceDF.select(columns: _*) > } > private 
def getTimeColumn(seed: Int): Column = { > val uniqueSeed = seed + numberOfColumns * 3 > rand(seed = uniqueSeed) >.multiply(maxEpoch - minEpoch) >.divide(1000).cast("long") >.plus(minEpoch / 1000).cast(TimestampType).alias(s"time$seed") > } > private def getNumberColumn(seed: Int, namePrefix: String = "number"): Column > = { > val uniqueSeed = seed + numberOfColumns * 4 > randn(seed = uniqueSeed).alias(s"$namePrefix$seed") > } > private def getCategoryColumn(seed: Int): Column = { > val uniqueSeed = seed + numberOfColumns * 4 > rand(seed = uniqueSeed).multiply(100).cast("int").alias(s"category$seed") > } > } > object GenerateData{ > def main(args: Array[String]): Unit = { > new RandomData(args(0).toInt, args(1).toInt).generateData(args(2)) > } > } > {code} > Please package a jar and run as follows: > {code:java} > spark-submit --master yarn \ > --driver-memory 12g \ > --executor-memory 12g \ > --deploy-mode cluster \ > --class GenerateData \ > --master yarn \ > 100 5000 "hdfs:///tmp/parquet" > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23085) API parity for mllib.linalg.Vectors.sparse
[ https://issues.apache.org/jira/browse/SPARK-23085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23085. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20275 [https://github.com/apache/spark/pull/20275] > API parity for mllib.linalg.Vectors.sparse > --- > > Key: SPARK-23085 > URL: https://issues.apache.org/jira/browse/SPARK-23085 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 2.4.0 > > > Both {{ML.Vectors#sparse(size: Int, indices: Array[Int], values: Array[Double])}} and {{ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)])}} support > zero-length vectors. > In old MLLib, > {{MLLib.Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double])}} also supports it. > However, > {{MLLib.Vectors.sparse(size: Int, elements: Seq[(Int, Double)])}} requires a > positive length. > > {code:java} > scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[Int], > Array.empty[Double]) > res15: org.apache.spark.ml.linalg.Vector = (0,[],[]) > scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[(Int, > Double)]) > res16: org.apache.spark.ml.linalg.Vector = (0,[],[]) > scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[Int], > Array.empty[Double]) > res17: org.apache.spark.mllib.linalg.Vector = (0,[],[]) > scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[(Int, > Double)]) > java.lang.IllegalArgumentException: requirement failed: The size of the > requested sparse vector must be greater than 0. 
> at scala.Predef$.require(Predef.scala:224) > at org.apache.spark.mllib.linalg.Vectors$.sparse(Vectors.scala:315) > ... 50 elided > > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
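The parity fix amounts to relaxing the size check so that zero-length sparse vectors are accepted and only negative sizes are rejected. A hedged Python sketch of that validation logic (illustrative only; not Spark's actual implementation, and `sparse` here is a standalone helper):

```python
# Sketch of the SPARK-23085 behavior: size == 0 yields a valid empty sparse
# vector; only a negative size is an error. Returns a plain dict for clarity.
def sparse(size, indices, values):
    if size < 0:
        raise ValueError("size must be non-negative, got %d" % size)
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    if any(i < 0 or i >= size for i in indices):
        raise ValueError("indices must lie in [0, size)")
    return {"size": size, "indices": list(indices), "values": list(values)}

empty = sparse(0, [], [])            # zero-length vector is now valid
vec = sparse(3, [0, 2], [1.0, 2.0])  # ordinary sparse vector still works
```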
[jira] [Assigned] (SPARK-23085) API parity for mllib.linalg.Vectors.sparse
[ https://issues.apache.org/jira/browse/SPARK-23085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-23085: - Assignee: zhengruifeng > API parity for mllib.linalg.Vectors.sparse > --- > > Key: SPARK-23085 > URL: https://issues.apache.org/jira/browse/SPARK-23085 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > > Both {{ML.Vectors#sparse(size: Int, indices: Array[Int], values: Array[Double])}} and {{ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)])}} support > zero-length vectors. > In old MLLib, > {{MLLib.Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double])}} also supports it. > However, > {{MLLib.Vectors.sparse(size: Int, elements: Seq[(Int, Double)])}} requires a > positive length. > > {code:java} > scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[Int], > Array.empty[Double]) > res15: org.apache.spark.ml.linalg.Vector = (0,[],[]) > scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[(Int, > Double)]) > res16: org.apache.spark.ml.linalg.Vector = (0,[],[]) > scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[Int], > Array.empty[Double]) > res17: org.apache.spark.mllib.linalg.Vector = (0,[],[]) > scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[(Int, > Double)]) > java.lang.IllegalArgumentException: requirement failed: The size of the > requested sparse vector must be greater than 0. > at scala.Predef$.require(Predef.scala:224) > at org.apache.spark.mllib.linalg.Vectors$.sparse(Vectors.scala:315) > ... 
50 elided > > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23156) Code of method "initialize(I)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-23156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krystian Zawistowski updated SPARK-23156: - Description: I am getting this while trying to generate a random DataFrame (300 columns, 5000 rows; Ints, Floats and Timestamps in equal ratios). This is similar (but not identical) to SPARK-18492 and a few more tickets that should have been fixed in 2.1.1. Part of the logs is below. The full logs contain hundreds of millions of lines of generated code, apparently one block per each of the 1,500,000 fields of the DataFrame, which is very suspicious.

{code:java}
18/01/19 06:33:15 INFO CodeGenerator: Code generated in 246.168393 ms
18/01/19 06:33:21 ERROR CodeGenerator: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "initialize(I)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificUnsafeProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private org.apache.spark.util.random.XORShiftRandom rng;
/* 009 */   private org.apache.spark.util.random.XORShiftRandom rng1;
/* 010 */   private org.apache.spark.util.random.XORShiftRandom rng2;
/* 011 */   private org.apache.spark.util.random.XORShiftRandom rng3;
/* 012 */   private org.apache.spark.util.random.XORShiftRandom rng4;
{code}

Reproduction (note that java.sql.Timestamp must be imported for Timestamp.valueOf):

{code:java}
import java.sql.Timestamp

import org.apache.spark.sql.types._
import org.apache.spark.sql.{Column, DataFrame, SparkSession}

class RandomData(val numberOfColumns: Int, val numberOfRows: Int) extends Serializable {
  private val minEpoch = Timestamp.valueOf("1800-01-01 00:00:00").getTime
  private val maxEpoch = Timestamp.valueOf("2200-01-01 00:00:00").getTime
  val idColumn = "id"

  import org.apache.spark.sql.functions._

  def generateData(path: String): Unit = {
    val spark: SparkSession = SparkSession.builder().getOrCreate()
    materializeTable(spark).write.parquet(path + "/source")
  }

  private def materializeTable(spark: SparkSession): DataFrame = {
    val sourceDF = spark.sqlContext.range(0, numberOfRows).withColumnRenamed("id", idColumn)
    val columns = sourceDF(idColumn) +: (0 until numberOfColumns)
      .flatMap(x => Seq(getTimeColumn(x), getNumberColumn(x), getCategoryColumn(x)))
    sourceDF.select(columns: _*)
  }

  private def getTimeColumn(seed: Int): Column = {
    val uniqueSeed = seed + numberOfColumns * 3
    rand(seed = uniqueSeed)
      .multiply(maxEpoch - minEpoch)
      .divide(1000).cast("long")
      .plus(minEpoch / 1000).cast(TimestampType).alias(s"time$seed")
  }

  private def getNumberColumn(seed: Int, namePrefix: String = "number"): Column = {
    val uniqueSeed = seed + numberOfColumns * 4
    randn(seed = uniqueSeed).alias(s"$namePrefix$seed")
  }

  private def getCategoryColumn(seed: Int): Column = {
    val uniqueSeed = seed + numberOfColumns * 4
    rand(seed = uniqueSeed).multiply(100).cast("int").alias(s"category$seed")
  }
}

object GenerateData {
  def main(args: Array[String]): Unit = {
    new RandomData(args(0).toInt, args(1).toInt).generateData(args(2))
  }
}
{code}

Please package a jar and run as follows, substituting the path to the packaged jar:

{code:java}
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 12g \
  --executor-memory 12g \
  --class GenerateData \
  <path-to-packaged-jar> \
  100 5000 "hdfs:///tmp/parquet"
{code}

was: I am getting this while trying to generate a random DataFrame (300 columns, 5000 rows; Ints, Floats and Timestamps in equal ratios). This is similar (but not identical) to SPARK-18492 and a few more tickets that should have been fixed in 2.1.1. Part of the logs is below. The full logs contain hundreds of millions of lines of generated code, apparently one block per each of the 1,500,000 fields of the DataFrame, which is very suspicious. 
{code:java} 18/01/19 06:33:15 INFO CodeGenerator: Code generated in 246.168393 ms$ 18/01/19 06:33:21 ERROR CodeGenerator: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "initialize(I)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB$ /* 001 */ public java.lang.Object generate(Object[] references) {$ /* 002 */ return new SpecificUnsafeProjection(references);$ /* 003 */ }$ /* 004 */$ /* 005 */ class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {$ /* 006 */$ /* 007 */ private Object[] references;$ /* 008 */ private org.apache.spark.util.random.XORShiftRandom rng;$ /* 009 */ private org.apache.spark.util.random.XORShiftRandom rng1;$ /* 010 */ private org.apache.spark.util.random.XORShiftRandom rng2;$ /* 011 */ private org.apache.spark.util.random.XORShiftRandom rng3;$ /* 012 */ private org.apache.spark.util.random.XORShiftRandom rng4;$ {code} Reproduction:
[jira] [Updated] (SPARK-23156) Code of method "initialize(I)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-23156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krystian Zawistowski updated SPARK-23156: - Description: I am getting this while trying to generate a random DataFrame (300 columns, 5000 rows; Ints, Floats and Timestamps in equal ratios). This is similar (but not identical) to SPARK-18492 and a few more tickets that should have been fixed in 2.1.1. Part of the logs is below. The full logs contain hundreds of millions of lines of generated code, apparently one block per each of the 1,500,000 fields of the DataFrame, which is very suspicious.

{code:java}
18/01/19 06:33:15 INFO CodeGenerator: Code generated in 246.168393 ms
18/01/19 06:33:21 ERROR CodeGenerator: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "initialize(I)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificUnsafeProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private org.apache.spark.util.random.XORShiftRandom rng;
/* 009 */   private org.apache.spark.util.random.XORShiftRandom rng1;
/* 010 */   private org.apache.spark.util.random.XORShiftRandom rng2;
/* 011 */   private org.apache.spark.util.random.XORShiftRandom rng3;
/* 012 */   private org.apache.spark.util.random.XORShiftRandom rng4;
{code}

Reproduction: {code:java} import org.apache.spark.sql.types._ import org.apache.spark.sql.{Column, DataFrame, SparkSession} class RandomData(val numberOfColumns: Int, val numberOfRows: Int) extends Serializable { private val minEpoch = Timestamp.valueOf("1800-01-01 00:00:00").getTime private val maxEpoch = Timestamp.valueOf("2200-01-01 00:00:00").getTime val idColumn = "id" import org.apache.spark.sql.functions._ def 
generateFeatureLearningData(path: String): Unit = { val spark: SparkSession = SparkSession.builder().getOrCreate() materializeSourceFeatureLearningTable(spark).write.parquet(path + "/source") materializeTargetTable(spark).write.parquet(path + "/target") } def generateModelLearningData(path: String): Unit = { val spark: SparkSession = SparkSession.builder().getOrCreate() materializeTargetTable(spark).write.parquet(path + "/target") materializeSourceModelLearningTable(spark).write.parquet(path + "/source") } private def materializeSourceFeatureLearningTable(spark: SparkSession): DataFrame = { var sourceDF = spark.sqlContext.range(0, numberOfRows).withColumnRenamed("id", idColumn) val columns = sourceDF(idColumn) +: (0 until numberOfColumns) .flatMap(x => Seq(getTimeColumn(x), getNumberColumn(x), getCategoryColumn(x))) sourceDF.select(columns: _*) } private def getTimeColumn(seed: Int): Column = { val uniqueSeed = seed + numberOfColumns * 3 rand(seed = uniqueSeed).multiply(maxEpoch - minEpoch).divide(1000).cast("long").plus(minEpoch / 1000).cast(TimestampType).alias(s"time$seed") } private def getNumberColumn(seed: Int, namePrefix: String = "number"): Column = { val uniqueSeed = seed + numberOfColumns * 4 randn(seed = uniqueSeed).alias(s"$namePrefix$seed") } private def getCategoryColumn(seed: Int): Column = { val uniqueSeed = seed + numberOfColumns * 4 rand(seed = uniqueSeed).multiply(100).cast("int").alias(s"category$seed") } } object GenerateData{ def main(args: Array[String]): Unit = { new RandomData(args(0).toInt, args(1).toInt).generateFeatureLearningData(args(2)) } } {code} Please package a jar and run as follows: {code:java} spark-submit --master yarn \ --driver-memory 12g \ --executor-memory 12g \ --deploy-mode cluster \ --class GenerateData \ --master yarn \ 100 5000 "hdfs:///tmp/parquet" {code} was: I am getting this trying to generate a random DataFrame (300 columns, 5000 rows, Ints, Floats and Timestamps in equal ratios). 
This is similar (but not identical) to SPARK-18492 and few tickets more that should be done in 2.1.1. Part of the logs below. They contain hundreds of millions of lines of generated code, apparently for each of the 1500 000 fields of the dataframe which is very suspicious. {code:java} 18/01/19 06:33:15 INFO CodeGenerator: Code generated in 246.168393 ms$ 18/01/19 06:33:21 ERROR CodeGenerator: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "initialize(I)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB$ /* 001 */ public java.lang.Object generate(Object[] references) {$ /* 002 */ return new SpecificUnsafeProjection(references);$ /* 003 */ }$ /* 004 */$ /* 005 */ class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {$ /* 006 */$ /* 007 */ private
[jira] [Created] (SPARK-23156) Code of method "initialize(I)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
Krystian Zawistowski created SPARK-23156: Summary: Code of method "initialize(I)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB Key: SPARK-23156 URL: https://issues.apache.org/jira/browse/SPARK-23156 Project: Spark Issue Type: Bug Components: Spark Submit, SQL Affects Versions: 2.1.2, 2.1.1 Environment: Ubuntu 16.04, Scala 2.11, Java 8, 8-node YARN cluster. Reporter: Krystian Zawistowski I am getting this while trying to generate a random DataFrame (300 columns, 5000 rows; Ints, Floats and Timestamps in equal ratios). This is similar (but not identical) to SPARK-18492 and a few more tickets that should have been fixed in 2.1.1. Part of the logs is below. The full logs contain hundreds of millions of lines of generated code, apparently one block per each of the 1,500,000 fields of the DataFrame, which is very suspicious.

{code:java}
18/01/19 06:33:15 INFO CodeGenerator: Code generated in 246.168393 ms
18/01/19 06:33:21 ERROR CodeGenerator: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "initialize(I)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificUnsafeProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private org.apache.spark.util.random.XORShiftRandom rng;
/* 009 */   private org.apache.spark.util.random.XORShiftRandom rng1;
/* 010 */   private org.apache.spark.util.random.XORShiftRandom rng2;
/* 011 */   private org.apache.spark.util.random.XORShiftRandom rng3;
/* 012 */   private org.apache.spark.util.random.XORShiftRandom rng4;
{code}

Reproduction: {code:java} import org.apache.spark.sql.types._ import org.apache.spark.sql.{Column, DataFrame, SparkSession} class RandomData(val 
numberOfColumns: Int, val numberOfRows: Int) extends Serializable { private val minEpoch = Timestamp.valueOf("1800-01-01 00:00:00").getTime private val maxEpoch = Timestamp.valueOf("2200-01-01 00:00:00").getTime val idColumn = "id" import org.apache.spark.sql.functions._ def generateFeatureLearningData(path: String): Unit = { val spark: SparkSession = SparkSession.builder().getOrCreate() materializeSourceFeatureLearningTable(spark).write.parquet(path + "/source") materializeTargetTable(spark).write.parquet(path + "/target") } def generateModelLearningData(path: String): Unit = { val spark: SparkSession = SparkSession.builder().getOrCreate() materializeTargetTable(spark).write.parquet(path + "/target") materializeSourceModelLearningTable(spark).write.parquet(path + "/source") } private def materializeSourceFeatureLearningTable(spark: SparkSession): DataFrame = { var sourceDF = spark.sqlContext.range(0, numberOfRows).withColumnRenamed("id", idColumn) val columns = sourceDF(idColumn) +: (0 until numberOfColumns) .flatMap(x => Seq(getTimeColumn(x), getNumberColumn(x), getCategoryColumn(x))) sourceDF.select(columns: _*) } private def getTimeColumn(seed: Int): Column = { val uniqueSeed = seed + numberOfColumns * 3 rand(seed = uniqueSeed).multiply(maxEpoch - minEpoch).divide(1000).cast("long").plus(minEpoch / 1000).cast(TimestampType).alias(s"time$seed") } private def getNumberColumn(seed: Int, namePrefix: String = "number"): Column = { val uniqueSeed = seed + numberOfColumns * 4 randn(seed = uniqueSeed).alias(s"$namePrefix$seed") } private def getCategoryColumn(seed: Int): Column = { val uniqueSeed = seed + numberOfColumns * 4 rand(seed = uniqueSeed).multiply(100).cast("int").alias(s"category$seed") } } object GenerateData{ def main(args: Array[String]): Unit = { new RandomData(args(0).toInt, args(1).toInt).generateFeatureLearningData(args(2)) } } {code} Please package a jar and run as follows: {code} spark-submit --master yarn --driver-memory 12g --executor-memory 12g 
--deploy-mode cluster --class GenerateData <path-to-packaged-jar> 100 5000 "hdfs:///tmp/parquet" {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
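The failure above stems from the JVM's hard 64 KB limit on a single method's bytecode: codegen emits one initializer per field into initialize(I)V, and hundreds of generated columns overflow it. The mitigation Spark's code generator applies in similar cases (CodeGenerator.splitExpressions) is to split the generated statements across small helper methods. A minimal sketch of that idea in plain Python (the helper names and chunk size are illustrative, not Spark's actual codegen):

```python
# Hypothetical sketch of the "split into helper methods" idea: instead of
# emitting every generated field initializer into a single initialize()
# method -- which is what exceeds the JVM's 64 KB bytecode-per-method limit
# here -- group the statements into chunks and emit one small helper per chunk.

def split_into_methods(statements, max_per_method):
    """Group generated Java statements into bounded-size helper method bodies."""
    methods = []
    for i in range(0, len(statements), max_per_method):
        chunk = statements[i:i + max_per_method]
        body = "\n  ".join(chunk)
        methods.append(f"private void init_{i // max_per_method}() {{\n  {body}\n}}")
    return methods

# 1,500,000 initializers in one method blow the limit; chunked, each helper
# stays far below it, and initialize() just calls init_0(), init_1(), ... in order.
stmts = [f"rng{i} = new XORShiftRandom(seed + {i});" for i in range(10)]
methods = split_into_methods(stmts, 4)
```

Each emitted helper stays far below the 64 KB ceiling regardless of how many fields the projection has.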
[jira] [Updated] (SPARK-22577) executor page blacklist status should update with TaskSet level blacklisting
[ https://issues.apache.org/jira/browse/SPARK-22577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-22577: --- Attachment: node_blacklisting_for_stage.png > executor page blacklist status should update with TaskSet level blacklisting > > > Key: SPARK-22577 > URL: https://issues.apache.org/jira/browse/SPARK-22577 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.1 >Reporter: Thomas Graves >Priority: Major > Attachments: app_blacklisting.png, node_blacklisting_for_stage.png, > stage_blacklisting.png > > > right now the executor blacklist status only updates with the > BlacklistTracker after a task set has finished and propagated the > blacklisting to the application level. We should change that to show at the > taskset level as well. Without this it can be very confusing to the user why > things aren't running. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23024) Spark ui about the contents of the form need to have hidden and show features, when the table records very much.
[ https://issues.apache.org/jira/browse/SPARK-23024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-23024: - Assignee: guoxiaolongzte > Spark ui about the contents of the form need to have hidden and show > features, when the table records very much. > - > > Key: SPARK-23024 > URL: https://issues.apache.org/jira/browse/SPARK-23024 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.0 >Reporter: guoxiaolongzte >Assignee: guoxiaolongzte >Priority: Minor > Fix For: 2.4.0 > > Attachments: 1.png, 2.png > > > Spark ui about the contents of the form need to have hidden and show > features, when the table records very much. Because sometimes you do not care > about the record of the table, you just want to see the contents of the next > table, but you have to scroll the scroll bar for a long time to see the > contents of the next table. > Currently we have about 500 workers, but I just wanted to see the logs for > the running applications table. I had to scroll through the scroll bars for a > long time to see the logs for the running applications table. > In order to ensure functional consistency, I modified the Master Page, Worker > Page, Job Page, Stage Page, Task Page, Configuration Page, Storage Page, Pool > Page. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23024) Spark ui about the contents of the form need to have hidden and show features, when the table records very much.
[ https://issues.apache.org/jira/browse/SPARK-23024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23024. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20216 [https://github.com/apache/spark/pull/20216] > Spark ui about the contents of the form need to have hidden and show > features, when the table records very much. > - > > Key: SPARK-23024 > URL: https://issues.apache.org/jira/browse/SPARK-23024 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.0 >Reporter: guoxiaolongzte >Assignee: guoxiaolongzte >Priority: Minor > Fix For: 2.4.0 > > Attachments: 1.png, 2.png > > > Spark ui about the contents of the form need to have hidden and show > features, when the table records very much. Because sometimes you do not care > about the record of the table, you just want to see the contents of the next > table, but you have to scroll the scroll bar for a long time to see the > contents of the next table. > Currently we have about 500 workers, but I just wanted to see the logs for > the running applications table. I had to scroll through the scroll bars for a > long time to see the logs for the running applications table. > In order to ensure functional consistency, I modified the Master Page, Worker > Page, Job Page, Stage Page, Task Page, Configuration Page, Storage Page, Pool > Page. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
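The hide/show feature described above boils down to tracking a collapsed flag per table and flipping it on click, so a user can skip past a 500-row workers table. The real change is JavaScript in the Spark Web UI pages; the state logic can be sketched in plain Python (names hypothetical):

```python
# Illustrative sketch only, not the actual Spark Web UI code: each table on a
# page (Master, Worker, Job, Stage, ...) carries a collapsed flag, and the
# show/hide control simply toggles that flag for the clicked table.

def toggle(collapsed, table_id):
    """Return a new collapsed-state map with table_id's flag flipped."""
    new_state = dict(collapsed)
    new_state[table_id] = not collapsed.get(table_id, False)
    return new_state

s1 = toggle({}, "running-applications")   # table collapsed
s2 = toggle(s1, "running-applications")   # visible again
```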
[jira] [Commented] (SPARK-23121) When the Spark Streaming app is running for a period of time, the page is incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' and ui can not be
[ https://issues.apache.org/jira/browse/SPARK-23121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332300#comment-16332300 ] Sandor Murakozi commented on SPARK-23121: - One issue is with displaying old jobs. Depending on how old a job is it may or may not be displayed correctly. The bigger issue is that the main jobs page can also be affected. > When the Spark Streaming app is running for a period of time, the page is > incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' > and ui can not be accessed. > - > > Key: SPARK-23121 > URL: https://issues.apache.org/jira/browse/SPARK-23121 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: guoxiaolongzte >Priority: Major > Attachments: 1.png, 2.png > > > When the Spark Streaming app is running for a period of time, the page is > incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' > and ui can not be accessed. > > Test command: > ./bin/spark-submit --class org.apache.spark.examples.streaming.HdfsWordCount > ./examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar /spark > > The app is running for a period of time, ui can not be accessed, please see > attachment. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23121) When the Spark Streaming app is running for a period of time, the page is incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' and ui can not be
[ https://issues.apache.org/jira/browse/SPARK-23121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332294#comment-16332294 ] Sean Owen commented on SPARK-23121: --- Yes that sounds right. But doesn't it just cause an error when displaying pages for old jobs? it would be an 'error' of some kind no matter what, whether a 404 or "not found" message. It can be improved but didn't sound like it mattered beyond that. > When the Spark Streaming app is running for a period of time, the page is > incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' > and ui can not be accessed. > - > > Key: SPARK-23121 > URL: https://issues.apache.org/jira/browse/SPARK-23121 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: guoxiaolongzte >Priority: Major > Attachments: 1.png, 2.png > > > When the Spark Streaming app is running for a period of time, the page is > incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' > and ui can not be accessed. > > Test command: > ./bin/spark-submit --class org.apache.spark.examples.streaming.HdfsWordCount > ./examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar /spark > > The app is running for a period of time, ui can not be accessed, please see > attachment. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23154) Document backwards compatibility guarantees for ML persistence
[ https://issues.apache.org/jira/browse/SPARK-23154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332252#comment-16332252 ] Nick Pentreath commented on SPARK-23154: SGTM > Document backwards compatibility guarantees for ML persistence > -- > > Key: SPARK-23154 > URL: https://issues.apache.org/jira/browse/SPARK-23154 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Major > > We have (as far as I know) maintained backwards compatibility for ML > persistence, but this is not documented anywhere. I'd like us to document it > (for spark.ml, not for spark.mllib). > I'd recommend something like: > {quote} > In general, MLlib maintains backwards compatibility for ML persistence. > I.e., if you save an ML model or Pipeline in one version of Spark, then you > should be able to load it back and use it in a future version of Spark. > However, there are rare exceptions, described below. > Model persistence: Is a model or Pipeline saved using Apache Spark ML > persistence in Spark version X loadable by Spark version Y? > * Major versions: No guarantees, but best-effort. > * Minor and patch versions: Yes; these are backwards compatible. > * Note about the format: There are no guarantees for a stable persistence > format, but model loading itself is designed to be backwards compatible. > Model behavior: Does a model or Pipeline in Spark version X behave > identically in Spark version Y? > * Major versions: No guarantees, but best-effort. > * Minor and patch versions: Identical behavior, except for bug fixes. > For both model persistence and model behavior, any breaking changes across a > minor version or patch version are reported in the Spark version release > notes. If a breakage is not reported in release notes, then it should be > treated as a bug to be fixed. > {quote} > How does this sound? 
> Note: We unfortunately don't have tests for backwards compatibility (which > has technical hurdles and can be discussed in [SPARK-15573]). However, we > have made efforts to maintain it during PR review and Spark release QA, and > most users expect it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23121) When the Spark Streaming app is running for a period of time, the page is incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' and ui can not be
[ https://issues.apache.org/jira/browse/SPARK-23121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332249#comment-16332249 ] Sandor Murakozi commented on SPARK-23121: - [~guoxiaolongzte] found two separate problems, bot triggered by having a high number of jobs/stages. In such a situation the store of the history server drops various objects to save memory. It may happen that the job itself is in the store, but its stages or the RDDOperationGraph are not. In such cases rendering of the all jobs and the job pages fails. As a consequence, the jobs page may become inaccessible if the cluster processes many jobs, so I think the priority of this issue should be increased. What do you think [~srowen] ? > When the Spark Streaming app is running for a period of time, the page is > incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' > and ui can not be accessed. > - > > Key: SPARK-23121 > URL: https://issues.apache.org/jira/browse/SPARK-23121 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: guoxiaolongzte >Priority: Major > Attachments: 1.png, 2.png > > > When the Spark Streaming app is running for a period of time, the page is > incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' > and ui can not be accessed. > > Test command: > ./bin/spark-submit --class org.apache.spark.examples.streaming.HdfsWordCount > ./examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar /spark > > The app is running for a period of time, ui can not be accessed, please see > attachment. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23121) When the Spark Streaming app is running for a period of time, the page is incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' and ui can not be
[ https://issues.apache.org/jira/browse/SPARK-23121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332215#comment-16332215 ] Apache Spark commented on SPARK-23121: -- User 'smurakozi' has created a pull request for this issue: https://github.com/apache/spark/pull/20330 > When the Spark Streaming app is running for a period of time, the page is > incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' > and ui can not be accessed. > - > > Key: SPARK-23121 > URL: https://issues.apache.org/jira/browse/SPARK-23121 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: guoxiaolongzte >Priority: Major > Attachments: 1.png, 2.png > > > When the Spark Streaming app is running for a period of time, the page is > incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' > and ui can not be accessed. > > Test command: > ./bin/spark-submit --class org.apache.spark.examples.streaming.HdfsWordCount > ./examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar /spark > > The app is running for a period of time, ui can not be accessed, please see > attachment. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
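The failure mode described in this thread is a store that evicts old stages to bound memory while the job object survives, so page rendering hits a missing key and throws, taking the whole /jobs/ page down with it. The defensive-lookup shape of a fix can be sketched as follows (types and messages hypothetical; the actual patch is in the pull request above):

```python
# Hypothetical sketch: the UI store may have evicted a stage that a surviving
# job still references. Rendering must treat a missing stage as "evicted"
# rather than raise, so one old job cannot make the jobs page inaccessible.

def describe_stage(store, stage_id):
    """Render a stage row, degrading gracefully if the store evicted it."""
    stage = store.get(stage_id)
    if stage is None:
        return f"Stage {stage_id}: details evicted from the UI store"
    return f"Stage {stage_id}: {stage['name']}"

store = {13: {"name": "count at HdfsWordCount.scala:52"}}
ok = describe_stage(store, 13)     # stage still present
gone = describe_stage(store, 12)   # stage evicted; no exception raised
```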
[jira] [Commented] (SPARK-15467) Getting stack overflow when attempting to query a wide Dataset (>200 fields)
[ https://issues.apache.org/jira/browse/SPARK-15467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332211#comment-16332211 ] Artem Kalchenko commented on SPARK-15467: - I guess I'm still experiencing this issue with Spark 2.2 {noformat} 18/01/19 12:32:28 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf. Exception in thread "main" java.lang.StackOverflowError at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:370) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541){noformat} > Getting stack overflow when attempting to query a wide Dataset (>200 fields) > > > Key: SPARK-15467 > URL: https://issues.apache.org/jira/browse/SPARK-15467 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Don Drake >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.1.0 > > > This can be duplicated in a spark-shell, I am running Spark 2.0.0-preview. 
> {code} > import spark.implicits._ > case class Wide( > val f0:String = "", > val f1:String = "", > val f2:String = "", > val f3:String = "", > val f4:String = "", > val f5:String = "", > val f6:String = "", > val f7:String = "", > val f8:String = "", > val f9:String = "", > val f10:String = "", > val f11:String = "", > val f12:String = "", > val f13:String = "", > val f14:String = "", > val f15:String = "", > val f16:String = "", > val f17:String = "", > val f18:String = "", > val f19:String = "", > val f20:String = "", > val f21:String = "", > val f22:String = "", > val f23:String = "", > val f24:String = "", > val f25:String = "", > val f26:String = "", > val f27:String = "", > val f28:String = "", > val f29:String = "", > val f30:String = "", > val f31:String = "", > val f32:String = "", > val f33:String = "", > val f34:String = "", > val f35:String = "", > val f36:String = "", > val f37:String = "", > val f38:String = "", > val f39:String = "", > val f40:String = "", > val f41:String = "", > val f42:String = "", > val f43:String = "", > val f44:String = "", > val f45:String = "", > val f46:String = "", > val f47:String = "", > val f48:String = "", > val f49:String = "", > val f50:String = "", > val f51:String = "", > val f52:String = "", > val f53:String = "", > val f54:String = "", > val f55:String = "", > val f56:String = "", > val f57:String = "", > val f58:String = "", > val f59:String = "", > val f60:String = "", > val f61:String = "", > val f62:String = "", > val f63:String = "", > val f64:String = "", > val f65:String = "", > val f66:String = "", > val f67:String = "", > val f68:String = "", > val f69:String = "", > val f70:String = "", > val f71:String = "", > val f72:String = "", > val f73:String = "", > val f74:String = "", > val f75:String = "", > val f76:String = "", > val f77:String = "", > val f78:String = "", > val f79:String = "", > val f80:String = "", > val f81:String = "", > val f82:String = "", > val f83:String = "", > val f84:String = "", 
> val f85:String = "", > val f86:String = "", > val f87:String = "", > val f88:String = "", > val f89:String = "", > val f90:String = "", > val f91:String = "", > val f92:String = "", > val f93:String = "", > val f94:String = "", > val f95:String = "", > val f96:String = "", > val f97:String = "", > val f98:String = "", > val f99:String = "", > val f100:String = "", > val f101:String = "", > val f102:String = "", > val f103:String = "", > val f104:String = "", > val f105:String = "", > val f106:String = "", > val f107:String = "", > val f108:String = "", > val f109:String = "", > val f110:String = "", > val f111:String = "", > val f112:String = "", > val f113:String = "", > val f114:String = "", > val f115:String = "", > val f116:String = "", > val f117:String = "", > val f118:String = "", > val f119:String = "", > val f120:String = "", > val f121:String = "", > val f122:String = "", > val f123:String = "", > val f124:String = "", > val f125:String = "", > val f126:String = "", > val f127:String = "", > val f128:String = "", > val f129:String = "", > val f130:String = "", > val f131:String = "", > val f132:String = "", > val f133:String = "", > val f134:String = "", > val f135:String = "", > val f136:String = "", > val f137:String = "", > val f138:String = "", > val f139:String = "", > val f140:String = "", > val f141:String = "", > val f142:String = "", > val f143:String = "", > val f144:String = "", > val f145:String = "", > val f146:String = "", > val
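The repro above hand-writes a case class with well over 200 `String` fields (truncated here). For reproducing the codegen stack overflow at other widths, the class source can be generated instead of typed by hand; a small helper of this sort (a convenience sketch, not part of the original report; the field count of 250 is an arbitrary choice past the ~200-field threshold):

```python
def wide_case_class(n: int) -> str:
    """Generate Scala source for a case class with n empty-string fields,
    mirroring the f0..fN naming of the hand-written repro above."""
    fields = ",\n".join(f'  val f{i}: String = ""' for i in range(n))
    return f"case class Wide(\n{fields}\n)"

src = wide_case_class(250)
print(src.count("val f"))  # 250 generated fields
```

Pasting the generated source into a spark-shell session reproduces the same shape of plan as the original report.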
[jira] [Assigned] (SPARK-23089) "Unable to create operation log session directory" when parent directory not present
[ https://issues.apache.org/jira/browse/SPARK-23089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23089: --- Assignee: Marco Gaido > "Unable to create operation log session directory" when parent directory not > present > > > Key: SPARK-23089 > URL: https://issues.apache.org/jira/browse/SPARK-23089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: > /usr/hdp/2.6.3.0-235/spark2/jars/spark-hive-thriftserver_2.11-2.2.0.2.6.3.0-235.jar > $ cat /etc/redhat-release > Red Hat Enterprise Linux Server release 7.4 (Maipo) > $ ps aux|grep ^hive.*spark.*thrift > hive 1468503 0.9 0.5 13319628 1411676 ?Sl Jan15 10:18 > /usr/java/default/bin/java -Dhdp.version=2.6.3.0-235 -cp > /usr/hdp/current/spark2-thriftserver/conf/:/usr/hdp/current/spark2-thriftserver/jars/*:/usr/hdp/current/hadoop-client/conf/ > -Xmx2048m org.apache.spark.deploy.SparkSubmit --properties-file > /usr/hdp/current/spark2-thriftserver/conf/spark-thrift-sparkconf.conf --class > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --name Thrift > JDBC/ODBC Server spark-internal >Reporter: Sean Roberts >Assignee: Marco Gaido >Priority: Minor > Fix For: 2.3.0 > > > When creating a session directory, Thrift should create the parent directory > _(i.e. /tmp/hive/operation_logs)_ if it is not present. > It's common for operators to clean-up old and empty directories in /tmp, or > to have tools (systemd-tmpfiles or tmpwatch) that do it automatically. > This was fixed in HIVE-12262 but not in Spark Thrift as seen by this: > {code}18/01/15 14:22:49 WARN HiveSessionImpl: Unable to create operation log > session directory: > /tmp/hive/operation_logs/683a6318-adc4-42c4-b665-11dad14d7ec7{code} > Resolved by manually creating /tmp/hive/operation_logs/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23089) "Unable to create operation log session directory" when parent directory not present
[ https://issues.apache.org/jira/browse/SPARK-23089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23089. - Resolution: Fixed Fix Version/s: 2.3.0 > "Unable to create operation log session directory" when parent directory not > present > > > Key: SPARK-23089 > URL: https://issues.apache.org/jira/browse/SPARK-23089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: > /usr/hdp/2.6.3.0-235/spark2/jars/spark-hive-thriftserver_2.11-2.2.0.2.6.3.0-235.jar > $ cat /etc/redhat-release > Red Hat Enterprise Linux Server release 7.4 (Maipo) > $ ps aux|grep ^hive.*spark.*thrift > hive 1468503 0.9 0.5 13319628 1411676 ?Sl Jan15 10:18 > /usr/java/default/bin/java -Dhdp.version=2.6.3.0-235 -cp > /usr/hdp/current/spark2-thriftserver/conf/:/usr/hdp/current/spark2-thriftserver/jars/*:/usr/hdp/current/hadoop-client/conf/ > -Xmx2048m org.apache.spark.deploy.SparkSubmit --properties-file > /usr/hdp/current/spark2-thriftserver/conf/spark-thrift-sparkconf.conf --class > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --name Thrift > JDBC/ODBC Server spark-internal >Reporter: Sean Roberts >Assignee: Marco Gaido >Priority: Minor > Fix For: 2.3.0 > > > When creating a session directory, Thrift should create the parent directory > _(i.e. /tmp/hive/operation_logs)_ if it is not present. > It's common for operators to clean-up old and empty directories in /tmp, or > to have tools (systemd-tmpfiles or tmpwatch) that do it automatically. 
> This was fixed in HIVE-12262 but not in Spark Thrift as seen by this: > {code}18/01/15 14:22:49 WARN HiveSessionImpl: Unable to create operation log > session directory: > /tmp/hive/operation_logs/683a6318-adc4-42c4-b665-11dad14d7ec7{code} > Resolved by manually creating /tmp/hive/operation_logs/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
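The fix follows the HIVE-12262 pattern: create the missing ancestor directories instead of assuming they survived a /tmp cleanup. A minimal Python sketch of that pattern (illustrative only; the actual fix lives in the Scala/Java Thrift server code):

```python
import os
import tempfile

# Simulate the failure scenario: /tmp/hive/operation_logs has been
# removed by a tmp-cleaner, so only the base directory exists.
base = tempfile.mkdtemp()
session_dir = os.path.join(base, "hive", "operation_logs",
                           "683a6318-adc4-42c4-b665-11dad14d7ec7")

# mkdirs-style creation builds every missing ancestor, so a cleaned-up
# parent directory no longer breaks session startup.
os.makedirs(session_dir, exist_ok=True)
print(os.path.isdir(session_dir))  # True
```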
[jira] [Commented] (SPARK-7146) Should ML sharedParams be a public API?
[ https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332135#comment-16332135 ] Lucas Partridge commented on SPARK-7146: What's the semantic difference between HasFeaturesCol and HasInputCol, please? > Should ML sharedParams be a public API? > --- > > Key: SPARK-7146 > URL: https://issues.apache.org/jira/browse/SPARK-7146 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Joseph K. Bradley >Assignee: holdenk >Priority: Major > Fix For: 2.3.0 > > > Proposal: Make most of the Param traits in sharedParams.scala public. Mark > them as DeveloperApi. > Pros: > * Sharing the Param traits helps to encourage standardized Param names and > documentation. > Cons: > * Users have to be careful since parameters can have different meanings for > different algorithms. > * If the shared Params are public, then implementations could test for the > traits. It is unclear if we want users to rely on these traits, which are > somewhat experimental. > Currently, the shared params are private. > h3. UPDATED proposal > * Some Params are clearly safe to make public. We will do so. > * Some Params could be made public but may require caveats in the trait doc. > * Some Params have turned out not to be shared in practice. We can move > those Params to the classes which use them. 
> *Public shared params*: > * I/O column params > ** HasFeaturesCol > ** HasInputCol > ** HasInputCols > ** HasLabelCol > ** HasOutputCol > ** HasPredictionCol > ** HasProbabilityCol > ** HasRawPredictionCol > ** HasVarianceCol > ** HasWeightCol > * Algorithm settings > ** HasCheckpointInterval > ** HasElasticNetParam > ** HasFitIntercept > ** HasMaxIter > ** HasRegParam > ** HasSeed > ** HasStandardization (less common) > ** HasStepSize > ** HasTol > *Questionable params*: > * HasHandleInvalid (only used in StringIndexer, but might be more widely used > later on) > * HasSolver (used in LinearRegression and GeneralizedLinearRegression, but > same meaning as Optimizer in LDA) > *Params to be removed from sharedParams*: > * HasThreshold (only used in LogisticRegression) > * HasThresholds (only used in ProbabilisticClassifier) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
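The params listed above are small mixin traits that classes compose as needed; in PySpark they surface as mixins in `pyspark.ml.param.shared`. A stripped-down sketch of the pattern (illustrative class names; the real machinery carries `Param` objects, docs, and validators):

```python
# Simplified sketch of the shared-param mixin pattern discussed above.
class HasInputCol:
    """Generic single input column, shared by many feature Transformers."""
    def __init__(self):
        self.inputCol = None


class HasPredictionCol:
    """Prediction output column, shared by predictors and their models."""
    def __init__(self):
        self.predictionCol = "prediction"


class MyPredictor(HasInputCol, HasPredictionCol):
    """A class opts into shared params simply by mixing the traits in."""
    def __init__(self):
        HasInputCol.__init__(self)
        HasPredictionCol.__init__(self)


p = MyPredictor()
print(p.predictionCol)  # prediction
```

On the question raised in the comment: `HasFeaturesCol` names the assembled feature-vector column consumed by estimators and models, while `HasInputCol` is the generic single-column input used by feature transformers, which is why both appear in the public list.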
[jira] [Commented] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces
[ https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332133#comment-16332133 ] ANDY GUAN commented on SPARK-23148: --- Looks like the same problem as [SPARK-21996|https://issues.apache.org/jira/browse/SPARK-21996?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel]. I fixed the problem by making the following change in DataSourceScanExec.scala, line 441:
{code}
// before
Seq(PartitionedFile(
  partition.values, file.getPath.toUri.toString, 0, file.getLen, hosts))
// after
Seq(PartitionedFile(
  partition.values, file.getPath.toString, 0, file.getLen, hosts))
{code}
Can you help to make a pull request? > spark.read.csv with multiline=true gives FileNotFoundException if path > contains spaces > -- > > Key: SPARK-23148 > URL: https://issues.apache.org/jira/browse/SPARK-23148 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Bogdan Raducanu >Priority: Major > > Repro code: > {code:java} > spark.range(10).write.csv("/tmp/a b c/a.csv") > spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count > 10 > spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count > java.io.FileNotFoundException: File > file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv > does not exist > {code} > Trying to manually escape fails in a different place: > {code} > spark.read.option("multiLine", true).csv("/tmp/a%20b%20c/a.csv").count > org.apache.spark.sql.AnalysisException: Path does not exist: > file:/tmp/a%20b%20c/a.csv; > at > org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:683) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387) > at >
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
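The symptom is the classic URI-vs-path mix-up: converting the path to a URI percent-encodes the space once (`a b c` becomes `a%20b%20c`), and the encoded string is then treated as a literal filesystem path, which no longer matches the file on disk. In Python terms (an illustration of the encoding behaviour only, not the Spark code path):

```python
from urllib.parse import quote, unquote

path = "/tmp/a b c/a.csv"
uri_path = quote(path)  # percent-encodes the spaces, as toUri.toString does
print(uri_path)         # /tmp/a%20b%20c/a.csv

# Treating the encoded form as a literal path fails: '%20' is three
# characters, not a space. Decoding once recovers the real path.
print(unquote(uri_path) == path)  # True
```

This is consistent with the suggested fix of passing `file.getPath.toString` (the plain path) instead of `file.getPath.toUri.toString` (the encoded URI).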
[jira] [Assigned] (SPARK-23048) Update mllib docs to replace OneHotEncoder with OneHotEncoderEstimator
[ https://issues.apache.org/jira/browse/SPARK-23048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-23048: -- Assignee: Liang-Chi Hsieh > Update mllib docs to replace OneHotEncoder with OneHotEncoderEstimator > --- > > Key: SPARK-23048 > URL: https://issues.apache.org/jira/browse/SPARK-23048 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.3.0 > > > Since we're deprecating OneHotEncoder, we should update the docs to reference > its replacement, OneHotEncoderEstimator. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23048) Update mllib docs to replace OneHotEncoder with OneHotEncoderEstimator
[ https://issues.apache.org/jira/browse/SPARK-23048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-23048. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20257 [https://github.com/apache/spark/pull/20257] > Update mllib docs to replace OneHotEncoder with OneHotEncoderEstimator > --- > > Key: SPARK-23048 > URL: https://issues.apache.org/jira/browse/SPARK-23048 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.3.0 > > > Since we're deprecating OneHotEncoder, we should update the docs to reference > its replacement, OneHotEncoderEstimator. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23127) Update FeatureHasher user guide for catCols parameter
[ https://issues.apache.org/jira/browse/SPARK-23127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-23127. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20293 [https://github.com/apache/spark/pull/20293] > Update FeatureHasher user guide for catCols parameter > - > > Key: SPARK-23127 > URL: https://issues.apache.org/jira/browse/SPARK-23127 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Assignee: Nick Pentreath >Priority: Major > Fix For: 2.3.0 > > > SPARK-22801 added the {{categoricalCols}} parameter and updated the Scala and > Python doc, but did not update the user guide entry discussing feature > handling. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23127) Update FeatureHasher user guide for catCols parameter
[ https://issues.apache.org/jira/browse/SPARK-23127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-23127: -- Assignee: Nick Pentreath > Update FeatureHasher user guide for catCols parameter > - > > Key: SPARK-23127 > URL: https://issues.apache.org/jira/browse/SPARK-23127 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Assignee: Nick Pentreath >Priority: Major > > SPARK-22801 added the {{categoricalCols}} parameter and updated the Scala and > Python doc, but did not update the user guide entry discussing feature > handling. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces
[ https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332041#comment-16332041 ] Bogdan Raducanu edited comment on SPARK-23148 at 1/19/18 10:18 AM: --- I updated the description with the manual escape, if that is what you meant was (Author: bograd): What do you mean? > spark.read.csv with multiline=true gives FileNotFoundException if path > contains spaces > -- > > Key: SPARK-23148 > URL: https://issues.apache.org/jira/browse/SPARK-23148 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Bogdan Raducanu >Priority: Major > > Repro code: > {code:java} > spark.range(10).write.csv("/tmp/a b c/a.csv") > spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count > 10 > spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count > java.io.FileNotFoundException: File > file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv > does not exist > {code} > Trying to manually escape fails in a different place: > {code} > spark.read.option("multiLine", true).csv("/tmp/a%20b%20c/a.csv").count > org.apache.spark.sql.AnalysisException: Path does not exist: > file:/tmp/a%20b%20c/a.csv; > at > org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:683) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands,
e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces
[ https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bogdan Raducanu updated SPARK-23148: Description: Repro code: {code:java} spark.range(10).write.csv("/tmp/a b c/a.csv") spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count 10 spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count java.io.FileNotFoundException: File file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv does not exist {code} Trying to manually escape fails in a different place: {code} spark.read.option("multiLine", true).csv("/tmp/a%20b%20c/a.csv").count org.apache.spark.sql.AnalysisException: Path does not exist: file:/tmp/a%20b%20c/a.csv; at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:683) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.immutable.List.foreach(List.scala:381) {code} was: Repro code: {code:java} spark.range(10).write.csv("/tmp/a b c/a.csv") spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count 10 spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count java.io.FileNotFoundException: File file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv does not exist {code} > spark.read.csv with multiline=true gives FileNotFoundException if path > contains spaces > -- > > Key: SPARK-23148 > URL: https://issues.apache.org/jira/browse/SPARK-23148 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Bogdan Raducanu >Priority: Major > > Repro code: > 
{code:java} > spark.range(10).write.csv("/tmp/a b c/a.csv") > spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count > 10 > spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count > java.io.FileNotFoundException: File > file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv > does not exist > {code} > Trying to manually escape fails in a different place: > {code} > spark.read.option("multiLine", true).csv("/tmp/a%20b%20c/a.csv").count > org.apache.spark.sql.AnalysisException: Path does not exist: > file:/tmp/a%20b%20c/a.csv; > at > org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:683) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces
[ https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332041#comment-16332041 ] Bogdan Raducanu commented on SPARK-23148: - What do you mean? > spark.read.csv with multiline=true gives FileNotFoundException if path > contains spaces > -- > > Key: SPARK-23148 > URL: https://issues.apache.org/jira/browse/SPARK-23148 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Bogdan Raducanu >Priority: Major > > Repro code: > {code:java} > spark.range(10).write.csv("/tmp/a b c/a.csv") > spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count > 10 > spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count > java.io.FileNotFoundException: File > file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv > does not exist > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23000) Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-23000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331870#comment-16331870 ] Apache Spark commented on SPARK-23000: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/20328 > Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in Spark 2.3 > - > > Key: SPARK-23000 > URL: https://issues.apache.org/jira/browse/SPARK-23000 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Blocker > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-sbt-hadoop-2.6/ > The test suite DataSourceWithHiveMetastoreCatalogSuite of Branch 2.3 always > failed in hadoop 2.6 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org