[jira] [Comment Edited] (SPARK-23109) ML 2.3 QA: API: Python API coverage

2018-01-19 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332698#comment-16332698
 ] 

Bryan Cutler edited comment on SPARK-23109 at 1/20/18 3:17 AM:
---

I did the following: generated HTML doc and checked for consistency with Scala, 
 did not see any API breaking changes, checked for missing items (see list 
below), checked default param values match.  No blocking or major issues found.

Items requiring follow-up (I will create related JIRAs to fix these):

classification:
    GBTClassifier - missing featureSubsetStrategy, should be moved to 
TreeEnsembleParams
    GBTClassificationModel - missing numClasses, should inherit from 
JavaClassificationModel
for both of the above https://issues.apache.org/jira/browse/SPARK-23161

clustering:
    GaussianMixtureModel - missing gaussians, need to serialize 
Array[MultivariateGaussian]?
    LDAModel - missing topicsMatrix - can send Matrix through Py4J?

evaluation:
    ClusteringEvaluator - DOC describe silhouette like scaladoc

feature:
    Bucketizer - multiple input/output cols, splitsArray - 
https://issues.apache.org/jira/browse/SPARK-22797
    ChiSqSelector - DOC selectorType desc missing new types
    QuantileDiscretizer - multiple input output cols - 
https://issues.apache.org/jira/browse/SPARK-22796

fpm:
    DOC associationRules should say return "DataFrame"

image:
    missing columnSchema, get*, scala missing toNDArray

regression:
    LinearRegressionSummary - missing r2adj - 
https://issues.apache.org/jira/browse/SPARK-23162

stat:
    missing Summarizer class - https://issues.apache.org/jira/browse/SPARK-21741

tuning:
    missing subModels, hasSubModels - 
https://issues.apache.org/jira/browse/SPARK-22005

for the above DOC issues https://issues.apache.org/jira/browse/SPARK-23163
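
As a rough illustration of the "checked for missing items" step above, the spot-check below diffs the Param names exposed by a PySpark wrapper against those of the underlying Scala object. This is only a sketch: it assumes a local SparkSession, relies on the internal {{_java_obj}} handle, and uses GBTClassifier as the example.

{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.classification import GBTClassifier

spark = SparkSession.builder.master("local[1]").getOrCreate()

est = GBTClassifier()
python_params = {p.name for p in est.params}
# est._java_obj is the wrapped Scala GBTClassifier; its params() lists the Scala Params
scala_params = {p.name() for p in est._java_obj.params()}

# on a pre-SPARK-23161 build this is expected to include 'featureSubsetStrategy'
print("in Scala but not in Python:", sorted(scala_params - python_params))
{code}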


was (Author: bryanc):
I did the following: generated HTML doc and checked for consistency with Scala, 
 did not see any API breaking changes, checked for missing items (see list 
below), checked default param values match.  No blocking or major issues found.

Items requiring follow-up (I will create related JIRAs to fix these):

classification:
    GBTClassifier - missing featureSubsetStrategy, should be moved to 
TreeEnsembleParams
    GBTClassificationModel - missing numClasses, should inherit from 
JavaClassificationModel
for both of the above https://issues.apache.org/jira/browse/SPARK-23161

clustering:
    GaussianMixtureModel - missing gaussians, need to serialize 
Array[MultivariateGaussian]?
    LDAModel - missing topicsMatrix - can send Matrix through Py4J?

evaluation:
    ClusteringEvaluator - DOC describe silhouette like scaladoc

feature:
    Bucketizer - multiple input/output cols, splitsArray - 
https://issues.apache.org/jira/browse/SPARK-22797
    ChiSqSelector - DOC selectorType desc missing new types
    QuantileDiscretizer - multiple input output cols - 
https://issues.apache.org/jira/browse/SPARK-22796

fpm:
    DOC associationRules should say return "DataFrame"

image:
    missing columnSchema, get*, scala missing toNDArray

regression:
    LinearRegressionSummary - missing r2adj - 
https://issues.apache.org/jira/browse/SPARK-23162

stat:
    missing Summarizer class - https://issues.apache.org/jira/browse/SPARK-21741

tuning:
    missing subModels, hasSubModels - 
https://issues.apache.org/jira/browse/SPARK-22005

> ML 2.3 QA: API: Python API coverage
> ---
>
> Key: SPARK-23109
> URL: https://issues.apache.org/jira/browse/SPARK-23109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-23163) Sync Python ML API docs with Scala

2018-01-19 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333154#comment-16333154
 ] 

Bryan Cutler commented on SPARK-23163:
--

I'll do this, just a few minor things

> Sync Python ML API docs with Scala
> --
>
> Key: SPARK-23163
> URL: https://issues.apache.org/jira/browse/SPARK-23163
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Trivial
>
> Fix a few doc issues as reported in 2.3 ML QA SPARK-23109



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23163) Sync Python ML API docs with Scala

2018-01-19 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-23163:


 Summary: Sync Python ML API docs with Scala
 Key: SPARK-23163
 URL: https://issues.apache.org/jira/browse/SPARK-23163
 Project: Spark
  Issue Type: Documentation
  Components: ML, PySpark
Affects Versions: 2.3.0
Reporter: Bryan Cutler


Fix a few doc issues as reported in 2.3 ML QA SPARK-23109



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23109) ML 2.3 QA: API: Python API coverage

2018-01-19 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332698#comment-16332698
 ] 

Bryan Cutler edited comment on SPARK-23109 at 1/20/18 3:06 AM:
---

I did the following: generated HTML doc and checked for consistency with Scala, 
 did not see any API breaking changes, checked for missing items (see list 
below), checked default param values match.  No blocking or major issues found.

Items requiring follow-up (I will create related JIRAs to fix these):

classification:
    GBTClassifier - missing featureSubsetStrategy, should be moved to 
TreeEnsembleParams
    GBTClassificationModel - missing numClasses, should inherit from 
JavaClassificationModel
for both of the above https://issues.apache.org/jira/browse/SPARK-23161

clustering:
    GaussianMixtureModel - missing gaussians, need to serialize 
Array[MultivariateGaussian]?
    LDAModel - missing topicsMatrix - can send Matrix through Py4J?

evaluation:
    ClusteringEvaluator - DOC describe silhouette like scaladoc

feature:
    Bucketizer - multiple input/output cols, splitsArray - 
https://issues.apache.org/jira/browse/SPARK-22797
    ChiSqSelector - DOC selectorType desc missing new types
    QuantileDiscretizer - multiple input output cols - 
https://issues.apache.org/jira/browse/SPARK-22796

fpm:
    DOC associationRules should say return "DataFrame"

image:
    missing columnSchema, get*, scala missing toNDArray

regression:
    LinearRegressionSummary - missing r2adj - 
https://issues.apache.org/jira/browse/SPARK-23162

stat:
    missing Summarizer class - https://issues.apache.org/jira/browse/SPARK-21741

tuning:
    missing subModels, hasSubModels - 
https://issues.apache.org/jira/browse/SPARK-22005


was (Author: bryanc):
I did the following: generated HTML doc and checked for consistency with Scala, 
 did not see any API breaking changes, checked for missing items (see list 
below), checked default param values match.  No blocking or major issues found.

Items requiring follow-up (I will create related JIRAs to fix these):

classification:
    GBTClassifier - missing featureSubsetStrategy, should be moved to 
TreeEnsembleParams
    GBTClassificationModel - missing numClasses, should inherit from 
JavaClassificationModel
for both of the above https://issues.apache.org/jira/browse/SPARK-23161

clustering:
    GaussianMixtureModel - missing gaussians, need to serialize 
Array[MultivariateGaussian]?
    LDAModel - missing topicsMatrix - can send Matrix through Py4J?

evaluation:
    ClusteringEvaluator - DOC describe silhouette like scaladoc

feature:
    Bucketizer - multiple input/output cols, splitsArray - 
https://issues.apache.org/jira/browse/SPARK-22797
    ChiSqSelector - DOC selectorType desc missing new types
    QuantileDiscretizer - multiple input output cols - 
https://issues.apache.org/jira/browse/SPARK-22796

fpm:
    DOC associationRules should say return "DataFrame"

image:
    missing columnSchema, get*, scala missing toNDArray

regression:
    LinearRegressionSummary - missing r2adj - 
https://issues.apache.org/jira/browse/SPARK-23162

stat:
    missing Summarizer class - https://issues.apache.org/jira/browse/SPARK-21741

tuning:
    missing subModels, hasSubModels - 
https://issues.apache.org/jira/browse/SPARK-22005

> ML 2.3 QA: API: Python API coverage
> ---
>
> Key: SPARK-23109
> URL: https://issues.apache.org/jira/browse/SPARK-23109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-23109) ML 2.3 QA: API: Python API coverage

2018-01-19 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332698#comment-16332698
 ] 

Bryan Cutler edited comment on SPARK-23109 at 1/20/18 3:05 AM:
---

I did the following: generated HTML doc and checked for consistency with Scala, 
 did not see any API breaking changes, checked for missing items (see list 
below), checked default param values match.  No blocking or major issues found.

Items requiring follow-up (I will create related JIRAs to fix these):

classification:
    GBTClassifier - missing featureSubsetStrategy, should be moved to 
TreeEnsembleParams
    GBTClassificationModel - missing numClasses, should inherit from 
JavaClassificationModel
for both of the above https://issues.apache.org/jira/browse/SPARK-23161

clustering:
    GaussianMixtureModel - missing gaussians, need to serialize 
Array[MultivariateGaussian]?
    LDAModel - missing topicsMatrix - can send Matrix through Py4J?

evaluation:
    ClusteringEvaluator - DOC describe silhouette like scaladoc

feature:
    Bucketizer - multiple input/output cols, splitsArray - 
https://issues.apache.org/jira/browse/SPARK-22797
    ChiSqSelector - DOC selectorType desc missing new types
    QuantileDiscretizer - multiple input output cols - 
https://issues.apache.org/jira/browse/SPARK-22796

fpm:
    DOC associationRules should say return "DataFrame"

image:
    missing columnSchema, get*, scala missing toNDArray

regression:
    LinearRegressionSummary - missing r2adj - 
https://issues.apache.org/jira/browse/SPARK-23162

stat:
    missing Summarizer class - https://issues.apache.org/jira/browse/SPARK-21741

tuning:
    missing subModels, hasSubModels - 
https://issues.apache.org/jira/browse/SPARK-22005


was (Author: bryanc):
I did the following: generated HTML doc and checked for consistency with Scala, 
 did not see any API breaking changes, checked for missing items (see list 
below), checked default param values match.  No blocking or major issues found.

Items requiring follow-up (I will create related JIRAs to fix these):

classification:
    GBTClassifier - missing featureSubsetStrategy, should be moved to 
TreeEnsembleParams
    GBTClassificationModel - missing numClasses, should inherit from 
JavaClassificationModel
https://issues.apache.org/jira/browse/SPARK-23161 for the above

clustering:
    GaussianMixtureModel - missing gaussians, need to serialize 
Array[MultivariateGaussian]?
    LDAModel - missing topicsMatrix - can send Matrix through Py4J?

evaluation:
    ClusteringEvaluator - DOC describe silhouette like scaladoc

feature:
    Bucketizer - multiple input/output cols, splitsArray - 
https://issues.apache.org/jira/browse/SPARK-22797
    ChiSqSelector - DOC selectorType desc missing new types
    QuantileDiscretizer - multiple input output cols - 
https://issues.apache.org/jira/browse/SPARK-22796

fpm:
    DOC associationRules should say return "DataFrame"

image:
    missing columnSchema, get*, scala missing toNDArray

regression:
    LinearRegressionSummary - missing r2adj

stat:
    missing Summarizer class - https://issues.apache.org/jira/browse/SPARK-21741

tuning:
    missing subModels, hasSubModels - 
https://issues.apache.org/jira/browse/SPARK-22005

> ML 2.3 QA: API: Python API coverage
> ---
>
> Key: SPARK-23109
> URL: https://issues.apache.org/jira/browse/SPARK-23109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23161) Add missing APIs to Python GBTClassifier

2018-01-19 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-23161:
-
Labels: starter  (was: )

> Add missing APIs to Python GBTClassifier
> 
>
> Key: SPARK-23161
> URL: https://issues.apache.org/jira/browse/SPARK-23161
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Minor
>  Labels: starter
>
> GBTClassifier is missing {{featureSubsetStrategy}}.  This should be moved to 
> {{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs.
> GBTClassificationModel is missing {{numClasses}}. It should inherit from 
> {{JavaClassificationModel}} instead of the prediction model, which will give it 
> this param.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23162) PySpark ML LinearRegressionSummary missing r2adj

2018-01-19 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-23162:


 Summary: PySpark ML LinearRegressionSummary missing r2adj
 Key: SPARK-23162
 URL: https://issues.apache.org/jira/browse/SPARK-23162
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.3.0
Reporter: Bryan Cutler


Missing the Python API for {{r2adj}} in {{LinearRegressionSummary}}
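
Not the committed patch, just a sketch of the shape such an addition usually takes: the existing LinearRegressionSummary properties (e.g. {{r2}}) delegate to the JVM summary object via {{_call_java}}, so {{r2adj}} would presumably follow the same pattern. The {{@since}} version below is a placeholder.

{code:python}
from pyspark import since
from pyspark.ml.regression import LinearRegressionSummary

@property
@since("2.4.0")  # placeholder version
def r2adj(self):
    """Adjusted R^2, mirroring Scala's LinearRegressionSummary.r2adj."""
    return self._call_java("r2adj")

# attached here only for illustration; the real fix would define it on the class
LinearRegressionSummary.r2adj = r2adj
{code}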



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23161) Add missing APIs to Python GBTClassifier

2018-01-19 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-23161:
-
Priority: Minor  (was: Major)

> Add missing APIs to Python GBTClassifier
> 
>
> Key: SPARK-23161
> URL: https://issues.apache.org/jira/browse/SPARK-23161
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Minor
>  Labels: starter
>
> GBTClassifier is missing {{featureSubsetStrategy}}.  This should be moved to 
> {{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs.
> GBTClassificationModel is missing {{numClasses}}. It should inherit from 
> {{JavaClassificationModel}} instead of the prediction model, which will give it 
> this param.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23109) ML 2.3 QA: API: Python API coverage

2018-01-19 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332698#comment-16332698
 ] 

Bryan Cutler edited comment on SPARK-23109 at 1/20/18 3:00 AM:
---

I did the following: generated HTML doc and checked for consistency with Scala, 
 did not see any API breaking changes, checked for missing items (see list 
below), checked default param values match.  No blocking or major issues found.

Items requiring follow-up (I will create related JIRAs to fix these):

classification:
    GBTClassifier - missing featureSubsetStrategy, should be moved to 
TreeEnsembleParams
    GBTClassificationModel - missing numClasses, should inherit from 
JavaClassificationModel
https://issues.apache.org/jira/browse/SPARK-23161 for the above

clustering:
    GaussianMixtureModel - missing gaussians, need to serialize 
Array[MultivariateGaussian]?
    LDAModel - missing topicsMatrix - can send Matrix through Py4J?

evaluation:
    ClusteringEvaluator - DOC describe silhouette like scaladoc

feature:
    Bucketizer - multiple input/output cols, splitsArray - 
https://issues.apache.org/jira/browse/SPARK-22797
    ChiSqSelector - DOC selectorType desc missing new types
    QuantileDiscretizer - multiple input output cols - 
https://issues.apache.org/jira/browse/SPARK-22796

fpm:
    DOC associationRules should say return "DataFrame"

image:
    missing columnSchema, get*, scala missing toNDArray

regression:
    LinearRegressionSummary - missing r2adj

stat:
    missing Summarizer class - https://issues.apache.org/jira/browse/SPARK-21741

tuning:
    missing subModels, hasSubModels - 
https://issues.apache.org/jira/browse/SPARK-22005


was (Author: bryanc):
I did the following: generated HTML doc and checked for consistency with Scala, 
 did not see any API breaking changes, checked for missing items (see list 
below), checked default param values match.  No blocking or major issues found.

Items requiring follow-up (I will create related JIRAs to fix these):

classification:
    GBTClassifier - missing featureSubsetStrategy, should be moved to 
TreeEnsembleParams
    GBTClassificationModel - missing numClasses, should inherit from 
JavaClassificationModel

clustering:
    GaussianMixtureModel - missing gaussians, need to serialize 
Array[MultivariateGaussian]?
    LDAModel - missing topicsMatrix - can send Matrix through Py4J?

evaluation:
    ClusteringEvaluator - DOC describe silhouette like scaladoc

feature:
    Bucketizer - multiple input/output cols, splitsArray - 
https://issues.apache.org/jira/browse/SPARK-22797
    ChiSqSelector - DOC selectorType desc missing new types
    QuantileDiscretizer - multiple input output cols - 
https://issues.apache.org/jira/browse/SPARK-22796

fpm:
    DOC associationRules should say return "DataFrame"

image:
    missing columnSchema, get*, scala missing toNDArray

regression:
    LinearRegressionSummary - missing r2adj

stat:
    missing Summarizer class - https://issues.apache.org/jira/browse/SPARK-21741

tuning:
    missing subModels, hasSubModels - 
https://issues.apache.org/jira/browse/SPARK-22005

> ML 2.3 QA: API: Python API coverage
> ---
>
> Key: SPARK-23109
> URL: https://issues.apache.org/jira/browse/SPARK-23109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23161) Add missing APIs to Python GBTClassifier

2018-01-19 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-23161:


 Summary: Add missing APIs to Python GBTClassifier
 Key: SPARK-23161
 URL: https://issues.apache.org/jira/browse/SPARK-23161
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.3.0
Reporter: Bryan Cutler


GBTClassifier is missing {{featureSubsetStrategy}}.  This should be moved to 
{{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs.

GBTClassificationModel is missing {{numClasses}}. It should inherit from 
{{JavaClassificationModel}} instead of the prediction model, which will give it this 
param.
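
Both gaps are visible from a PySpark shell; a minimal sketch (assumes a local SparkSession, and the two-row DataFrame is made-up toy data):

{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.master("local[1]").getOrCreate()

gbt = GBTClassifier(maxIter=2)
print(gbt.hasParam("featureSubsetStrategy"))  # False until the param is exposed in Python

df = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0)), (1.0, Vectors.dense(1.0))], ["label", "features"])
model = gbt.fit(df)
print(hasattr(model, "numClasses"))           # False until the model picks up the classification mixin
{code}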



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23121) When the Spark Streaming app is running for a period of time, the page is incorrectly reported when accessing '/jobs/' or '/jobs/job/?id=13' and the UI can not be accessed

2018-01-19 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-23121:
---
Affects Version/s: (was: 2.4.0)
   2.3.0

> When the Spark Streaming app is running for a period of time, the page is 
> incorrectly reported when accessing '/jobs/' or '/jobs/job/?id=13' and the UI 
> can not be accessed.
> -
>
> Key: SPARK-23121
> URL: https://issues.apache.org/jira/browse/SPARK-23121
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Priority: Major
> Attachments: 1.png, 2.png
>
>
> When the Spark Streaming app is running for a period of time, the page is 
> incorrectly reported when accessing '/jobs/' or '/jobs/job/?id=13' and the UI 
> can not be accessed.
>  
> Test command:
> ./bin/spark-submit --class org.apache.spark.examples.streaming.HdfsWordCount 
> ./examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar /spark
>  
> After the app has been running for a period of time, the UI can not be accessed; 
> please see the attachments.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23121) When the Spark Streaming app is running for a period of time, the page is incorrectly reported when accessing '/jobs/' or '/jobs/job/?id=13' and the UI can not be accessed

2018-01-19 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-23121:
---
Target Version/s: 2.3.0

> When the Spark Streaming app is running for a period of time, the page is 
> incorrectly reported when accessing '/jobs/' or '/jobs/job/?id=13' and the UI 
> can not be accessed.
> -
>
> Key: SPARK-23121
> URL: https://issues.apache.org/jira/browse/SPARK-23121
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Priority: Major
> Attachments: 1.png, 2.png
>
>
> When the Spark Streaming app is running for a period of time, the page is 
> incorrectly reported when accessing '/jobs/' or '/jobs/job/?id=13' and the UI 
> can not be accessed.
>  
> Test command:
> ./bin/spark-submit --class org.apache.spark.examples.streaming.HdfsWordCount 
> ./examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar /spark
>  
> After the app has been running for a period of time, the UI can not be accessed; 
> please see the attachments.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23091) Incorrect unit test for approxQuantile

2018-01-19 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23091:

Component/s: SQL

> Incorrect unit test for approxQuantile
> --
>
> Key: SPARK-23091
> URL: https://issues.apache.org/jira/browse/SPARK-23091
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL, Tests
>Affects Versions: 2.2.1
>Reporter: Kuang Chen
>Priority: Minor
>
> Currently, the test for `approxQuantile` (the quantile estimation algorithm) checks 
> whether the estimated quantile is within +- 2*`relativeError` of the true 
> quantile. See the code below:
> [https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala#L157]
> However, based on the original paper by Greenwald and Khanna, the estimated 
> quantile is guaranteed to be within +- `relativeError` of the true 
> quantile. Using double the tolerance is misleading and incorrect, and we 
> should fix it.
>  
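
A worked example of the guarantee being discussed (a sketch, assuming a local SparkSession): for the median of 0..999 with relativeError = 0.1, the Greenwald-Khanna bound allows the returned value's rank to be off by at most relativeError*N = 100 positions, so roughly [400, 600] is the window the test should assert, not the [300, 700] that the doubled tolerance permits.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.range(1000)

(q,) = df.approxQuantile("id", [0.5], 0.1)
# the returned value's rank must be within relativeError*N of the true median's rank
assert abs(q - 500) <= 0.1 * 1000 + 1, q
print("median estimate:", q)
{code}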



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21771) SparkSQLEnv creates a useless meta hive client

2018-01-19 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21771.
-
   Resolution: Fixed
 Assignee: Kent Yao
Fix Version/s: 2.3.0

> SparkSQLEnv creates a useless meta hive client
> --
>
> Key: SPARK-21771
> URL: https://issues.apache.org/jira/browse/SPARK-21771
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 2.3.0
>
>
> Once a meta Hive client is created, it generates its SessionState, which 
> creates a lot of session-related directories, some marked deleteOnExit and some 
> not. If a Hive client is useless, we should not create it at the very start.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces

2018-01-19 Thread Henry Robinson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332999#comment-16332999
 ] 

Henry Robinson commented on SPARK-23148:


It seems like the problem is that 
{{CodecStreams.createInputStreamWithCloseResource}} can't properly handle a 
{{path}} argument that's URL-encoded. We could add an overload for 
{{createInputStreamWithCloseResource(Configuration, Path)}} and then pass {{new 
Path(new URI(path))}} from {{CSVDataSource.readFile()}}. This has the benefit 
of being a more localised change (and doesn't change the 'contract' that comes 
from {{FileScanRDD}} currently having URL-encoded pathnames everywhere). A 
strawman commit is 
[here|https://github.com/henryr/spark/commit/b8c51418ee7d4bca18179fd863f7f4885c98c0ef].

> spark.read.csv with multiline=true gives FileNotFoundException if path 
> contains spaces
> --
>
> Key: SPARK-23148
> URL: https://issues.apache.org/jira/browse/SPARK-23148
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bogdan Raducanu
>Priority: Major
>
> Repro code:
> {code:java}
> spark.range(10).write.csv("/tmp/a b c/a.csv")
> spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count
> 10
> spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count
> java.io.FileNotFoundException: File 
> file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv
>  does not exist
> {code}
> Trying to manually escape fails in a different place:
> {code}
> spark.read.option("multiLine", true).csv("/tmp/a%20b%20c/a.csv").count
> org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/tmp/a%20b%20c/a.csv;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:683)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces

2018-01-19 Thread Henry Robinson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332999#comment-16332999
 ] 

Henry Robinson edited comment on SPARK-23148 at 1/19/18 11:25 PM:
--

It seems like the problem is that 
{{CodecStreams.createInputStreamWithCloseResource}} can't properly handle a 
{{path}} argument that's URL-encoded. We could add an overload for 
{{createInputStreamWithCloseResource(Configuration, Path)}} and then pass {{new 
Path(new URI(path))}} from {{CSVDataSource.readFile()}}. This has the benefit 
of being a more localised change (and doesn't change the 'contract' that comes 
from {{FileScanRDD}} currently having URL-encoded pathnames everywhere). A 
strawman commit is 
[here|https://github.com/henryr/spark/commit/b8c51418ee7d4bca18179fd863f7f4885c98c0ef].


was (Author: henryr):
It seems like the problem is that 
{{CodecStreams.createInputStreamWithCloseResource}} can't properly handle a 
{{path}} argument that's URL-encoded. We could add an overload for 
{{createInputStreamWithCloseResource(Configuration, Path)}} and then pass {{new 
Path(new URI(path))}} from {{CSVDataSource.readFile()}}. This has the benefit 
of being a more localised change (and doesn't change the 'contract' that comes 
from {{FileScanRDD}} currently having URL-encoded pathnames everywhere. A 
strawman commit is 
[here|https://github.com/henryr/spark/commit/b8c51418ee7d4bca18179fd863f7f4885c98c0ef].

> spark.read.csv with multiline=true gives FileNotFoundException if path 
> contains spaces
> --
>
> Key: SPARK-23148
> URL: https://issues.apache.org/jira/browse/SPARK-23148
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bogdan Raducanu
>Priority: Major
>
> Repro code:
> {code:java}
> spark.range(10).write.csv("/tmp/a b c/a.csv")
> spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count
> 10
> spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count
> java.io.FileNotFoundException: File 
> file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv
>  does not exist
> {code}
> Trying to manually escape fails in a different place:
> {code}
> spark.read.option("multiLine", true).csv("/tmp/a%20b%20c/a.csv").count
> org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/tmp/a%20b%20c/a.csv;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:683)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23087) CheckCartesianProduct too restrictive when condition is constant folded to false/null

2018-01-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23087:


Assignee: (was: Apache Spark)

> CheckCartesianProduct too restrictive when condition is constant folded to 
> false/null
> -
>
> Key: SPARK-23087
> URL: https://issues.apache.org/jira/browse/SPARK-23087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Juliusz Sompolski
>Priority: Minor
>
> Running
> {code}
> sql("SELECT id as a FROM RANGE(10)").createOrReplaceTempView("A")
> sql("SELECT NULL as a FROM RANGE(10)").createOrReplaceTempView("NULLTAB")
> sql("SELECT 1 as goo FROM A LEFT OUTER JOIN NULLTAB ON A.a = 
> NULLTAB.a").collect()
> {code}
> results in:
> {code}
> org.apache.spark.sql.AnalysisException: Detected cartesian product for LEFT 
> OUTER join between logical plans
> Project
> +- Range (0, 10, step=1, splits=None)
> and
> Project
> +- Range (0, 10, step=1, splits=None)
> Join condition is missing or trivial.
> Use the CROSS JOIN syntax to allow cartesian products between these 
> relations.;
>   at 
>  
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121)
> {code}
> This is because NULLTAB.a is constant folded to null, and then the condition 
> is constant folded altogether:
> {code}
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.NullPropagation ===
> GlobalLimit 21  
>  +- LocalLimit 21
> +- Project [1 AS goo#28] 
> !  +- Join LeftOuter, (a#0L = null)  
>   :- Project [id#1L AS a#0L] 
>   :  +- Range (0, 10, step=1, splits=None)   
>   +- Project  
>  +- Range (0, 10, step=1, splits=None) 
> GlobalLimit 21
> +- LocalLimit 21
>+- Project [1 AS goo#28]
>   +- Join LeftOuter, null
>  :- Project [id#1L AS a#0L]
>  :  +- Range (0, 10, step=1, splits=None)
>  +- Project
> +- Range (0, 10, step=1, splits=None)
> {code}
> And then CheckCartesianProducts doesn't like it, even though the condition 
> does not produce a cartesian product; it simply evaluates to null.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23087) CheckCartesianProduct too restrictive when condition is constant folded to false/null

2018-01-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23087:


Assignee: Apache Spark

> CheckCartesianProduct too restrictive when condition is constant folded to 
> false/null
> -
>
> Key: SPARK-23087
> URL: https://issues.apache.org/jira/browse/SPARK-23087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Juliusz Sompolski
>Assignee: Apache Spark
>Priority: Minor
>
> Running
> {code}
> sql("SELECT id as a FROM RANGE(10)").createOrReplaceTempView("A")
> sql("SELECT NULL as a FROM RANGE(10)").createOrReplaceTempView("NULLTAB")
> sql("SELECT 1 as goo FROM A LEFT OUTER JOIN NULLTAB ON A.a = 
> NULLTAB.a").collect()
> {code}
> results in:
> {code}
> org.apache.spark.sql.AnalysisException: Detected cartesian product for LEFT 
> OUTER join between logical plans
> Project
> +- Range (0, 10, step=1, splits=None)
> and
> Project
> +- Range (0, 10, step=1, splits=None)
> Join condition is missing or trivial.
> Use the CROSS JOIN syntax to allow cartesian products between these 
> relations.;
>   at 
>  
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121)
> {code}
> This is because NULLTAB.a is constant folded to null, and then the condition 
> is constant folded altogether:
> {code}
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.NullPropagation ===
> GlobalLimit 21  
>  +- LocalLimit 21
> +- Project [1 AS goo#28] 
> !  +- Join LeftOuter, (a#0L = null)  
>   :- Project [id#1L AS a#0L] 
>   :  +- Range (0, 10, step=1, splits=None)   
>   +- Project  
>  +- Range (0, 10, step=1, splits=None) 
> GlobalLimit 21
> +- LocalLimit 21
>+- Project [1 AS goo#28]
>   +- Join LeftOuter, null
>  :- Project [id#1L AS a#0L]
>  :  +- Range (0, 10, step=1, splits=None)
>  +- Project
> +- Range (0, 10, step=1, splits=None)
> {code}
> And then CheckCartesianProducts doesn't like it, even though the condition 
> does not produce a cartesian product; it simply evaluates to null.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23087) CheckCartesianProduct too restrictive when condition is constant folded to false/null

2018-01-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332907#comment-16332907
 ] 

Apache Spark commented on SPARK-23087:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/20333

> CheckCartesianProduct too restrictive when condition is constant folded to 
> false/null
> -
>
> Key: SPARK-23087
> URL: https://issues.apache.org/jira/browse/SPARK-23087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Juliusz Sompolski
>Priority: Minor
>
> Running
> {code}
> sql("SELECT id as a FROM RANGE(10)").createOrReplaceTempView("A")
> sql("SELECT NULL as a FROM RANGE(10)").createOrReplaceTempView("NULLTAB")
> sql("SELECT 1 as goo FROM A LEFT OUTER JOIN NULLTAB ON A.a = 
> NULLTAB.a").collect()
> {code}
> results in:
> {code}
> org.apache.spark.sql.AnalysisException: Detected cartesian product for LEFT 
> OUTER join between logical plans
> Project
> +- Range (0, 10, step=1, splits=None)
> and
> Project
> +- Range (0, 10, step=1, splits=None)
> Join condition is missing or trivial.
> Use the CROSS JOIN syntax to allow cartesian products between these 
> relations.;
>   at 
>  
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121)
> {code}
> This is because NULLTAB.a is constant folded to null, and then the condition 
> is constant folded altogether:
> {code}
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.NullPropagation ===
> GlobalLimit 21  
>  +- LocalLimit 21
> +- Project [1 AS goo#28] 
> !  +- Join LeftOuter, (a#0L = null)  
>   :- Project [id#1L AS a#0L] 
>   :  +- Range (0, 10, step=1, splits=None)   
>   +- Project  
>  +- Range (0, 10, step=1, splits=None) 
> GlobalLimit 21
> +- LocalLimit 21
>+- Project [1 AS goo#28]
>   +- Join LeftOuter, null
>  :- Project [id#1L AS a#0L]
>  :  +- Range (0, 10, step=1, splits=None)
>  +- Project
> +- Range (0, 10, step=1, splits=None)
> {code}
> And then CheckCartesianProducts doesn't like it, even though the condition 
> does not produce a cartesian product; it simply evaluates to null.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23135) Accumulators don't show up properly in the Stages page anymore

2018-01-19 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal resolved SPARK-23135.

Resolution: Fixed

> Accumulators don't show up properly in the Stages page anymore
> --
>
> Key: SPARK-23135
> URL: https://issues.apache.org/jira/browse/SPARK-23135
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
> Environment:  
>  
>  
>Reporter: Burak Yavuz
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 2.3.0
>
> Attachments: webUIAccumulatorRegression.png
>
>
> Didn't do a lot of digging but may be caused by:
> [https://github.com/apache/spark/commit/1c70da3bfbb4016e394de2c73eb0db7cdd9a6968#diff-0d37752c6ec3d902aeff701771b4e932]
>  
> !webUIAccumulatorRegression.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23087) CheckCartesianProduct too restrictive when condition is constant folded to false/null

2018-01-19 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-23087:

Priority: Minor  (was: Major)

> CheckCartesianProduct too restrictive when condition is constant folded to 
> false/null
> -
>
> Key: SPARK-23087
> URL: https://issues.apache.org/jira/browse/SPARK-23087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Juliusz Sompolski
>Priority: Minor
>
> Running
> {code}
> sql("SELECT id as a FROM RANGE(10)").createOrReplaceTempView("A")
> sql("SELECT NULL as a FROM RANGE(10)").createOrReplaceTempView("NULLTAB")
> sql("SELECT 1 as goo FROM A LEFT OUTER JOIN NULLTAB ON A.a = 
> NULLTAB.a").collect()
> {code}
> results in:
> {code}
> org.apache.spark.sql.AnalysisException: Detected cartesian product for LEFT 
> OUTER join between logical plans
> Project
> +- Range (0, 10, step=1, splits=None)
> and
> Project
> +- Range (0, 10, step=1, splits=None)
> Join condition is missing or trivial.
> Use the CROSS JOIN syntax to allow cartesian products between these 
> relations.;
>   at 
>  
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121)
> {code}
> This is because NULLTAB.a is constant folded to null, and then the condition 
> is constant folded altogether:
> {code}
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.NullPropagation ===
> GlobalLimit 21  
>  +- LocalLimit 21
> +- Project [1 AS goo#28] 
> !  +- Join LeftOuter, (a#0L = null)  
>   :- Project [id#1L AS a#0L] 
>   :  +- Range (0, 10, step=1, splits=None)   
>   +- Project  
>  +- Range (0, 10, step=1, splits=None) 
> GlobalLimit 21
> +- LocalLimit 21
>+- Project [1 AS goo#28]
>   +- Join LeftOuter, null
>  :- Project [id#1L AS a#0L]
>  :  +- Range (0, 10, step=1, splits=None)
>  +- Project
> +- Range (0, 10, step=1, splits=None)
> {code}
> And then CheckCartesianProducts doesn't like it, even though the condition 
> does not produce a cartesian product; it simply evaluates to null.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23135) Accumulators don't show up properly in the Stages page anymore

2018-01-19 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal reassigned SPARK-23135:
--

 Assignee: Marcelo Vanzin
Fix Version/s: 2.3.0

Issue resolved by pull request 20299 https://github.com/apache/spark/pull/20299

> Accumulators don't show up properly in the Stages page anymore
> --
>
> Key: SPARK-23135
> URL: https://issues.apache.org/jira/browse/SPARK-23135
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
> Environment:  
>  
>  
>Reporter: Burak Yavuz
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 2.3.0
>
> Attachments: webUIAccumulatorRegression.png
>
>
> Didn't do a lot of digging but may be caused by:
> [https://github.com/apache/spark/commit/1c70da3bfbb4016e394de2c73eb0db7cdd9a6968#diff-0d37752c6ec3d902aeff701771b4e932]
>  
> !webUIAccumulatorRegression.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18085) SPIP: Better History Server scalability for many / large applications

2018-01-19 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-18085.

   Resolution: Fixed
Fix Version/s: 2.3.0

All of the sub-tasks of the SPIP are committed, so I'm closing this out. There 
are still a whole bunch of enhancements that can be done on top of the new 
stuff, but those can be added later.

Thanks to all who helped with reviews and testing!

> SPIP: Better History Server scalability for many / large applications
> -
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Priority: Major
>  Labels: SPIP
> Fix For: 2.3.0
>
> Attachments: screenshot-1.png, screenshot-2.png, spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12963) In cluster mode, spark_local_ip will cause driver exception: Service 'Driver' failed after 16 retries!

2018-01-19 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated SPARK-12963:
--
Shepherd: Sean Owen

> In cluster mode, spark_local_ip will cause driver exception: Service 'Driver' 
> failed after 16 retries!
> -
>
> Key: SPARK-12963
> URL: https://issues.apache.org/jira/browse/SPARK-12963
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.0
>Reporter: lichenglin
>Priority: Critical
>
> I have a 3-node cluster: namenode, second, and data1.
> I use this shell command to submit a job on namenode:
> bin/spark-submit   --deploy-mode cluster --class com.bjdv.spark.job.Abc  
> --total-executor-cores 5  --master spark://namenode:6066
> hdfs://namenode:9000/sparkjars/spark.jar
> The driver may be started on another node, such as data1.
> The problem is:
> when I set SPARK_LOCAL_IP in conf/spark-env.sh on namenode,
> the driver is started with this param, e.g.
> SPARK_LOCAL_IP=namenode,
> but the driver may start on data1,
> and it will then try to bind the IP 'namenode' on data1,
> so the driver will throw an exception like this:
>  Service 'Driver' failed after 16 retries!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself

2018-01-19 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-21994:
---
Attachment: (was: Srinivasa Reddy Vundela.url)

> Spark 2.2 can not read Parquet table created by itself
> --
>
> Key: SPARK-21994
> URL: https://issues.apache.org/jira/browse/SPARK-21994
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2 on Cloudera CDH 5.10.1, Hive 1.1
>Reporter: Jurgis Pods
>Priority: Major
>
> This seems to be a new bug introduced in Spark 2.2, since it did not occur 
> under Spark 2.1.
> When writing a dataframe to a table in Parquet format, Spark SQL does not 
> write the 'path' of the table to the Hive metastore, unlike in previous 
> versions.
> As a consequence, Spark 2.2 is not able to read the table it just created. It 
> just outputs the table header without any row content. 
> A parallel installation of Spark 1.6 at least produces an appropriate error 
> trace:
> {code:java}
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found 
> in metastore. hive.metastore.schema.verification is not enabled so recording 
> the schema version 1.1.0
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default, 
> returning NoSuchObjectException
> org.spark-project.guava.util.concurrent.UncheckedExecutionException: 
> java.util.NoSuchElementException: key not found: path
> [...]
> {code}
> h3. Steps to reproduce:
> Run the following in spark2-shell:
> {code:java}
> scala> val df = spark.sql("show databases")
> scala> df.show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> scala> df.write.format("parquet").saveAsTable("test.spark22_test")
> scala> spark.sql("select * from test.spark22_test").show()
> ++
> |databaseName|
> ++
> ++{code}
> When manually setting the path (causing the data to be saved as external 
> table), it works:
> {code:java}
> scala> df.write.option("path", 
> "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path")
> scala> spark.sql("select * from test.spark22_parquet_with_path").show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> {code}
> A second workaround is to update the metadata of the managed table created by 
> Spark 2.2:
> {code}
> spark.sql("alter table test.spark22_test set SERDEPROPERTIES 
> ('path'='hdfs://my-cluster-name:8020/hadoop/eco/hive/warehouse/test.db/spark22_test')")
> spark.catalog.refreshTable("test.spark22_test")
> spark.sql("select * from test.spark22_test").show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> {code}
> It is kind of a disaster that we are not able to read tables created by the 
> very same Spark version and have to manually specify the path as an explicit 
> option.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself

2018-01-19 Thread Zikun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zikun updated SPARK-21994:
--
Attachment: Srinivasa Reddy Vundela.url

> Spark 2.2 can not read Parquet table created by itself
> --
>
> Key: SPARK-21994
> URL: https://issues.apache.org/jira/browse/SPARK-21994
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2 on Cloudera CDH 5.10.1, Hive 1.1
>Reporter: Jurgis Pods
>Priority: Major
> Attachments: Srinivasa Reddy Vundela.url
>
>
> This seems to be a new bug introduced in Spark 2.2, since it did not occur 
> under Spark 2.1.
> When writing a dataframe to a table in Parquet format, Spark SQL does not 
> write the 'path' of the table to the Hive metastore, unlike in previous 
> versions.
> As a consequence, Spark 2.2 is not able to read the table it just created. It 
> just outputs the table header without any row content. 
> A parallel installation of Spark 1.6 at least produces an appropriate error 
> trace:
> {code:java}
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found 
> in metastore. hive.metastore.schema.verification is not enabled so recording 
> the schema version 1.1.0
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default, 
> returning NoSuchObjectException
> org.spark-project.guava.util.concurrent.UncheckedExecutionException: 
> java.util.NoSuchElementException: key not found: path
> [...]
> {code}
> h3. Steps to reproduce:
> Run the following in spark2-shell:
> {code:java}
> scala> val df = spark.sql("show databases")
> scala> df.show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> scala> df.write.format("parquet").saveAsTable("test.spark22_test")
> scala> spark.sql("select * from test.spark22_test").show()
> ++
> |databaseName|
> ++
> ++{code}
> When manually setting the path (causing the data to be saved as external 
> table), it works:
> {code:java}
> scala> df.write.option("path", 
> "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path")
> scala> spark.sql("select * from test.spark22_parquet_with_path").show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> {code}
> A second workaround is to update the metadata of the managed table created by 
> Spark 2.2:
> {code}
> spark.sql("alter table test.spark22_test set SERDEPROPERTIES 
> ('path'='hdfs://my-cluster-name:8020/hadoop/eco/hive/warehouse/test.db/spark22_test')")
> spark.catalog.refreshTable("test.spark22_test")
> spark.sql("select * from test.spark22_test").show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> {code}
> It is kind of a disaster that we are not able to read tables created by the 
> very same Spark version and have to manually specify the path as an explicit 
> option.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11499) Spark History Server UI should respect protocol when doing redirection

2018-01-19 Thread paul mackles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332811#comment-16332811
 ] 

paul mackles commented on SPARK-11499:
--

We ran into this issue running the spark-history server as a Marathon app on a 
Mesos cluster. As is typical for this kind of setup, there is a reverse-proxy 
that users go through to access the app. In our case, we are also offloading 
SSL to the reverse-proxy so communications between the reverse-proxy and 
spark-history are plain-old HTTP. I experimented with 2 different fixes:
 # Making sure that the SparkUI and History components look at 
APPLICATION_WEB_PROXY_BASE when generating redirect URLs. In order for it to 
honor the protocol, APPLICATION_WEB_PROXY_BASE must include the desired 
protocol (i.e. APPLICATION_WEB_PROXY_BASE=https://example.com)
 # Using Jetty's built-in ForwardedRequestCustomizer class to process 
"X-Forwarded-*" headers defined in rfc7239 (a configuration sketch follows the 
references below). 

Both changes worked in our environment and both changes are fairly simple. 
Looking for feedback on whether one solution is preferable to the other. For 
our environment, #2 is preferable because:
 * The reverse proxy we use is already sending these headers. 
 * It allows the spark-history server to see the actual client info, as opposed 
to that of the proxy

If there are no strong feelings one way or the other, I'll submit a PR for solution #2. 

References:
 * [https://tools.ietf.org/html/rfc7239]
 * 
[http://download.eclipse.org/jetty/stable-9/apidocs/org/eclipse/jetty/server/ForwardedRequestCustomizer.html]
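For reference, a minimal configuration sketch of option #2 on a plain Jetty 9 
server (this only illustrates the Jetty API; it is not the actual Spark change):

{code:java}
import org.eclipse.jetty.server.{ForwardedRequestCustomizer, HttpConfiguration,
  HttpConnectionFactory, Server, ServerConnector}

// Honor X-Forwarded-Proto / X-Forwarded-Host / X-Forwarded-For (rfc7239) headers
// set by the reverse proxy, so generated redirects keep the original scheme/host.
val server = new Server()
val httpConfig = new HttpConfiguration()
httpConfig.addCustomizer(new ForwardedRequestCustomizer())
val connector = new ServerConnector(server, new HttpConnectionFactory(httpConfig))
connector.setPort(18080) // placeholder port
server.addConnector(connector)
server.start()
{code}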

 

 

> Spark History Server UI should respect protocol when doing redirection
> --
>
> Key: SPARK-11499
> URL: https://issues.apache.org/jira/browse/SPARK-11499
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Lukasz Jastrzebski
>Priority: Major
>
> Use case:
> The Spark history server is behind a load balancer secured with an SSL 
> certificate. Unfortunately, clicking on an application link redirects to the http 
> protocol, which may not be exposed by the load balancer. Example flow:
> *   Trying 52.22.220.1...
> * Connected to xxx.yyy.com (52.22.220.1) port 8775 (#0)
> * WARNING: SSL: Certificate type not set, assuming PKCS#12 format.
> * Client certificate: u...@yyy.com
> * TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
> * Server certificate: *.yyy.com
> * Server certificate: Entrust Certification Authority - L1K
> * Server certificate: Entrust Root Certification Authority - G2
> > GET /history/20151030-160604-3039174572-5951-22401-0004 HTTP/1.1
> > Host: xxx.yyy.com:8775
> > User-Agent: curl/7.43.0
> > Accept: */*
> >
> < HTTP/1.1 302 Found
> < Location: 
> http://xxx.yyy.com:8775/history/20151030-160604-3039174572-5951-22401-0004
> < Connection: close
> < Server: Jetty(8.y.z-SNAPSHOT)
> <
> * Closing connection 0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23103) LevelDB store not iterating correctly when indexed value has negative value

2018-01-19 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-23103.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20284
[https://github.com/apache/spark/pull/20284]

> LevelDB store not iterating correctly when indexed value has negative value
> ---
>
> Key: SPARK-23103
> URL: https://issues.apache.org/jira/browse/SPARK-23103
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.3.0
>
>
> Marking as minor since I don't believe we currently have anything that needs 
> to store negative values in indexed fields. But I wrote a unit test and got:
>  
> {noformat}
> [error] Test 
> org.apache.spark.util.kvstore.LevelDBSuite.testNegativeIndexValues failed: 
> java.lang.AssertionError: expected:<[-50, 0, 50]> but was:<[[0, -50, 50]]>, 
> took 0.025 sec
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23103) LevelDB store not iterating correctly when indexed value has negative value

2018-01-19 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-23103:


Assignee: Marcelo Vanzin

> LevelDB store not iterating correctly when indexed value has negative value
> ---
>
> Key: SPARK-23103
> URL: https://issues.apache.org/jira/browse/SPARK-23103
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
>
> Marking as minor since I don't believe we currently have anything that needs 
> to store negative values in indexed fields. But I wrote a unit test and got:
>  
> {noformat}
> [error] Test 
> org.apache.spark.util.kvstore.LevelDBSuite.testNegativeIndexValues failed: 
> java.lang.AssertionError: expected:<[-50, 0, 50]> but was:<[[0, -50, 50]]>, 
> took 0.025 sec
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20664) Remove stale applications from SHS listing

2018-01-19 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-20664.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20138
[https://github.com/apache/spark/pull/20138]

> Remove stale applications from SHS listing
> --
>
> Key: SPARK-20664
> URL: https://issues.apache.org/jira/browse/SPARK-20664
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 2.3.0
>
>
> See spec in parent issue (SPARK-18085) for more details.
> This task is actually not explicit in the spec, and it's also an issue with 
> the current SHS. But having the SHS persist listing data makes it worse.
> Basically, the SHS currently does not detect when files are deleted from the 
> event log directory manually; so those applications are still listed, and 
> trying to see their UI will either show the UI (if it's loaded) or an error 
> (if it's not).
> With the new SHS, that also means that data is leaked in the disk stores used 
> to persist listing and UI data, making the problem worse.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20664) Remove stale applications from SHS listing

2018-01-19 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-20664:


Assignee: Marcelo Vanzin

> Remove stale applications from SHS listing
> --
>
> Key: SPARK-20664
> URL: https://issues.apache.org/jira/browse/SPARK-20664
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 2.3.0
>
>
> See spec in parent issue (SPARK-18085) for more details.
> This task is actually not explicit in the spec, and it's also an issue with 
> the current SHS. But having the SHS persist listing data makes it worse.
> Basically, the SHS currently does not detect when files are deleted from the 
> event log directory manually; so those applications are still listed, and 
> trying to see their UI will either show the UI (if it's loaded) or an error 
> (if it's not).
> With the new SHS, that also means that data is leaked in the disk stores used 
> to persist listing and UI data, making the problem worse.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22360) Add unit test for Window Specifications

2018-01-19 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332773#comment-16332773
 ] 

Jiang Xingbo commented on SPARK-22360:
--

Created https://issues.apache.org/jira/browse/SPARK-23160

> Add unit test for Window Specifications
> ---
>
> Key: SPARK-22360
> URL: https://issues.apache.org/jira/browse/SPARK-22360
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> * different partition clauses (none, one, multiple)
> * different order clauses (none, one, multiple, asc/desc, nulls first/last)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23160) Add more window sql tests

2018-01-19 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-23160:


 Summary: Add more window sql tests
 Key: SPARK-23160
 URL: https://issues.apache.org/jira/browse/SPARK-23160
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Jiang Xingbo


We should also cover the window SQL interface, for example in 
`sql/core/src/test/resources/sql-tests/inputs/window.sql`. It would also be 
interesting to see whether we can generate consistent results for window tests in 
other major databases (see the sketch below).
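As a rough illustration (not the actual contents of the test file), the cases to 
cover could look like the following, assuming a small test table {{t(k, v)}}:

{code:java}
// Hedged sketch of window specifications to exercise: no/one/multiple partition
// columns, and order clauses with asc/desc and nulls first/last.
spark.sql("SELECT k, v, sum(v) OVER () FROM t").show()
spark.sql("SELECT k, v, sum(v) OVER (PARTITION BY k) FROM t").show()
spark.sql("SELECT k, v, row_number() OVER (PARTITION BY k ORDER BY v DESC NULLS LAST) FROM t").show()
spark.sql("SELECT k, v, rank() OVER (ORDER BY k ASC NULLS FIRST, v DESC) FROM t").show()
{code}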



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22360) Add unit test for Window Specifications

2018-01-19 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332762#comment-16332762
 ] 

Jiang Xingbo commented on SPARK-22360:
--

Sorry for the late response. It's great that we can cover the DataFrame test 
cases; I really think we should have them soon. Besides, we should also cover the 
window SQL interface, for example in 
`sql/core/src/test/resources/sql-tests/inputs/window.sql`. It would also be 
interesting to see whether we can generate consistent results for window tests in 
other major databases.

[~smilegator] WDYT?

> Add unit test for Window Specifications
> ---
>
> Key: SPARK-22360
> URL: https://issues.apache.org/jira/browse/SPARK-22360
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> * different partition clauses (none, one, multiple)
> * different order clauses (none, one, multiple, asc/desc, nulls first/last)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23138) Add user guide example for multiclass logistic regression summary

2018-01-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332754#comment-16332754
 ] 

Apache Spark commented on SPARK-23138:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/20332
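For reference, a rough sketch of what such a user guide snippet could look like in 
Scala (hedged: this is not necessarily what the PR adds, and it assumes an existing 
training DataFrame named {{training}} with "label"/"features" columns):

{code:java}
import org.apache.spark.ml.classification.LogisticRegression

// Fit a multinomial logistic regression model and inspect its training summary,
// including the per-label metrics added for the multiclass case.
val lr = new LogisticRegression().setFamily("multinomial")
val model = lr.fit(training)

val summary = model.summary
println(s"Accuracy: ${summary.accuracy}")
println(s"Weighted precision: ${summary.weightedPrecision}")
summary.falsePositiveRateByLabel.zipWithIndex.foreach { case (rate, label) =>
  println(s"label $label: false positive rate = $rate")
}
{code}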

> Add user guide example for multiclass logistic regression summary
> -
>
> Key: SPARK-23138
> URL: https://issues.apache.org/jira/browse/SPARK-23138
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>Priority: Minor
>
> We haven't updated the user guide to reflect the multiclass logistic 
> regression summary added in SPARK-17139.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23138) Add user guide example for multiclass logistic regression summary

2018-01-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23138:


Assignee: (was: Apache Spark)

> Add user guide example for multiclass logistic regression summary
> -
>
> Key: SPARK-23138
> URL: https://issues.apache.org/jira/browse/SPARK-23138
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>Priority: Minor
>
> We haven't updated the user guide to reflect the multiclass logistic 
> regression summary added in SPARK-17139.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23138) Add user guide example for multiclass logistic regression summary

2018-01-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23138:


Assignee: Apache Spark

> Add user guide example for multiclass logistic regression summary
> -
>
> Key: SPARK-23138
> URL: https://issues.apache.org/jira/browse/SPARK-23138
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>Assignee: Apache Spark
>Priority: Minor
>
> We haven't updated the user guide to reflect the multiclass logistic 
> regression summary added in SPARK-17139.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23159) Update Cloudpickle to match version 0.4.2

2018-01-19 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332752#comment-16332752
 ] 

Bryan Cutler commented on SPARK-23159:
--

I can work on this

> Update Cloudpickle to match version 0.4.2
> -
>
> Key: SPARK-23159
> URL: https://issues.apache.org/jira/browse/SPARK-23159
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Update PySpark's version of Cloudpickle to match version 0.4.2.  The reasons 
> for doing this are:
> * Pick up bug fixes and improvements from the newer version
>  * Match a specific released version as closely as possible (Spark carries 
> additional changes that might be necessary) to make future upgrades easier
> There are newer versions of Cloudpickle that can fix bugs with NamedTuple 
> pickling (that Spark currently has workarounds for), but these include other 
> changes that need some verification before bringing into Spark.  Upgrading 
> first to 0.4.2 will help make this verification easier.
> Discussion on the mailing list: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Thoughts-on-Cloudpickle-Update-td23188.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23159) Update Cloudpickle to match version 0.4.2

2018-01-19 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-23159:


 Summary: Update Cloudpickle to match version 0.4.2
 Key: SPARK-23159
 URL: https://issues.apache.org/jira/browse/SPARK-23159
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Bryan Cutler


Update PySpark's version of Cloudpickle to match version 0.4.2.  The reasons 
for doing this are:
 * Pick up bug fixes and improvements from the newer version
 * Match a specific released version as closely as possible (Spark carries 
additional changes that might be necessary) to make future upgrades easier

There are newer versions of Cloudpickle that can fix bugs with NamedTuple 
pickling (that Spark currently has workarounds for), but these include other 
changes that need some verification before bringing into Spark.  Upgrading 
first to 0.4.2 will help make this verification easier.

Discussion on the mailing list: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Thoughts-on-Cloudpickle-Update-td23188.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23109) ML 2.3 QA: API: Python API coverage

2018-01-19 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332703#comment-16332703
 ] 

Bryan Cutler commented on SPARK-23109:
--

[~josephkb] the image module is missing many of the get* methods that are in 
Scala - is it meant to have an equivalent API or is the usage a little 
different?
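For context, a hedged sketch of the Scala-side accessors in question 
({{org.apache.spark.ml.image.ImageSchema}} in 2.3; the directory path is a 
placeholder):

{code:java}
import org.apache.spark.ml.image.ImageSchema

// Read images into a DataFrame and pull fields out of the image struct using the
// Scala getters that do not yet have Python counterparts.
val df = ImageSchema.readImages("/tmp/images") // placeholder path
val image = df.select("image").head.getStruct(0)
println(ImageSchema.getOrigin(image))
println((ImageSchema.getWidth(image), ImageSchema.getHeight(image), ImageSchema.getNChannels(image)))
{code}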

> ML 2.3 QA: API: Python API coverage
> ---
>
> Key: SPARK-23109
> URL: https://issues.apache.org/jira/browse/SPARK-23109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23109) ML 2.3 QA: API: Python API coverage

2018-01-19 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332698#comment-16332698
 ] 

Bryan Cutler commented on SPARK-23109:
--

I did the following: generated HTML doc and checked for consistency with Scala, 
 did not see any API breaking changes, checked for missing items (see list 
below), checked default param values match.  No blocking or major issues found.

Items requiring follow up, I will create (related) JIRAS to fix:

classification:
    GBTClassifier - missing featureSubsetStrategy, should be moved to 
TreeEnsembleParams
    GBTClassificationModel - missing numClasses, should inherit from 
JavaClassificationModel

clustering:
    GaussianMixtureModel - missing gaussians, need to serialize 
Array[MultivariateGaussian]?
    LDAModel - missing topicsMatrix - can send Matrix through Py4J?

evaluation:
    ClusteringEvaluator - DOC describe silhouette like scaladoc

feature:
    Bucketizer - multiple input/output cols, splitsArray - 
https://issues.apache.org/jira/browse/SPARK-22797
    ChiSqSelector - DOC selectorType desc missing new types
    QuantileDiscretizer - multiple input output cols - 
https://issues.apache.org/jira/browse/SPARK-22796

fpm:
    DOC associationRules should say return "DataFrame"

image:
    missing columnSchema, get*, scala missing toNDArray

regression:
    LinearRegressionSummary - missing r2adj

stat:
    missing Summarizer class - https://issues.apache.org/jira/browse/SPARK-21741

tuning:
    missing subModels, hasSubModels - 
https://issues.apache.org/jira/browse/SPARK-22005

> ML 2.3 QA: API: Python API coverage
> ---
>
> Key: SPARK-23109
> URL: https://issues.apache.org/jira/browse/SPARK-23109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23137) spark.kubernetes.executor.podNamePrefix is ignored

2018-01-19 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23137.

   Resolution: Fixed
 Assignee: Anirudh Ramanathan
Fix Version/s: 2.3.0

> spark.kubernetes.executor.podNamePrefix is ignored
> --
>
> Key: SPARK-23137
> URL: https://issues.apache.org/jira/browse/SPARK-23137
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Assignee: Anirudh Ramanathan
>Priority: Major
> Fix For: 2.3.0
>
>
> [~liyinan926] is fixing this as we speak. Should be a very minor change.
> It's also a non-critical option, so, if we decide that the safer thing is to 
> just remove it, we can do that as well. Will leave that decision to the 
> release czar and reviewers.
>  
> [~vanzin] [~felixcheung] [~sameerag]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23104) Document that kubernetes is still "experimental"

2018-01-19 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23104.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20314
[https://github.com/apache/spark/pull/20314]

> Document that kubernetes is still "experimental"
> 
>
> Key: SPARK-23104
> URL: https://issues.apache.org/jira/browse/SPARK-23104
> Project: Spark
>  Issue Type: Task
>  Components: Documentation, Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Anirudh Ramanathan
>Priority: Critical
> Fix For: 2.3.0
>
>
> As discussed in the mailing list, we should document that the kubernetes 
> backend is still experimental.
> That does not need to include any code changes. This is just meant to tell 
> users that they can expect changes in how the backend behaves in future 
> versions, and that things like configuration, the container image's layout 
> and entry points might change.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16617) Upgrade to Avro 1.8.x

2018-01-19 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-16617:
--

Assignee: (was: Marcelo Vanzin)

> Upgrade to Avro 1.8.x
> -
>
> Key: SPARK-16617
> URL: https://issues.apache.org/jira/browse/SPARK-16617
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 2.1.0
>Reporter: Ben McCann
>Priority: Major
>
> Avro 1.8 makes Avro objects serializable so that you can easily have an RDD 
> containing Avro objects.
> See https://issues.apache.org/jira/browse/AVRO-1502
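For illustration, a hedged sketch of what that enables (assuming a hypothetical 
Avro-generated SpecificRecord class {{User}}; with Avro 1.8 such generated classes 
implement java.io.Serializable):

{code:java}
// With Avro 1.8+, generated record classes are Serializable, so they can be used
// directly as RDD elements without extra serialization glue.
// `User` is a hypothetical class generated from an Avro schema with a "name" field.
val users = sc.parallelize(Seq(new User("alice", 1), new User("bob", 2)))
val names = users.map(_.getName).collect()
{code}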



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16617) Upgrade to Avro 1.8.x

2018-01-19 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-16617:
--

Assignee: Marcelo Vanzin

> Upgrade to Avro 1.8.x
> -
>
> Key: SPARK-16617
> URL: https://issues.apache.org/jira/browse/SPARK-16617
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 2.1.0
>Reporter: Ben McCann
>Assignee: Marcelo Vanzin
>Priority: Major
>
> Avro 1.8 makes Avro objects serializable so that you can easily have an RDD 
> containing Avro objects.
> See https://issues.apache.org/jira/browse/AVRO-1502



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23158) Move HadoopFsRelationTest test suites from sql/hive to sql/core

2018-01-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23158:


Assignee: Xiao Li  (was: Apache Spark)

> Move HadoopFsRelationTest test suites from sql/hive to sql/core
> --
>
> Key: SPARK-23158
> URL: https://issues.apache.org/jira/browse/SPARK-23158
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23158) Move HadoopFsRelationTest test suites from sql/hive to sql/core

2018-01-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332570#comment-16332570
 ] 

Apache Spark commented on SPARK-23158:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/20331

> Move HadoopFsRelationTest test suites from sql/hive to sql/core
> --
>
> Key: SPARK-23158
> URL: https://issues.apache.org/jira/browse/SPARK-23158
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23158) Move HadoopFsRelationTest test suites from sql/hive to sql/core

2018-01-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23158:


Assignee: Apache Spark  (was: Xiao Li)

> Move HadoopFsRelationTest test suites from sql/hive to sql/core
> --
>
> Key: SPARK-23158
> URL: https://issues.apache.org/jira/browse/SPARK-23158
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23158) Move HadoopFsRelationTest test suites from sql/hive to sql/core

2018-01-19 Thread Xiao Li (JIRA)
Xiao Li created SPARK-23158:
---

 Summary: Move HadoopFsRelationTest test suites from sql/hive to sql/core
 Key: SPARK-23158
 URL: https://issues.apache.org/jira/browse/SPARK-23158
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Xiao Li
Assignee: Xiao Li






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23149) polish ColumnarBatch

2018-01-19 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23149.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> polish ColumnarBatch
> 
>
> Key: SPARK-23149
> URL: https://issues.apache.org/jira/browse/SPARK-23149
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet

2018-01-19 Thread Tomasz Bartczak (JIRA)
Tomasz Bartczak created SPARK-23157:
---

 Summary: withColumn fails for a column that is a result of mapped 
DataSet
 Key: SPARK-23157
 URL: https://issues.apache.org/jira/browse/SPARK-23157
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1
Reporter: Tomasz Bartczak


Having 

{code:java}
case class R(id: String)
val ds = spark.createDataset(Seq(R("1")))
{code}

This works:
{code}
scala> ds.withColumn("n", ds.col("id"))
res16: org.apache.spark.sql.DataFrame = [id: string, n: string]
{code}

but when we map over ds it fails:
{code}
scala> ds.withColumn("n", ds.map(a => a).col("id"))
org.apache.spark.sql.AnalysisException: resolved attribute(s) id#55 missing 
from id#4 in operator !Project [id#4, id#55 AS n#57];;
!Project [id#4, id#55 AS n#57]
+- LocalRelation [id#4]

  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
  at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2884)
  at org.apache.spark.sql.Dataset.select(Dataset.scala:1150)
  at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1905)
  ... 48 elided
{code}
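A hedged workaround sketch until this is fixed: resolve both the new column and 
the source column against the same (mapped) plan, instead of mixing attributes 
from {{ds}} and {{ds.map(...)}}:

{code:java}
// Works because "id" and the new column "n" are resolved against one logical plan,
// mirroring the ds.withColumn("n", ds.col("id")) case that already succeeds.
val mapped = ds.map(a => a)
mapped.withColumn("n", mapped.col("id")).show()
{code}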



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23156) Code of method "initialize(I)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB

2018-01-19 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332434#comment-16332434
 ] 

Marco Gaido commented on SPARK-23156:
-

[~kzawisto] a lot of work has been done on this; part of it is already in the 2.2 
maintenance releases and more will be in 2.3 (too many tickets to list). Please 
try to reproduce on the current master; I am quite sure this is a duplicate of 
several similar tickets and that it will work there. Thanks.

>  Code of method "initialize(I)V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> 
>
> Key: SPARK-23156
> URL: https://issues.apache.org/jira/browse/SPARK-23156
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, SQL
>Affects Versions: 2.1.1, 2.1.2
> Environment: Ubuntu 16.04, Scala 2.11, Java 8, 8-node YARN cluster.
>Reporter: Krystian Zawistowski
>Priority: Major
>
> I am getting this while trying to generate a random DataFrame (300 columns, 5000 
> rows; Ints, Floats and Timestamps in equal ratios). This is similar (but not 
> identical) to SPARK-18492 and a few more tickets that were supposed to be fixed in 2.1.1.
> Part of the logs is below. The logs contain hundreds of millions of lines of 
> generated code, apparently one block for each of the 1,500,000 fields of the 
> DataFrame, which is very suspicious. 
> {code:java}
> 18/01/19 06:33:15 INFO CodeGenerator: Code generated in 246.168393 ms$
> 18/01/19 06:33:21 ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of method "initialize(I)V" 
> of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB$
> /* 001 */ public java.lang.Object generate(Object[] references) {$
> /* 002 */ return new SpecificUnsafeProjection(references);$
> /* 003 */ }$
> /* 004 */$
> /* 005 */ class SpecificUnsafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection {$
> /* 006 */$
> /* 007 */ private Object[] references;$
> /* 008 */ private org.apache.spark.util.random.XORShiftRandom rng;$
> /* 009 */ private org.apache.spark.util.random.XORShiftRandom rng1;$
> /* 010 */ private org.apache.spark.util.random.XORShiftRandom rng2;$
> /* 011 */ private org.apache.spark.util.random.XORShiftRandom rng3;$
> /* 012 */ private org.apache.spark.util.random.XORShiftRandom rng4;$
> {code}
> Reproduction:
> {code:java}
> import java.sql.Timestamp
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.{Column, DataFrame, SparkSession}
> class RandomData(val numberOfColumns: Int, val numberOfRows: Int) extends 
> Serializable {
>   private val minEpoch = Timestamp.valueOf("1800-01-01 00:00:00").getTime
>   private val maxEpoch = Timestamp.valueOf("2200-01-01 00:00:00").getTime
>   val idColumn = "id"
>   import org.apache.spark.sql.functions._
> def generateData(path: String): Unit = {
>   val spark: SparkSession = SparkSession.builder().getOrCreate()
>   materializeTable(spark).write.parquet(path + "/source")
> }
> private def materializeTable(spark: SparkSession): DataFrame = {
>   var sourceDF = spark.sqlContext.range(0, 
> numberOfRows).withColumnRenamed("id", 
>  idColumn)
>   val columns = sourceDF(idColumn) +: (0 until numberOfColumns)
>   .flatMap(x => Seq(getTimeColumn(x), getNumberColumn(x), 
> getCategoryColumn(x)))
> sourceDF.select(columns: _*)
> }
> private def getTimeColumn(seed: Int): Column = {
>   val uniqueSeed = seed + numberOfColumns * 3
>   rand(seed = uniqueSeed)
>.multiply(maxEpoch - minEpoch)
>.divide(1000).cast("long")
>.plus(minEpoch / 1000).cast(TimestampType).alias(s"time$seed")
> }
> private def getNumberColumn(seed: Int, namePrefix: String = "number"): Column 
> = {
>   val uniqueSeed = seed + numberOfColumns * 4
>   randn(seed = uniqueSeed).alias(s"$namePrefix$seed")
> }
> private def getCategoryColumn(seed: Int): Column = {
>   val uniqueSeed = seed + numberOfColumns * 4
>   rand(seed = uniqueSeed).multiply(100).cast("int").alias(s"category$seed")
> }
> }
> object GenerateData{
> def main(args: Array[String]): Unit = {
>   new RandomData(args(0).toInt, args(1).toInt).generateData(args(2))
> }
> }
> {code}
> Please package a jar and run as follows:
> {code:java}
> spark-submit --master yarn \
>  --driver-memory 12g \
>  --executor-memory 12g \
>  --deploy-mode cluster \
>  --class GenerateData \
>  --master yarn \
>  100 5000 "hdfs:///tmp/parquet"
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23085) API parity for mllib.linalg.Vectors.sparse

2018-01-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23085.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20275
[https://github.com/apache/spark/pull/20275]

> API parity for mllib.linalg.Vectors.sparse 
> ---
>
> Key: SPARK-23085
> URL: https://issues.apache.org/jira/browse/SPARK-23085
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.4.0
>
>
> Both {{ML.Vectors#sparse(size: Int, indices: Array[Int], values: Array[Double])}} 
> and {{ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)])}} support 
> zero-length vectors.
> In the old MLlib API, {{MLLib.Vectors.sparse(size: Int, indices: Array[Int], 
> values: Array[Double])}} also supports them.
> However, {{MLLib.Vectors.sparse(size: Int, elements: Seq[(Int, Double)])}} requires 
> a positive length, as the example below shows.
>  
> {code:java}
> scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[Int], 
> Array.empty[Double])
> res15: org.apache.spark.ml.linalg.Vector = (0,[],[])
> scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[(Int, 
> Double)])
> res16: org.apache.spark.ml.linalg.Vector = (0,[],[])
> scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[Int], 
> Array.empty[Double])
> res17: org.apache.spark.mllib.linalg.Vector = (0,[],[])
> scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[(Int, 
> Double)])
> java.lang.IllegalArgumentException: requirement failed: The size of the 
> requested sparse vector must be greater than 0.
>   at scala.Predef$.require(Predef.scala:224)
>   at org.apache.spark.mllib.linalg.Vectors$.sparse(Vectors.scala:315)
>   ... 50 elided
>  
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23085) API parity for mllib.linalg.Vectors.sparse

2018-01-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-23085:
-

Assignee: zhengruifeng

> API parity for mllib.linalg.Vectors.sparse 
> ---
>
> Key: SPARK-23085
> URL: https://issues.apache.org/jira/browse/SPARK-23085
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.4.0
>
>
> Both {{ML.Vectors#sparse(size: Int, indices: Array[Int], values: Array[Double])}} 
> and {{ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)])}} support 
> zero-length vectors.
> In the old MLlib API, {{MLLib.Vectors.sparse(size: Int, indices: Array[Int], 
> values: Array[Double])}} also supports them.
> However, {{MLLib.Vectors.sparse(size: Int, elements: Seq[(Int, Double)])}} requires 
> a positive length, as the example below shows.
>  
> {code:java}
> scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[Int], 
> Array.empty[Double])
> res15: org.apache.spark.ml.linalg.Vector = (0,[],[])
> scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[(Int, 
> Double)])
> res16: org.apache.spark.ml.linalg.Vector = (0,[],[])
> scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[Int], 
> Array.empty[Double])
> res17: org.apache.spark.mllib.linalg.Vector = (0,[],[])
> scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[(Int, 
> Double)])
> java.lang.IllegalArgumentException: requirement failed: The size of the 
> requested sparse vector must be greater than 0.
>   at scala.Predef$.require(Predef.scala:224)
>   at org.apache.spark.mllib.linalg.Vectors$.sparse(Vectors.scala:315)
>   ... 50 elided
>  
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23156) Code of method "initialize(I)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB

2018-01-19 Thread Krystian Zawistowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krystian Zawistowski updated SPARK-23156:
-
Description: 
I am getting this trying to generate a random DataFrame  (300 columns, 5000 
rows, Ints, Floats and Timestamps in equal ratios). This is similar (but not 
identical) to SPARK-18492 and few tickets more that should be done in 2.1.1.

Part of the logs below. They contain hundreds of millions of lines of generated 
code, apparently for each of the 1500 000 fields of the dataframe which is very 
suspicious. 
{code:java}
18/01/19 06:33:15 INFO CodeGenerator: Code generated in 246.168393 ms$
18/01/19 06:33:21 ERROR CodeGenerator: failed to compile: 
org.codehaus.janino.JaninoRuntimeException: Code of method "initialize(I)V" of 
class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows beyond 64 KB$
/* 001 */ public java.lang.Object generate(Object[] references) {$
/* 002 */ return new SpecificUnsafeProjection(references);$
/* 003 */ }$
/* 004 */$
/* 005 */ class SpecificUnsafeProjection extends 
org.apache.spark.sql.catalyst.expressions.UnsafeProjection {$
/* 006 */$
/* 007 */ private Object[] references;$
/* 008 */ private org.apache.spark.util.random.XORShiftRandom rng;$
/* 009 */ private org.apache.spark.util.random.XORShiftRandom rng1;$
/* 010 */ private org.apache.spark.util.random.XORShiftRandom rng2;$
/* 011 */ private org.apache.spark.util.random.XORShiftRandom rng3;$
/* 012 */ private org.apache.spark.util.random.XORShiftRandom rng4;$
{code}
Reproduction:
{code:java}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
class RandomData(val numberOfColumns: Int, val numberOfRows: Int) extends 
Serializable {
  private val minEpoch = Timestamp.valueOf("1800-01-01 00:00:00").getTime
  private val maxEpoch = Timestamp.valueOf("2200-01-01 00:00:00").getTime
  val idColumn = "id"
  import org.apache.spark.sql.functions._
def generateData(path: String): Unit = {
  val spark: SparkSession = SparkSession.builder().getOrCreate()
  materializeTable(spark).write.parquet(path + "/source")
}

private def materializeTable(spark: SparkSession): DataFrame = {
  var sourceDF = spark.sqlContext.range(0, 
numberOfRows).withColumnRenamed("id", 
 idColumn)
  val columns = sourceDF(idColumn) +: (0 until numberOfColumns)
  .flatMap(x => Seq(getTimeColumn(x), getNumberColumn(x), getCategoryColumn(x)))
sourceDF.select(columns: _*)
}

private def getTimeColumn(seed: Int): Column = {
  val uniqueSeed = seed + numberOfColumns * 3
  rand(seed = uniqueSeed)
   .multiply(maxEpoch - minEpoch)
   .divide(1000).cast("long")
   .plus(minEpoch / 1000).cast(TimestampType).alias(s"time$seed")
}
private def getNumberColumn(seed: Int, namePrefix: String = "number"): Column = 
{
  val uniqueSeed = seed + numberOfColumns * 4
  randn(seed = uniqueSeed).alias(s"$namePrefix$seed")
}
private def getCategoryColumn(seed: Int): Column = {
  val uniqueSeed = seed + numberOfColumns * 4
  rand(seed = uniqueSeed).multiply(100).cast("int").alias(s"category$seed")
}

}

object GenerateData{
def main(args: Array[String]): Unit = {
  new RandomData(args(0).toInt, args(1).toInt).generateData(args(2))
}
}

{code}
Please package a jar and run as follows:
{code:java}
spark-submit --master yarn \
 --driver-memory 12g \
 --executor-memory 12g \
 --deploy-mode cluster \
 --class GenerateData \
 --master yarn \
 100 5000 "hdfs:///tmp/parquet"
{code}

  was:
I am getting this trying to generate a random DataFrame  (300 columns, 5000 
rows, Ints, Floats and Timestamps in equal ratios). This is similar (but not 
identical) to SPARK-18492 and few tickets more that should be done in 2.1.1.

Part of the logs below. They contain hundreds of millions of lines of generated 
code, apparently for each of the 1500 000 fields of the dataframe which is very 
suspicious. 
{code:java}
18/01/19 06:33:15 INFO CodeGenerator: Code generated in 246.168393 ms$
18/01/19 06:33:21 ERROR CodeGenerator: failed to compile: 
org.codehaus.janino.JaninoRuntimeException: Code of method "initialize(I)V" of 
class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows beyond 64 KB$
/* 001 */ public java.lang.Object generate(Object[] references) {$
/* 002 */ return new SpecificUnsafeProjection(references);$
/* 003 */ }$
/* 004 */$
/* 005 */ class SpecificUnsafeProjection extends 
org.apache.spark.sql.catalyst.expressions.UnsafeProjection {$
/* 006 */$
/* 007 */ private Object[] references;$
/* 008 */ private org.apache.spark.util.random.XORShiftRandom rng;$
/* 009 */ private org.apache.spark.util.random.XORShiftRandom rng1;$
/* 010 */ private org.apache.spark.util.random.XORShiftRandom rng2;$
/* 011 */ private org.apache.spark.util.random.XORShiftRandom rng3;$
/* 012 */ private org.apache.spark.util.random.XORShiftRandom rng4;$
{code}
Reproduction:

[jira] [Updated] (SPARK-23156) Code of method "initialize(I)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB

2018-01-19 Thread Krystian Zawistowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krystian Zawistowski updated SPARK-23156:
-
Description: 
I am getting this trying to generate a random DataFrame  (300 columns, 5000 
rows, Ints, Floats and Timestamps in equal ratios). This is similar (but not 
identical) to SPARK-18492 and few tickets more that should be done in 2.1.1.

Part of the logs below. They contain hundreds of millions of lines of generated 
code, apparently for each of the 1500 000 fields of the dataframe which is very 
suspicious. 
{code:java}
18/01/19 06:33:15 INFO CodeGenerator: Code generated in 246.168393 ms$
18/01/19 06:33:21 ERROR CodeGenerator: failed to compile: 
org.codehaus.janino.JaninoRuntimeException: Code of method "initialize(I)V" of 
class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows beyond 64 KB$
/* 001 */ public java.lang.Object generate(Object[] references) {$
/* 002 */ return new SpecificUnsafeProjection(references);$
/* 003 */ }$
/* 004 */$
/* 005 */ class SpecificUnsafeProjection extends 
org.apache.spark.sql.catalyst.expressions.UnsafeProjection {$
/* 006 */$
/* 007 */ private Object[] references;$
/* 008 */ private org.apache.spark.util.random.XORShiftRandom rng;$
/* 009 */ private org.apache.spark.util.random.XORShiftRandom rng1;$
/* 010 */ private org.apache.spark.util.random.XORShiftRandom rng2;$
/* 011 */ private org.apache.spark.util.random.XORShiftRandom rng3;$
/* 012 */ private org.apache.spark.util.random.XORShiftRandom rng4;$
{code}
Reproduction:
{code:java}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
class RandomData(val numberOfColumns: Int, val numberOfRows: Int) extends 
Serializable {
  private val minEpoch = Timestamp.valueOf("1800-01-01 00:00:00").getTime
  private val maxEpoch = Timestamp.valueOf("2200-01-01 00:00:00").getTime
  val idColumn = "id"
  import org.apache.spark.sql.functions._
def generateData(path: String): Unit = {
  val spark: SparkSession = SparkSession.builder().getOrCreate()
  materializeTable(spark).write.parquet(path + "/source")
}

private def materializeTable(spark: SparkSession): DataFrame = {
  var sourceDF = spark.sqlContext.range(0, 
numberOfRows).withColumnRenamed("id", 
 idColumn)
  val columns = sourceDF(idColumn) +: (0 until numberOfColumns)
  .flatMap(x => Seq(getTimeColumn(x), getNumberColumn(x), getCategoryColumn(x)))
sourceDF.select(columns: _*)
}

private def getTimeColumn(seed: Int): Column = {
  val uniqueSeed = seed + numberOfColumns * 3
  rand(seed = uniqueSeed).multiply(maxEpoch - 
minEpoch).divide(1000).cast("long").plus(minEpoch / 
1000).cast(TimestampType).alias(s"time$seed")
}
private def getNumberColumn(seed: Int, namePrefix: String = "number"): Column = 
{
  val uniqueSeed = seed + numberOfColumns * 4
  randn(seed = uniqueSeed).alias(s"$namePrefix$seed")
}
private def getCategoryColumn(seed: Int): Column = {
  val uniqueSeed = seed + numberOfColumns * 4
  rand(seed = uniqueSeed).multiply(100).cast("int").alias(s"category$seed")
}

}

object GenerateData{
def main(args: Array[String]): Unit = {
  new RandomData(args(0).toInt, args(1).toInt).generateData(args(2))
}
}

{code}
Please package a jar and run as follows:
{code:java}
spark-submit --master yarn \
 --driver-memory 12g \
 --executor-memory 12g \
 --deploy-mode cluster \
 --class GenerateData \
 --master yarn \
 100 5000 "hdfs:///tmp/parquet"
{code}

  was:
I am getting this trying to generate a random DataFrame  (300 columns, 5000 
rows, Ints, Floats and Timestamps in equal ratios). This is similar (but not 
identical) to SPARK-18492 and few tickets more that should be done in 2.1.1.

Part of the logs below. They contain hundreds of millions of lines of generated 
code, apparently for each of the 1500 000 fields of the dataframe which is very 
suspicious. 
{code:java}
18/01/19 06:33:15 INFO CodeGenerator: Code generated in 246.168393 ms$
18/01/19 06:33:21 ERROR CodeGenerator: failed to compile: 
org.codehaus.janino.JaninoRuntimeException: Code of method "initialize(I)V" of 
class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows beyond 64 KB$
/* 001 */ public java.lang.Object generate(Object[] references) {$
/* 002 */ return new SpecificUnsafeProjection(references);$
/* 003 */ }$
/* 004 */$
/* 005 */ class SpecificUnsafeProjection extends 
org.apache.spark.sql.catalyst.expressions.UnsafeProjection {$
/* 006 */$
/* 007 */ private Object[] references;$
/* 008 */ private org.apache.spark.util.random.XORShiftRandom rng;$
/* 009 */ private org.apache.spark.util.random.XORShiftRandom rng1;$
/* 010 */ private org.apache.spark.util.random.XORShiftRandom rng2;$
/* 011 */ private org.apache.spark.util.random.XORShiftRandom rng3;$
/* 012 */ private org.apache.spark.util.random.XORShiftRandom rng4;$
{code}
Reproduction:
{code:java}

[jira] [Updated] (SPARK-23156) Code of method "initialize(I)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB

2018-01-19 Thread Krystian Zawistowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krystian Zawistowski updated SPARK-23156:
-
Description: 
I am getting this trying to generate a random DataFrame  (300 columns, 5000 
rows, Ints, Floats and Timestamps in equal ratios). This is similar (but not 
identical) to SPARK-18492 and few tickets more that should be done in 2.1.1.

Part of the logs below. They contain hundreds of millions of lines of generated 
code, apparently for each of the 1500 000 fields of the dataframe which is very 
suspicious. 
{code:java}
18/01/19 06:33:15 INFO CodeGenerator: Code generated in 246.168393 ms$
18/01/19 06:33:21 ERROR CodeGenerator: failed to compile: 
org.codehaus.janino.JaninoRuntimeException: Code of method "initialize(I)V" of 
class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows beyond 64 KB$
/* 001 */ public java.lang.Object generate(Object[] references) {$
/* 002 */ return new SpecificUnsafeProjection(references);$
/* 003 */ }$
/* 004 */$
/* 005 */ class SpecificUnsafeProjection extends 
org.apache.spark.sql.catalyst.expressions.UnsafeProjection {$
/* 006 */$
/* 007 */ private Object[] references;$
/* 008 */ private org.apache.spark.util.random.XORShiftRandom rng;$
/* 009 */ private org.apache.spark.util.random.XORShiftRandom rng1;$
/* 010 */ private org.apache.spark.util.random.XORShiftRandom rng2;$
/* 011 */ private org.apache.spark.util.random.XORShiftRandom rng3;$
/* 012 */ private org.apache.spark.util.random.XORShiftRandom rng4;$
{code}
Reproduction:
{code:java}
import java.sql.Timestamp

import org.apache.spark.sql.types._
import org.apache.spark.sql.{Column, DataFrame, SparkSession}

class RandomData(val numberOfColumns: Int, val numberOfRows: Int) extends Serializable {

  import org.apache.spark.sql.functions._

  private val minEpoch = Timestamp.valueOf("1800-01-01 00:00:00").getTime
  private val maxEpoch = Timestamp.valueOf("2200-01-01 00:00:00").getTime
  val idColumn = "id"

  def generateFeatureLearningData(path: String): Unit = {
    val spark: SparkSession = SparkSession.builder().getOrCreate()
    materializeSourceFeatureLearningTable(spark).write.parquet(path + "/source")
    materializeTargetTable(spark).write.parquet(path + "/target")
  }

  def generateModelLearningData(path: String): Unit = {
    val spark: SparkSession = SparkSession.builder().getOrCreate()
    materializeTargetTable(spark).write.parquet(path + "/target")
    materializeSourceModelLearningTable(spark).write.parquet(path + "/source")
  }

  // materializeTargetTable and materializeSourceModelLearningTable are not included
  // in this report; the reported error comes from the wide random projection below.

  private def materializeSourceFeatureLearningTable(spark: SparkSession): DataFrame = {
    val sourceDF = spark.sqlContext.range(0, numberOfRows).withColumnRenamed("id", idColumn)
    val columns = sourceDF(idColumn) +: (0 until numberOfColumns)
      .flatMap(x => Seq(getTimeColumn(x), getNumberColumn(x), getCategoryColumn(x)))
    sourceDF.select(columns: _*)
  }

  private def getTimeColumn(seed: Int): Column = {
    val uniqueSeed = seed + numberOfColumns * 3
    rand(seed = uniqueSeed).multiply(maxEpoch - minEpoch).divide(1000).cast("long")
      .plus(minEpoch / 1000).cast(TimestampType).alias(s"time$seed")
  }

  private def getNumberColumn(seed: Int, namePrefix: String = "number"): Column = {
    val uniqueSeed = seed + numberOfColumns * 4
    randn(seed = uniqueSeed).alias(s"$namePrefix$seed")
  }

  private def getCategoryColumn(seed: Int): Column = {
    val uniqueSeed = seed + numberOfColumns * 4
    rand(seed = uniqueSeed).multiply(100).cast("int").alias(s"category$seed")
  }
}

object GenerateData {
  def main(args: Array[String]): Unit = {
    new RandomData(args(0).toInt, args(1).toInt).generateFeatureLearningData(args(2))
  }
}

{code}
Please package a jar and run as follows:
{code:java}
spark-submit --master yarn \
 --driver-memory 12g \
 --executor-memory 12g \
 --deploy-mode cluster \
 --class GenerateData \
 <path-to-packaged-jar> 100 5000 "hdfs:///tmp/parquet"
{code}
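For reference, a much smaller spark-shell sketch of the same pattern (my own illustration, 
not part of the original report): each independent rand()/randn() column contributes its own 
XORShiftRandom field and setup code to the generated SpecificUnsafeProjection, so selecting 
a few hundred of them should be enough to push initialize(I)V past the 64 KB method limit.
{code:java}
// Hypothetical, simplified repro: build ~600 random columns and project them.
import org.apache.spark.sql.functions.{rand, randn}

val randomCols = (0 until 300).flatMap { i =>
  Seq(rand(i.toLong).alias(s"r$i"), randn(i.toLong).alias(s"n$i"))
}

// Every rand/randn gets its own RNG field in the generated projection class,
// which is what makes the initialize() method grow with the number of columns.
spark.range(5000).select(randomCols: _*).write.mode("overwrite").parquet("/tmp/wide_random")
{code}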

  was:
I am getting this when trying to generate a random DataFrame (300 columns, 5000 
rows; Ints, Floats and Timestamps in equal ratios). This is similar (but not 
identical) to SPARK-18492 and a few more tickets that should have been fixed in 2.1.1.

Part of the logs is shown below. They contain hundreds of millions of lines of 
generated code, apparently one block for each of the 1,500,000 fields of the 
DataFrame, which is very suspicious.
{code:java}
18/01/19 06:33:15 INFO CodeGenerator: Code generated in 246.168393 ms$
18/01/19 06:33:21 ERROR CodeGenerator: failed to compile: 
org.codehaus.janino.JaninoRuntimeException: Code of method "initialize(I)V" of 
class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows beyond 64 KB$
/* 001 */ public java.lang.Object generate(Object[] references) {$
/* 002 */ return new SpecificUnsafeProjection(references);$
/* 003 */ }$
/* 004 */$
/* 005 */ class SpecificUnsafeProjection extends 
org.apache.spark.sql.catalyst.expressions.UnsafeProjection {$
/* 006 */$
/* 007 */ private 

[jira] [Created] (SPARK-23156) Code of method "initialize(I)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB

2018-01-19 Thread Krystian Zawistowski (JIRA)
Krystian Zawistowski created SPARK-23156:


 Summary:  Code of method "initialize(I)V" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows beyond 64 KB
 Key: SPARK-23156
 URL: https://issues.apache.org/jira/browse/SPARK-23156
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit, SQL
Affects Versions: 2.1.2, 2.1.1
 Environment: Ubuntu 16.04, Scala 2.11, Java 8, 8-node YARN cluster.
Reporter: Krystian Zawistowski


I am getting this when trying to generate a random DataFrame (300 columns, 5000 
rows; Ints, Floats and Timestamps in equal ratios). This is similar (but not 
identical) to SPARK-18492 and a few more tickets that should have been fixed in 2.1.1.

Part of the logs is shown below. They contain hundreds of millions of lines of 
generated code, apparently one block for each of the 1,500,000 fields of the 
DataFrame, which is very suspicious.
{code:java}
18/01/19 06:33:15 INFO CodeGenerator: Code generated in 246.168393 ms$
18/01/19 06:33:21 ERROR CodeGenerator: failed to compile: 
org.codehaus.janino.JaninoRuntimeException: Code of method "initialize(I)V" of 
class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows beyond 64 KB$
/* 001 */ public java.lang.Object generate(Object[] references) {$
/* 002 */ return new SpecificUnsafeProjection(references);$
/* 003 */ }$
/* 004 */$
/* 005 */ class SpecificUnsafeProjection extends 
org.apache.spark.sql.catalyst.expressions.UnsafeProjection {$
/* 006 */$
/* 007 */ private Object[] references;$
/* 008 */ private org.apache.spark.util.random.XORShiftRandom rng;$
/* 009 */ private org.apache.spark.util.random.XORShiftRandom rng1;$
/* 010 */ private org.apache.spark.util.random.XORShiftRandom rng2;$
/* 011 */ private org.apache.spark.util.random.XORShiftRandom rng3;$
/* 012 */ private org.apache.spark.util.random.XORShiftRandom rng4;$
{code}
Reproduction:
{code:java}
import java.sql.Timestamp

import org.apache.spark.sql.types._
import org.apache.spark.sql.{Column, DataFrame, SparkSession}

class RandomData(val numberOfColumns: Int, val numberOfRows: Int) extends Serializable {

  import org.apache.spark.sql.functions._

  private val minEpoch = Timestamp.valueOf("1800-01-01 00:00:00").getTime
  private val maxEpoch = Timestamp.valueOf("2200-01-01 00:00:00").getTime
  val idColumn = "id"

  def generateFeatureLearningData(path: String): Unit = {
    val spark: SparkSession = SparkSession.builder().getOrCreate()
    materializeSourceFeatureLearningTable(spark).write.parquet(path + "/source")
    materializeTargetTable(spark).write.parquet(path + "/target")
  }

  def generateModelLearningData(path: String): Unit = {
    val spark: SparkSession = SparkSession.builder().getOrCreate()
    materializeTargetTable(spark).write.parquet(path + "/target")
    materializeSourceModelLearningTable(spark).write.parquet(path + "/source")
  }

  // materializeTargetTable and materializeSourceModelLearningTable are not included
  // in this report; the reported error comes from the wide random projection below.

  private def materializeSourceFeatureLearningTable(spark: SparkSession): DataFrame = {
    val sourceDF = spark.sqlContext.range(0, numberOfRows).withColumnRenamed("id", idColumn)
    val columns = sourceDF(idColumn) +: (0 until numberOfColumns)
      .flatMap(x => Seq(getTimeColumn(x), getNumberColumn(x), getCategoryColumn(x)))
    sourceDF.select(columns: _*)
  }

  private def getTimeColumn(seed: Int): Column = {
    val uniqueSeed = seed + numberOfColumns * 3
    rand(seed = uniqueSeed).multiply(maxEpoch - minEpoch).divide(1000).cast("long")
      .plus(minEpoch / 1000).cast(TimestampType).alias(s"time$seed")
  }

  private def getNumberColumn(seed: Int, namePrefix: String = "number"): Column = {
    val uniqueSeed = seed + numberOfColumns * 4
    randn(seed = uniqueSeed).alias(s"$namePrefix$seed")
  }

  private def getCategoryColumn(seed: Int): Column = {
    val uniqueSeed = seed + numberOfColumns * 4
    rand(seed = uniqueSeed).multiply(100).cast("int").alias(s"category$seed")
  }
}

object GenerateData {
  def main(args: Array[String]): Unit = {
    new RandomData(args(0).toInt, args(1).toInt).generateFeatureLearningData(args(2))
  }
}

{code}
Please package a jar and run as follows:
{code}
spark-submit --master yarn \
 --driver-memory 12g \
 --executor-memory 12g \
 --deploy-mode cluster \
 --class GenerateData \
 <path-to-packaged-jar> 100 5000 "hdfs:///tmp/parquet"
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22577) executor page blacklist status should update with TaskSet level blacklisting

2018-01-19 Thread Attila Zsolt Piros (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-22577:
---
Attachment: node_blacklisting_for_stage.png

> executor page blacklist status should update with TaskSet level blacklisting
> 
>
> Key: SPARK-22577
> URL: https://issues.apache.org/jira/browse/SPARK-22577
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.1
>Reporter: Thomas Graves
>Priority: Major
> Attachments: app_blacklisting.png, node_blacklisting_for_stage.png, 
> stage_blacklisting.png
>
>
> right now the executor blacklist status only updates with the 
> BlacklistTracker after a task set has finished and propagated the 
> blacklisting to the application level. We should change that to show at the 
> taskset level as well. Without this it can be very confusing to the user why 
> things aren't running.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23024) Spark ui about the contents of the form need to have hidden and show features, when the table records very much.

2018-01-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-23024:
-

Assignee: guoxiaolongzte

> Spark ui about the contents of the form need to have hidden and show 
> features, when the table records very much. 
> -
>
> Key: SPARK-23024
> URL: https://issues.apache.org/jira/browse/SPARK-23024
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Assignee: guoxiaolongzte
>Priority: Minor
> Fix For: 2.4.0
>
> Attachments: 1.png, 2.png
>
>
> Tables in the Spark UI need hide/show controls for their contents, because when 
> a table has very many records you sometimes do not care about them and just 
> want to see the next table, yet you have to drag the scroll bar for a long 
> time to reach it.
> For example, we currently have about 500 workers, but I just wanted to see the 
> logs in the Running Applications table; I had to scroll for a long time to 
> reach that table.
> To keep the behavior consistent, I modified the Master Page, Worker Page, Job 
> Page, Stage Page, Task Page, Configuration Page, Storage Page and Pool Page.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23024) Spark ui about the contents of the form need to have hidden and show features, when the table records very much.

2018-01-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23024.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20216
[https://github.com/apache/spark/pull/20216]

> Spark ui about the contents of the form need to have hidden and show 
> features, when the table records very much. 
> -
>
> Key: SPARK-23024
> URL: https://issues.apache.org/jira/browse/SPARK-23024
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Assignee: guoxiaolongzte
>Priority: Minor
> Fix For: 2.4.0
>
> Attachments: 1.png, 2.png
>
>
> Tables in the Spark UI need hide/show controls for their contents, because when 
> a table has very many records you sometimes do not care about them and just 
> want to see the next table, yet you have to drag the scroll bar for a long 
> time to reach it.
> For example, we currently have about 500 workers, but I just wanted to see the 
> logs in the Running Applications table; I had to scroll for a long time to 
> reach that table.
> To keep the behavior consistent, I modified the Master Page, Worker Page, Job 
> Page, Stage Page, Task Page, Configuration Page, Storage Page and Pool Page.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23121) When the Spark Streaming app is running for a period of time, the page is incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' and ui can not be

2018-01-19 Thread Sandor Murakozi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332300#comment-16332300
 ] 

Sandor Murakozi commented on SPARK-23121:
-

One issue is with displaying old jobs. Depending on how old a job is, it may or 
may not be displayed correctly.

The bigger issue is that the main jobs page can also be affected. 

> When the Spark Streaming app is running for a period of time, the page is 
> incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' 
> and ui can not be accessed.
> -
>
> Key: SPARK-23121
> URL: https://issues.apache.org/jira/browse/SPARK-23121
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Major
> Attachments: 1.png, 2.png
>
>
> When the Spark Streaming app is running for a period of time, the page is 
> incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' 
> and ui can not be accessed.
>  
> Test command:
> ./bin/spark-submit --class org.apache.spark.examples.streaming.HdfsWordCount 
> ./examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar /spark
>  
> The app is running for a period of time,  ui can not be accessed, please see 
> attachment.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23121) When the Spark Streaming app is running for a period of time, the page is incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' and ui can not be

2018-01-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332294#comment-16332294
 ] 

Sean Owen commented on SPARK-23121:
---

Yes that sounds right. But doesn't it just cause an error when displaying pages 
for old jobs? It would be an 'error' of some kind no matter what, whether a 404 
or "not found" message. It can be improved but didn't sound like it mattered 
beyond that.

> When the Spark Streaming app is running for a period of time, the page is 
> incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' 
> and ui can not be accessed.
> -
>
> Key: SPARK-23121
> URL: https://issues.apache.org/jira/browse/SPARK-23121
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Major
> Attachments: 1.png, 2.png
>
>
> When the Spark Streaming app is running for a period of time, the page is 
> incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' 
> and ui can not be accessed.
>  
> Test command:
> ./bin/spark-submit --class org.apache.spark.examples.streaming.HdfsWordCount 
> ./examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar /spark
>  
> The app is running for a period of time,  ui can not be accessed, please see 
> attachment.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23154) Document backwards compatibility guarantees for ML persistence

2018-01-19 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332252#comment-16332252
 ] 

Nick Pentreath commented on SPARK-23154:


SGTM

> Document backwards compatibility guarantees for ML persistence
> --
>
> Key: SPARK-23154
> URL: https://issues.apache.org/jira/browse/SPARK-23154
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
>
> We have (as far as I know) maintained backwards compatibility for ML 
> persistence, but this is not documented anywhere.  I'd like us to document it 
> (for spark.ml, not for spark.mllib).
> I'd recommend something like:
> {quote}
> In general, MLlib maintains backwards compatibility for ML persistence.  
> I.e., if you save an ML model or Pipeline in one version of Spark, then you 
> should be able to load it back and use it in a future version of Spark.  
> However, there are rare exceptions, described below.
> Model persistence: Is a model or Pipeline saved using Apache Spark ML 
> persistence in Spark version X loadable by Spark version Y?
> * Major versions: No guarantees, but best-effort.
> * Minor and patch versions: Yes; these are backwards compatible.
> * Note about the format: There are no guarantees for a stable persistence 
> format, but model loading itself is designed to be backwards compatible.
> Model behavior: Does a model or Pipeline in Spark version X behave 
> identically in Spark version Y?
> * Major versions: No guarantees, but best-effort.
> * Minor and patch versions: Identical behavior, except for bug fixes.
> For both model persistence and model behavior, any breaking changes across a 
> minor version or patch version are reported in the Spark version release 
> notes. If a breakage is not reported in release notes, then it should be 
> treated as a bug to be fixed.
> {quote}
> How does this sound?
> Note: We unfortunately don't have tests for backwards compatibility (which 
> has technical hurdles and can be discussed in [SPARK-15573]).  However, we 
> have made efforts to maintain it during PR review and Spark release QA, and 
> most users expect it.
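For concreteness, the save/load round trip this guarantee covers looks roughly like this 
in spark.ml (a sketch only; the pipeline stages, column names and paths are made up):
{code:java}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.HashingTF

// Fit and persist a Pipeline with Spark version X...
val pipeline = new Pipeline().setStages(Array(
  new HashingTF().setInputCol("words").setOutputCol("features"),
  new LogisticRegression().setMaxIter(10)))
val model = pipeline.fit(trainingDF)               // trainingDF assumed to have "words" and "label"
model.write.overwrite().save("/models/example")

// ...then load and use it with Spark version Y; per the proposal, minor and
// patch upgrades keep this working, while major upgrades are best-effort.
val reloaded = PipelineModel.load("/models/example")
val scored = reloaded.transform(testDF)            // testDF assumed to exist
{code}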



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23121) When the Spark Streaming app is running for a period of time, the page is incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' and ui can not be

2018-01-19 Thread Sandor Murakozi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332249#comment-16332249
 ] 

Sandor Murakozi commented on SPARK-23121:
-

[~guoxiaolongzte] found two separate problems, both triggered by having a high 
number of jobs/stages. In such a situation the store of the history server 
drops various objects to save memory. It may happen that the job itself is in 
the store, but its stages or the RDDOperationGraph are not. In such cases 
rendering of the all-jobs page and the individual job pages fails.

As a consequence, the jobs page may become inaccessible if the cluster 
processes many jobs, so I think the priority of this issue should be increased.

What do you think [~srowen] ?

> When the Spark Streaming app is running for a period of time, the page is 
> incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' 
> and ui can not be accessed.
> -
>
> Key: SPARK-23121
> URL: https://issues.apache.org/jira/browse/SPARK-23121
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Major
> Attachments: 1.png, 2.png
>
>
> When the Spark Streaming app is running for a period of time, the page is 
> incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' 
> and ui can not be accessed.
>  
> Test command:
> ./bin/spark-submit --class org.apache.spark.examples.streaming.HdfsWordCount 
> ./examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar /spark
>  
> The app is running for a period of time,  ui can not be accessed, please see 
> attachment.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23121) When the Spark Streaming app is running for a period of time, the page is incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' and ui can not be

2018-01-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332215#comment-16332215
 ] 

Apache Spark commented on SPARK-23121:
--

User 'smurakozi' has created a pull request for this issue:
https://github.com/apache/spark/pull/20330

> When the Spark Streaming app is running for a period of time, the page is 
> incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' 
> and ui can not be accessed.
> -
>
> Key: SPARK-23121
> URL: https://issues.apache.org/jira/browse/SPARK-23121
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Major
> Attachments: 1.png, 2.png
>
>
> When the Spark Streaming app is running for a period of time, the page is 
> incorrectly reported when accessing '/ jobs /' or '/ jobs / job /? Id = 13' 
> and ui can not be accessed.
>  
> Test command:
> ./bin/spark-submit --class org.apache.spark.examples.streaming.HdfsWordCount 
> ./examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar /spark
>  
> The app is running for a period of time,  ui can not be accessed, please see 
> attachment.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15467) Getting stack overflow when attempting to query a wide Dataset (>200 fields)

2018-01-19 Thread Artem Kalchenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332211#comment-16332211
 ] 

Artem Kalchenko commented on SPARK-15467:
-

I guess I'm still experiencing this issue with Spark 2.2
{noformat}
18/01/19 12:32:28 WARN Utils: Truncated the string representation of a plan 
since it was too large. This behavior can be adjusted by setting 
'spark.debug.maxToStringFields' in SparkEnv.conf.
Exception in thread "main" java.lang.StackOverflowError
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:370)
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541){noformat}

> Getting stack overflow when attempting to query a wide Dataset (>200 fields)
> 
>
> Key: SPARK-15467
> URL: https://issues.apache.org/jira/browse/SPARK-15467
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Don Drake
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.1.0
>
>
> This can be duplicated in a spark-shell, I am running Spark 2.0.0-preview.
> {code}
> import spark.implicits._
> case class Wide(
> val f0:String = "",
> val f1:String = "",
> val f2:String = "",
> val f3:String = "",
> val f4:String = "",
> val f5:String = "",
> val f6:String = "",
> val f7:String = "",
> val f8:String = "",
> val f9:String = "",
> val f10:String = "",
> val f11:String = "",
> val f12:String = "",
> val f13:String = "",
> val f14:String = "",
> val f15:String = "",
> val f16:String = "",
> val f17:String = "",
> val f18:String = "",
> val f19:String = "",
> val f20:String = "",
> val f21:String = "",
> val f22:String = "",
> val f23:String = "",
> val f24:String = "",
> val f25:String = "",
> val f26:String = "",
> val f27:String = "",
> val f28:String = "",
> val f29:String = "",
> val f30:String = "",
> val f31:String = "",
> val f32:String = "",
> val f33:String = "",
> val f34:String = "",
> val f35:String = "",
> val f36:String = "",
> val f37:String = "",
> val f38:String = "",
> val f39:String = "",
> val f40:String = "",
> val f41:String = "",
> val f42:String = "",
> val f43:String = "",
> val f44:String = "",
> val f45:String = "",
> val f46:String = "",
> val f47:String = "",
> val f48:String = "",
> val f49:String = "",
> val f50:String = "",
> val f51:String = "",
> val f52:String = "",
> val f53:String = "",
> val f54:String = "",
> val f55:String = "",
> val f56:String = "",
> val f57:String = "",
> val f58:String = "",
> val f59:String = "",
> val f60:String = "",
> val f61:String = "",
> val f62:String = "",
> val f63:String = "",
> val f64:String = "",
> val f65:String = "",
> val f66:String = "",
> val f67:String = "",
> val f68:String = "",
> val f69:String = "",
> val f70:String = "",
> val f71:String = "",
> val f72:String = "",
> val f73:String = "",
> val f74:String = "",
> val f75:String = "",
> val f76:String = "",
> val f77:String = "",
> val f78:String = "",
> val f79:String = "",
> val f80:String = "",
> val f81:String = "",
> val f82:String = "",
> val f83:String = "",
> val f84:String = "",
> val f85:String = "",
> val f86:String = "",
> val f87:String = "",
> val f88:String = "",
> val f89:String = "",
> val f90:String = "",
> val f91:String = "",
> val f92:String = "",
> val f93:String = "",
> val f94:String = "",
> val f95:String = "",
> val f96:String = "",
> val f97:String = "",
> val f98:String = "",
> val f99:String = "",
> val f100:String = "",
> val f101:String = "",
> val f102:String = "",
> val f103:String = "",
> val f104:String = "",
> val f105:String = "",
> val f106:String = "",
> val f107:String = "",
> val f108:String = "",
> val f109:String = "",
> val f110:String = "",
> val f111:String = "",
> val f112:String = "",
> val f113:String = "",
> val f114:String = "",
> val f115:String = "",
> val f116:String = "",
> val f117:String = "",
> val f118:String = "",
> val f119:String = "",
> val f120:String = "",
> val f121:String = "",
> val f122:String = "",
> val f123:String = "",
> val f124:String = "",
> val f125:String = "",
> val f126:String = "",
> val f127:String = "",
> val f128:String = "",
> val f129:String = "",
> val f130:String = "",
> val f131:String = "",
> val f132:String = "",
> val f133:String = "",
> val f134:String = "",
> val f135:String = "",
> val f136:String = "",
> val f137:String = "",
> val f138:String = "",
> val f139:String = "",
> val f140:String = "",
> val f141:String = "",
> val f142:String = "",
> val f143:String = "",
> val f144:String = "",
> val f145:String = "",
> val f146:String = "",
> val 

[jira] [Assigned] (SPARK-23089) "Unable to create operation log session directory" when parent directory not present

2018-01-19 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23089:
---

Assignee: Marco Gaido

> "Unable to create operation log session directory" when parent directory not 
> present
> 
>
> Key: SPARK-23089
> URL: https://issues.apache.org/jira/browse/SPARK-23089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: 
> /usr/hdp/2.6.3.0-235/spark2/jars/spark-hive-thriftserver_2.11-2.2.0.2.6.3.0-235.jar
> $ cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 7.4 (Maipo)
> $ ps aux|grep ^hive.*spark.*thrift
> hive 1468503  0.9  0.5 13319628 1411676 ?Sl   Jan15  10:18 
> /usr/java/default/bin/java -Dhdp.version=2.6.3.0-235 -cp 
> /usr/hdp/current/spark2-thriftserver/conf/:/usr/hdp/current/spark2-thriftserver/jars/*:/usr/hdp/current/hadoop-client/conf/
>  -Xmx2048m org.apache.spark.deploy.SparkSubmit --properties-file 
> /usr/hdp/current/spark2-thriftserver/conf/spark-thrift-sparkconf.conf --class 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --name Thrift 
> JDBC/ODBC Server spark-internal
>Reporter: Sean Roberts
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.3.0
>
>
> When creating a session directory, Thrift should create the parent directory 
> _(i.e. /tmp/hive/operation_logs)_ if it is not present.
> It's common for operators to clean up old and empty directories in /tmp, or 
> to have tools (systemd-tmpfiles or tmpwatch) that do it automatically.
> This was fixed in HIVE-12262 but not in Spark Thrift as seen by this:
> {code}18/01/15 14:22:49 WARN HiveSessionImpl: Unable to create operation log 
> session directory: 
> /tmp/hive/operation_logs/683a6318-adc4-42c4-b665-11dad14d7ec7{code}
> Resolved by manually creating /tmp/hive/operation_logs/
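The general fix pattern being requested (shown here as a sketch, not the actual Spark or 
Hive patch) is to create any missing parent directories instead of only the last level:
{code:java}
import java.io.File

// mkdirs() recreates /tmp/hive/operation_logs if an operator or tmpwatch removed it,
// whereas creating only the session directory itself would fail in that case.
// The child directory name is the per-session UUID from the log above.
val sessionLogDir = new File("/tmp/hive/operation_logs", "683a6318-adc4-42c4-b665-11dad14d7ec7")
if (!sessionLogDir.exists() && !sessionLogDir.mkdirs()) {
  sys.error(s"Unable to create operation log session directory: $sessionLogDir")
}
{code}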



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23089) "Unable to create operation log session directory" when parent directory not present

2018-01-19 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23089.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> "Unable to create operation log session directory" when parent directory not 
> present
> 
>
> Key: SPARK-23089
> URL: https://issues.apache.org/jira/browse/SPARK-23089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: 
> /usr/hdp/2.6.3.0-235/spark2/jars/spark-hive-thriftserver_2.11-2.2.0.2.6.3.0-235.jar
> $ cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 7.4 (Maipo)
> $ ps aux|grep ^hive.*spark.*thrift
> hive 1468503  0.9  0.5 13319628 1411676 ?Sl   Jan15  10:18 
> /usr/java/default/bin/java -Dhdp.version=2.6.3.0-235 -cp 
> /usr/hdp/current/spark2-thriftserver/conf/:/usr/hdp/current/spark2-thriftserver/jars/*:/usr/hdp/current/hadoop-client/conf/
>  -Xmx2048m org.apache.spark.deploy.SparkSubmit --properties-file 
> /usr/hdp/current/spark2-thriftserver/conf/spark-thrift-sparkconf.conf --class 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --name Thrift 
> JDBC/ODBC Server spark-internal
>Reporter: Sean Roberts
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.3.0
>
>
> When creating a session directory, Thrift should create the parent directory 
> _(i.e. /tmp/hive/operation_logs)_ if it is not present.
> It's common for operators to clean up old and empty directories in /tmp, or 
> to have tools (systemd-tmpfiles or tmpwatch) that do it automatically.
> This was fixed in HIVE-12262 but not in Spark Thrift as seen by this:
> {code}18/01/15 14:22:49 WARN HiveSessionImpl: Unable to create operation log 
> session directory: 
> /tmp/hive/operation_logs/683a6318-adc4-42c4-b665-11dad14d7ec7{code}
> Resolved by manually creating /tmp/hive/operation_logs/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7146) Should ML sharedParams be a public API?

2018-01-19 Thread Lucas Partridge (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332135#comment-16332135
 ] 

Lucas Partridge commented on SPARK-7146:


What's the semantic difference between HasFeaturesCol and HasInputCol, please?

> Should ML sharedParams be a public API?
> ---
>
> Key: SPARK-7146
> URL: https://issues.apache.org/jira/browse/SPARK-7146
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: holdenk
>Priority: Major
> Fix For: 2.3.0
>
>
> Proposal: Make most of the Param traits in sharedParams.scala public.  Mark 
> them as DeveloperApi.
> Pros:
> * Sharing the Param traits helps to encourage standardized Param names and 
> documentation.
> Cons:
> * Users have to be careful since parameters can have different meanings for 
> different algorithms.
> * If the shared Params are public, then implementations could test for the 
> traits.  It is unclear if we want users to rely on these traits, which are 
> somewhat experimental.
> Currently, the shared params are private.
> h3. UPDATED proposal
> * Some Params are clearly safe to make public.  We will do so.
> * Some Params could be made public but may require caveats in the trait doc.
> * Some Params have turned out not to be shared in practice.  We can move 
> those Params to the classes which use them.
> *Public shared params*:
> * I/O column params
> ** HasFeaturesCol
> ** HasInputCol
> ** HasInputCols
> ** HasLabelCol
> ** HasOutputCol
> ** HasPredictionCol
> ** HasProbabilityCol
> ** HasRawPredictionCol
> ** HasVarianceCol
> ** HasWeightCol
> * Algorithm settings
> ** HasCheckpointInterval
> ** HasElasticNetParam
> ** HasFitIntercept
> ** HasMaxIter
> ** HasRegParam
> ** HasSeed
> ** HasStandardization (less common)
> ** HasStepSize
> ** HasTol
> *Questionable params*:
> * HasHandleInvalid (only used in StringIndexer, but might be more widely used 
> later on)
> * HasSolver (used in LinearRegression and GeneralizedLinearRegression, but 
> same meaning as Optimizer in LDA)
> *Params to be removed from sharedParams*:
> * HasThreshold (only used in LogisticRegression)
> * HasThresholds (only used in ProbabilisticClassifier)
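For readers who have not looked at sharedParams.scala, the traits under discussion are small 
Param mix-ins roughly of this shape (a paraphrased sketch, not the exact Spark source):
{code:java}
import org.apache.spark.ml.param.{Param, Params}

// Each shared trait only declares the Param, its doc string and a getter;
// estimators and transformers mix in whichever traits apply to them.
trait HasInputCol extends Params {
  final val inputCol: Param[String] =
    new Param[String](this, "inputCol", "input column name")
  final def getInputCol: String = $(inputCol)
}

trait HasFeaturesCol extends Params {
  final val featuresCol: Param[String] =
    new Param[String](this, "featuresCol", "features column name")
  final def getFeaturesCol: String = $(featuresCol)
}
{code}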



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces

2018-01-19 Thread ANDY GUAN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332133#comment-16332133
 ] 

ANDY GUAN commented on SPARK-23148:
---

Looks like the same problem as 
[SPARK-21996|https://issues.apache.org/jira/browse/SPARK-21996?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel].
I got the problem fixed by making the following change in DataSourceScanExec.scala, around line 441:
{code:java}
// before
Seq(PartitionedFile(
  partition.values, file.getPath.toUri.toString, 0, file.getLen, hosts))

// after
Seq(PartitionedFile(
  partition.values, file.getPath.toString, 0, file.getLen, hosts))
{code}
Could you help make a pull request?

> spark.read.csv with multiline=true gives FileNotFoundException if path 
> contains spaces
> --
>
> Key: SPARK-23148
> URL: https://issues.apache.org/jira/browse/SPARK-23148
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bogdan Raducanu
>Priority: Major
>
> Repro code:
> {code:java}
> spark.range(10).write.csv("/tmp/a b c/a.csv")
> spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count
> 10
> spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count
> java.io.FileNotFoundException: File 
> file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv
>  does not exist
> {code}
> Trying to manually escape fails in a different place:
> {code}
> spark.read.option("multiLine", true).csv("/tmp/a%20b%20c/a.csv").count
> org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/tmp/a%20b%20c/a.csv;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:683)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23048) Update mllib docs to replace OneHotEncoder with OneHotEncoderEstimator

2018-01-19 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-23048:
--

Assignee: Liang-Chi Hsieh

> Update mllib docs to replace OneHotEncoder with OneHotEncoderEstimator 
> ---
>
> Key: SPARK-23048
> URL: https://issues.apache.org/jira/browse/SPARK-23048
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.3.0
>
>
> Since we're deprecating OneHotEncoder, we should update the docs to reference 
> its replacement, OneHotEncoderEstimator.
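For context, the replacement the docs should point to is used roughly like this (a sketch 
against the 2.3 API; the column names are invented):
{code:java}
import org.apache.spark.ml.feature.OneHotEncoderEstimator

// Unlike the deprecated OneHotEncoder, OneHotEncoderEstimator is an Estimator,
// so it is fit first and can encode several indexed columns at once.
val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array("categoryIndex1", "categoryIndex2"))
  .setOutputCols(Array("categoryVec1", "categoryVec2"))

val encoderModel = encoder.fit(indexedDF)       // indexedDF assumed to hold the index columns
val encoded = encoderModel.transform(indexedDF)
{code}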



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23048) Update mllib docs to replace OneHotEncoder with OneHotEncoderEstimator

2018-01-19 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23048.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20257
[https://github.com/apache/spark/pull/20257]

> Update mllib docs to replace OneHotEncoder with OneHotEncoderEstimator 
> ---
>
> Key: SPARK-23048
> URL: https://issues.apache.org/jira/browse/SPARK-23048
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.3.0
>
>
> Since we're deprecating OneHotEncoder, we should update the docs to reference 
> its replacement, OneHotEncoderEstimator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23127) Update FeatureHasher user guide for catCols parameter

2018-01-19 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23127.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20293
[https://github.com/apache/spark/pull/20293]

> Update FeatureHasher user guide for catCols parameter
> -
>
> Key: SPARK-23127
> URL: https://issues.apache.org/jira/browse/SPARK-23127
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>Priority: Major
> Fix For: 2.3.0
>
>
> SPARK-22801 added the {{categoricalCols}} parameter and updated the Scala and 
> Python doc, but did not update the user guide entry discussing feature 
> handling.
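For reference, the parameter in question is used along these lines (a sketch, not the 
user-guide wording; the column names are invented):
{code:java}
import org.apache.spark.ml.feature.FeatureHasher

// Numeric input columns are hashed as numeric features by default;
// categoricalCols forces selected numeric columns to be treated as categorical.
val hasher = new FeatureHasher()
  .setInputCols("intCode", "price", "country")
  .setCategoricalCols(Array("intCode"))
  .setOutputCol("features")

val featurized = hasher.transform(inputDF)      // inputDF assumed to contain the three columns
{code}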



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23127) Update FeatureHasher user guide for catCols parameter

2018-01-19 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-23127:
--

Assignee: Nick Pentreath

> Update FeatureHasher user guide for catCols parameter
> -
>
> Key: SPARK-23127
> URL: https://issues.apache.org/jira/browse/SPARK-23127
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>Priority: Major
>
> SPARK-22801 added the {{categoricalCols}} parameter and updated the Scala and 
> Python doc, but did not update the user guide entry discussing feature 
> handling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces

2018-01-19 Thread Bogdan Raducanu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332041#comment-16332041
 ] 

Bogdan Raducanu edited comment on SPARK-23148 at 1/19/18 10:18 AM:
---

I updated the description with the manual escape, if that is what you meant.


was (Author: bograd):
What do you mean?

> spark.read.csv with multiline=true gives FileNotFoundException if path 
> contains spaces
> --
>
> Key: SPARK-23148
> URL: https://issues.apache.org/jira/browse/SPARK-23148
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bogdan Raducanu
>Priority: Major
>
> Repro code:
> {code:java}
> spark.range(10).write.csv("/tmp/a b c/a.csv")
> spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count
> 10
> spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count
> java.io.FileNotFoundException: File 
> file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv
>  does not exist
> {code}
> Trying to manually escape fails in a different place:
> {code}
> spark.read.option("multiLine", true).csv("/tmp/a%20b%20c/a.csv").count
> org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/tmp/a%20b%20c/a.csv;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:683)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces

2018-01-19 Thread Bogdan Raducanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated SPARK-23148:

Description: 
Repro code:
{code:java}
spark.range(10).write.csv("/tmp/a b c/a.csv")
spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count
10
spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count
java.io.FileNotFoundException: File 
file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv
 does not exist
{code}

Trying to manually escape fails in a different place:
{code}
spark.read.option("multiLine", true).csv("/tmp/a%20b%20c/a.csv").count
org.apache.spark.sql.AnalysisException: Path does not exist: 
file:/tmp/a%20b%20c/a.csv;
  at 
org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:683)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
  at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:381)
{code}

  was:
Repro code:
{code:java}
spark.range(10).write.csv("/tmp/a b c/a.csv")
spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count
10
spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count
java.io.FileNotFoundException: File 
file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv
 does not exist
{code}


> spark.read.csv with multiline=true gives FileNotFoundException if path 
> contains spaces
> --
>
> Key: SPARK-23148
> URL: https://issues.apache.org/jira/browse/SPARK-23148
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bogdan Raducanu
>Priority: Major
>
> Repro code:
> {code:java}
> spark.range(10).write.csv("/tmp/a b c/a.csv")
> spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count
> 10
> spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count
> java.io.FileNotFoundException: File 
> file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv
>  does not exist
> {code}
> Trying to manually escape fails in a different place:
> {code}
> spark.read.option("multiLine", true).csv("/tmp/a%20b%20c/a.csv").count
> org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/tmp/a%20b%20c/a.csv;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:683)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces

2018-01-19 Thread Bogdan Raducanu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332041#comment-16332041
 ] 

Bogdan Raducanu commented on SPARK-23148:
-

What do you mean?

> spark.read.csv with multiline=true gives FileNotFoundException if path 
> contains spaces
> --
>
> Key: SPARK-23148
> URL: https://issues.apache.org/jira/browse/SPARK-23148
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bogdan Raducanu
>Priority: Major
>
> Repro code:
> {code:java}
> spark.range(10).write.csv("/tmp/a b c/a.csv")
> spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count
> 10
> spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count
> java.io.FileNotFoundException: File 
> file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv
>  does not exist
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23000) Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in Spark 2.3

2018-01-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331870#comment-16331870
 ] 

Apache Spark commented on SPARK-23000:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/20328

> Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in Spark 2.3
> -
>
> Key: SPARK-23000
> URL: https://issues.apache.org/jira/browse/SPARK-23000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-sbt-hadoop-2.6/
> The test suite DataSourceWithHiveMetastoreCatalogSuite of Branch 2.3 always 
> failed in hadoop 2.6 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org