[jira] [Assigned] (SPARK-11912) ml.feature.PCA minor refactor

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11912:


Assignee: Apache Spark

> ml.feature.PCA minor refactor
> -
>
> Key: SPARK-11912
> URL: https://issues.apache.org/jira/browse/SPARK-11912
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> Like SPARK-11852, k is params and we should save it under metadata/ rather 
> than both under data/ and metadata/. We construct mllib.feature.PCAModel 
> inside transform.






[jira] [Assigned] (SPARK-11912) ml.feature.PCA minor refactor

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11912:


Assignee: (was: Apache Spark)

> ml.feature.PCA minor refactor
> -
>
> Key: SPARK-11912
> URL: https://issues.apache.org/jira/browse/SPARK-11912
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> Like SPARK-11852, k is params and we should save it under metadata/ rather 
> than both under data/ and metadata/. We construct mllib.feature.PCAModel 
> inside transform.






[jira] [Commented] (SPARK-11912) ml.feature.PCA minor refactor

2015-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021428#comment-15021428
 ] 

Apache Spark commented on SPARK-11912:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/9897

> ml.feature.PCA minor refactor
> -
>
> Key: SPARK-11912
> URL: https://issues.apache.org/jira/browse/SPARK-11912
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> Like SPARK-11852, k is params and we should save it under metadata/ rather 
> than both under data/ and metadata/. We construct mllib.feature.PCAModel 
> inside transform.






[jira] [Commented] (SPARK-11757) Incorrect join output for joining two dataframes loaded from Parquet format

2015-11-22 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021534#comment-15021534
 ] 

Jeff Zhang commented on SPARK-11757:


I tried it on master; this issue seems to have been resolved.

> Incorrect join output for joining two dataframes loaded from Parquet format
> ---
>
> Key: SPARK-11757
> URL: https://issues.apache.org/jira/browse/SPARK-11757
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
> Environment: Python 2.7, Spark 1.5.0, Amazon linux ami 
> https://aws.amazon.com/amazon-linux-ami/2015.03-release-notes/
>Reporter: Petri Kärkäs
>  Labels: dataframe, emr, join, pyspark
>
> Reading in dataframes from Parquet format in S3 and then executing a join between 
> them fails when the join is invoked by column name. It works correctly if a join 
> condition is used instead:
> {code:none}
> sqlContext = SQLContext(sc)
> a = sqlContext.read.parquet('s3://path-to-data-a/')
> b = sqlContext.read.parquet('s3://path-to-data-b/')
> # result 0 rows
> c = a.join(b, on='id', how='left_outer')
> c.count() 
> # correct output
> d = a.join(b, a['id']==b['id'], how='left_outer')
> d.count() 
> {code}






[jira] [Resolved] (SPARK-11895) Rename and possibly update DatasetExample in mllib/examples

2015-11-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11895.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9873
[https://github.com/apache/spark/pull/9873]

> Rename and possibly update DatasetExample in mllib/examples
> ---
>
> Key: SPARK-11895
> URL: https://issues.apache.org/jira/browse/SPARK-11895
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
> Fix For: 1.6.0
>
>
> We used the name `Dataset` to refer to `SchemaRDD` in 1.2 in ML pipelines and 
> created this example file. Since `Dataset` has a new meaning in Spark 1.6, we 
> should rename it to avoid confusion.






[jira] [Updated] (SPARK-11902) Unhandled case in VectorAssembler#transform

2015-11-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11902:
--
Assignee: Benjamin Fradet

> Unhandled case in VectorAssembler#transform
> ---
>
> Key: SPARK-11902
> URL: https://issues.apache.org/jira/browse/SPARK-11902
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.2
>Reporter: Benjamin Fradet
>Assignee: Benjamin Fradet
>Priority: Trivial
> Fix For: 1.6.0
>
>
> I noticed that there is an unhandled case in the transform method of 
> VectorAssembler if one of the input columns doesn't have one of the supported 
> types DoubleType, NumericType, BooleanType or VectorUDT. 
> So, if you try to transform a column of StringType you get a cryptic 
> "scala.MatchError: StringType".
> Will submit a PR shortly
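
For readers following along, a minimal sketch of the kind of check involved (simplified; not the actual VectorAssembler code or the submitted patch, and VectorUDT is omitted for brevity) might look like this:

{code}
import org.apache.spark.sql.types._

object AssemblerTypeCheck {
  // Reject unsupported input column types with a descriptive error instead of
  // letting a bare scala.MatchError surface.
  def validate(field: StructField): Unit = field.dataType match {
    case DoubleType | BooleanType => ()   // usable as-is
    case _: NumericType           => ()   // numeric columns can be cast to Double
    case other =>
      throw new IllegalArgumentException(
        s"VectorAssembler does not support the ${other.simpleString} type of column ${field.name}")
  }
}

// AssemblerTypeCheck.validate(StructField("text", StringType))
// => IllegalArgumentException: VectorAssembler does not support the string type of column text
{code}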






[jira] [Resolved] (SPARK-11902) Unhandled case in VectorAssembler#transform

2015-11-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11902.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9885
[https://github.com/apache/spark/pull/9885]

> Unhandled case in VectorAssembler#transform
> ---
>
> Key: SPARK-11902
> URL: https://issues.apache.org/jira/browse/SPARK-11902
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.2
>Reporter: Benjamin Fradet
>Priority: Trivial
> Fix For: 1.6.0
>
>
> I noticed that there is an unhandled case in the transform method of 
> VectorAssembler if one of the input columns doesn't have one of the supported 
> types DoubleType, NumericType, BooleanType or VectorUDT. 
> So, if you try to transform a column of StringType you get a cryptic 
> "scala.MatchError: StringType".
> Will submit a PR shortly






[jira] [Updated] (SPARK-11902) Unhandled case in VectorAssembler#transform

2015-11-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11902:
--
Target Version/s: 1.6.0

> Unhandled case in VectorAssembler#transform
> ---
>
> Key: SPARK-11902
> URL: https://issues.apache.org/jira/browse/SPARK-11902
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.2
>Reporter: Benjamin Fradet
>Assignee: Benjamin Fradet
>Priority: Trivial
> Fix For: 1.6.0
>
>
> I noticed that there is an unhandled case in the transform method of 
> VectorAssembler if one of the input columns doesn't have one of the supported 
> types DoubleType, NumericType, BooleanType or VectorUDT. 
> So, if you try to transform a column of StringType you get a cryptic 
> "scala.MatchError: StringType".
> Will submit a PR shortly






[jira] [Assigned] (SPARK-11917) Add SQLContext#dropTempTable to PySpark

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11917:


Assignee: Apache Spark

> Add SQLContext#dropTempTable to PySpark
> ---
>
> Key: SPARK-11917
> URL: https://issues.apache.org/jira/browse/SPARK-11917
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> Seems there's no api to drop table in pyspark now






[jira] [Commented] (SPARK-11860) Invalid argument specification for registerFunction [Python]

2015-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021545#comment-15021545
 ] 

Apache Spark commented on SPARK-11860:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/9901

> Invalid argument specification for registerFunction [Python]
> 
>
> Key: SPARK-11860
> URL: https://issues.apache.org/jira/browse/SPARK-11860
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Affects Versions: 1.5.2
>Reporter: Tristan
>Priority: Minor
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> https://github.com/apache/spark/blob/branch-1.5/python/pyspark/sql/context.py#L171-L178
> Documentation for SQLContext.registerFunction specifies a lambda function as 
> input. This is false (it works fine with non-lambda functions). I believe 
> this is a typo based on the presence of 'samplingRatio' in the parameter docs:
> https://github.com/apache/spark/blob/branch-1.5/python/pyspark/sql/context.py#L178






[jira] [Assigned] (SPARK-11860) Invalid argument specification for registerFunction [Python]

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11860:


Assignee: (was: Apache Spark)

> Invalid argument specification for registerFunction [Python]
> 
>
> Key: SPARK-11860
> URL: https://issues.apache.org/jira/browse/SPARK-11860
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Affects Versions: 1.5.2
>Reporter: Tristan
>Priority: Minor
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> https://github.com/apache/spark/blob/branch-1.5/python/pyspark/sql/context.py#L171-L178
> Documentation for SQLContext.registerFunction specifies a lambda function as 
> input. This is false (it works fine with non-lambda functions). I believe 
> this is a typo based on the presence of 'samplingRatio' in the parameter docs:
> https://github.com/apache/spark/blob/branch-1.5/python/pyspark/sql/context.py#L178






[jira] [Assigned] (SPARK-11860) Invalid argument specification for registerFunction [Python]

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11860:


Assignee: Apache Spark

> Invalid argument specification for registerFunction [Python]
> 
>
> Key: SPARK-11860
> URL: https://issues.apache.org/jira/browse/SPARK-11860
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Affects Versions: 1.5.2
>Reporter: Tristan
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> https://github.com/apache/spark/blob/branch-1.5/python/pyspark/sql/context.py#L171-L178
> Documentation for SQLContext.registerFunction specifies a lambda function as 
> input. This is false (it works fine with non-lambda functions). I believe 
> this is a typo based on the presence of 'samplingRatio' in the parameter docs:
> https://github.com/apache/spark/blob/branch-1.5/python/pyspark/sql/context.py#L178






[jira] [Resolved] (SPARK-6791) Model export/import for spark.ml: CrossValidator

2015-11-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-6791.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9848
[https://github.com/apache/spark/pull/9848]

> Model export/import for spark.ml: CrossValidator
> 
>
> Key: SPARK-6791
> URL: https://issues.apache.org/jira/browse/SPARK-6791
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
> Fix For: 1.6.0
>
>
> Updated to be for CrossValidator only






[jira] [Updated] (SPARK-11912) ml.feature.PCA minor refactor

2015-11-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11912:
--
Target Version/s: 1.6.0

> ml.feature.PCA minor refactor
> -
>
> Key: SPARK-11912
> URL: https://issues.apache.org/jira/browse/SPARK-11912
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 1.6.0
>
>
> Like SPARK-11852, k is params and we should save it under metadata/ rather 
> than both under data/ and metadata/. Refactor the constructor of 
> ml.feature.PCAModel to take only pc but construct mllib.feature.PCAModel 
> inside transform.






[jira] [Updated] (SPARK-11912) ml.feature.PCA minor refactor

2015-11-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11912:
--
Assignee: Yanbo Liang

> ml.feature.PCA minor refactor
> -
>
> Key: SPARK-11912
> URL: https://issues.apache.org/jira/browse/SPARK-11912
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 1.6.0
>
>
> Like SPARK-11852, k is params and we should save it under metadata/ rather 
> than both under data/ and metadata/. Refactor the constructor of 
> ml.feature.PCAModel to take only pc but construct mllib.feature.PCAModel 
> inside transform.






[jira] [Resolved] (SPARK-11912) ml.feature.PCA minor refactor

2015-11-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11912.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9897
[https://github.com/apache/spark/pull/9897]

> ml.feature.PCA minor refactor
> -
>
> Key: SPARK-11912
> URL: https://issues.apache.org/jira/browse/SPARK-11912
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
> Fix For: 1.6.0
>
>
> Like SPARK-11852, k is params and we should save it under metadata/ rather 
> than both under data/ and metadata/. Refactor the constructor of 
> ml.feature.PCAModel to take only pc but construct mllib.feature.PCAModel 
> inside transform.






[jira] [Assigned] (SPARK-11916) Expression TRIM/LTRIM/RTRIM to support specific trim word

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11916:


Assignee: Apache Spark

> Expression TRIM/LTRIM/RTRIM to support specific trim word
> -
>
> Key: SPARK-11916
> URL: https://issues.apache.org/jira/browse/SPARK-11916
> Project: Spark
>  Issue Type: Improvement
>Reporter: Adrian Wang
>Assignee: Apache Spark
>Priority: Minor
>
> supports expressions like `trim('xxxabcxxx', 'x')`






[jira] [Commented] (SPARK-11916) Expression TRIM/LTRIM/RTRIM to support specific trim word

2015-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021606#comment-15021606
 ] 

Apache Spark commented on SPARK-11916:
--

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/9902

> Expression TRIM/LTRIM/RTRIM to support specific trim word
> -
>
> Key: SPARK-11916
> URL: https://issues.apache.org/jira/browse/SPARK-11916
> Project: Spark
>  Issue Type: Improvement
>Reporter: Adrian Wang
>Priority: Minor
>
> supports expressions like `trim('xxxabcxxx', 'x')`






[jira] [Assigned] (SPARK-11916) Expression TRIM/LTRIM/RTRIM to support specific trim word

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11916:


Assignee: (was: Apache Spark)

> Expression TRIM/LTRIM/RTRIM to support specific trim word
> -
>
> Key: SPARK-11916
> URL: https://issues.apache.org/jira/browse/SPARK-11916
> Project: Spark
>  Issue Type: Improvement
>Reporter: Adrian Wang
>Priority: Minor
>
> supports expressions like `trim('xxxabcxxx', 'x')`






[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib

2015-11-22 Thread Sen Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021527#comment-15021527
 ] 

Sen Fang commented on SPARK-2336:
-

I finally took a crack at the hybrid spill tree for kNN, and the results so far 
appear promising. For anyone who is still interested, you can find it as 
a spark package at: https://github.com/saurfang/spark-knn

The implementation is written for the ml API and scales well in both the 
number of observations and the number of vector dimensions. The KNN itself is 
flexible, and the package comes with KNNClassifier and KNNRegression for 
(optionally weighted) classification and regression.

There are a few implementation details I am still trying to iron out. Otherwise, 
I look forward to benchmarking it against other implementations such as KNN-join, 
KD-Tree, and LSH.

> Approximate k-NN Models for MLLib
> -
>
> Key: SPARK-2336
> URL: https://issues.apache.org/jira/browse/SPARK-2336
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: clustering, features
>
> After tackling the general k-Nearest Neighbor model as per 
> https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
> also offer approximate k-Nearest Neighbor. A promising approach would involve 
> building a kd-tree variant within each partition, a la
> http://www.autonlab.org/autonweb/14714.html?branch=1=2
> This could offer a simple non-linear ML model that can label new data with 
> much lower latency than the plain-vanilla kNN versions.






[jira] [Resolved] (SPARK-11835) Add a menu to the documentation of MLlib

2015-11-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11835.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9826
[https://github.com/apache/spark/pull/9826]

> Add a menu to the documentation of MLlib
> 
>
> Key: SPARK-11835
> URL: https://issues.apache.org/jira/browse/SPARK-11835
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, MLlib
>Affects Versions: 1.5.1
>Reporter: Tim Hunter
>Assignee: Tim Hunter
> Fix For: 1.6.0
>
> Attachments: Screen Shot 2015-11-18 at 4.50.45 PM.png
>
>
> Right now, the table of contents gets generated on a page-by-page basis, 
> which makes it hard to navigate between different topics in a project. We 
> should make use of the empty space on the left of the documentation to put a 
> navigation menu.
> A picture is worth a thousand words:






[jira] [Assigned] (SPARK-11917) Add SQLContext#dropTempTable to PySpark

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11917:


Assignee: (was: Apache Spark)

> Add SQLContext#dropTempTable to PySpark
> ---
>
> Key: SPARK-11917
> URL: https://issues.apache.org/jira/browse/SPARK-11917
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>Priority: Minor
>
> Seems there's no api to drop table in pyspark now






[jira] [Commented] (SPARK-11917) Add SQLContext#dropTempTable to PySpark

2015-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021625#comment-15021625
 ] 

Apache Spark commented on SPARK-11917:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/9903

> Add SQLContext#dropTempTable to PySpark
> ---
>
> Key: SPARK-11917
> URL: https://issues.apache.org/jira/browse/SPARK-11917
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>Priority: Minor
>
> Seems there's no api to drop table in pyspark now






[jira] [Created] (SPARK-11917) Add SQLContext#dropTempTable to PySpark

2015-11-22 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-11917:
--

 Summary: Add SQLContext#dropTempTable to PySpark
 Key: SPARK-11917
 URL: https://issues.apache.org/jira/browse/SPARK-11917
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Jeff Zhang
Priority: Minor


Seems there's no api to drop table in pyspark now






[jira] [Assigned] (SPARK-11894) Incorrect results are returned when using null

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11894:


Assignee: (was: Apache Spark)

> Incorrect results are returned when using null
> --
>
> Key: SPARK-11894
> URL: https://issues.apache.org/jira/browse/SPARK-11894
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>
> In DataSet APIs, the following two datasets are the same. 
>   Seq((new java.lang.Integer(0), "1"), (new java.lang.Integer(22), 
> "2")).toDS()
>   Seq((null.asInstanceOf[java.lang.Integer], "1"), (new 
> java.lang.Integer(22), "2")).toDS()
> Note: java.lang.Integer is Nullable. 
> It could generate an incorrect result. For example, 
> val ds1 = Seq((null.asInstanceOf[java.lang.Integer], "1"), (new 
> java.lang.Integer(22), "2")).toDS()
> val ds2 = Seq((null.asInstanceOf[java.lang.Integer], "1"), (new 
> java.lang.Integer(22), "2")).toDS()//toDF("key", "value").as('df2)
> val res1 = ds1.joinWith(ds2, lit(true)).collect()
> The expected result should be 
> ((null,1),(null,1))
> ((22,2),(null,1))
> ((null,1),(22,2))
> ((22,2),(22,2))
> The actual result is 
> ((0,1),(0,1))
> ((22,2),(0,1))
> ((0,1),(22,2))
> ((22,2),(22,2))






[jira] [Commented] (SPARK-11894) Incorrect results are returned when using null

2015-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021644#comment-15021644
 ] 

Apache Spark commented on SPARK-11894:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/9904

> Incorrect results are returned when using null
> --
>
> Key: SPARK-11894
> URL: https://issues.apache.org/jira/browse/SPARK-11894
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>
> In DataSet APIs, the following two datasets are the same. 
>   Seq((new java.lang.Integer(0), "1"), (new java.lang.Integer(22), 
> "2")).toDS()
>   Seq((null.asInstanceOf[java.lang.Integer], "1"), (new 
> java.lang.Integer(22), "2")).toDS()
> Note: java.lang.Integer is Nullable. 
> It could generate an incorrect result. For example, 
> val ds1 = Seq((null.asInstanceOf[java.lang.Integer], "1"), (new 
> java.lang.Integer(22), "2")).toDS()
> val ds2 = Seq((null.asInstanceOf[java.lang.Integer], "1"), (new 
> java.lang.Integer(22), "2")).toDS()//toDF("key", "value").as('df2)
> val res1 = ds1.joinWith(ds2, lit(true)).collect()
> The expected result should be 
> ((null,1),(null,1))
> ((22,2),(null,1))
> ((null,1),(22,2))
> ((22,2),(22,2))
> The actual result is 
> ((0,1),(0,1))
> ((22,2),(0,1))
> ((0,1),(22,2))
> ((22,2),(22,2))






[jira] [Assigned] (SPARK-11894) Incorrect results are returned when using null

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11894:


Assignee: Apache Spark

> Incorrect results are returned when using null
> --
>
> Key: SPARK-11894
> URL: https://issues.apache.org/jira/browse/SPARK-11894
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> In DataSet APIs, the following two datasets are the same. 
>   Seq((new java.lang.Integer(0), "1"), (new java.lang.Integer(22), 
> "2")).toDS()
>   Seq((null.asInstanceOf[java.lang.Integer], "1"), (new 
> java.lang.Integer(22), "2")).toDS()
> Note: java.lang.Integer is Nullable. 
> It could generate an incorrect result. For example, 
> val ds1 = Seq((null.asInstanceOf[java.lang.Integer], "1"), (new 
> java.lang.Integer(22), "2")).toDS()
> val ds2 = Seq((null.asInstanceOf[java.lang.Integer], "1"), (new 
> java.lang.Integer(22), "2")).toDS()//toDF("key", "value").as('df2)
> val res1 = ds1.joinWith(ds2, lit(true)).collect()
> The expected result should be 
> ((null,1),(null,1))
> ((22,2),(null,1))
> ((null,1),(22,2))
> ((22,2),(22,2))
> The actual result is 
> ((0,1),(0,1))
> ((22,2),(0,1))
> ((0,1),(22,2))
> ((22,2),(22,2))






[jira] [Created] (SPARK-11916) Expression TRIM/LTRIM/RTRIM to support specific trim word

2015-11-22 Thread Adrian Wang (JIRA)
Adrian Wang created SPARK-11916:
---

 Summary: Expression TRIM/LTRIM/RTRIM to support specific trim word
 Key: SPARK-11916
 URL: https://issues.apache.org/jira/browse/SPARK-11916
 Project: Spark
  Issue Type: Improvement
Reporter: Adrian Wang
Priority: Minor


supports expressions like `trim('xxxabcxxx', 'x')`
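
As a plain-Scala illustration of the requested semantics (not the SQL expression implementation itself), trimming a caller-specified character from both ends could look like:

{code}
// Strip a specific character from both ends of a string, e.g. trim('xxxabcxxx', 'x') => "abc".
def trimChar(s: String, c: Char): String =
  s.dropWhile(_ == c).reverse.dropWhile(_ == c).reverse

assert(trimChar("xxxabcxxx", 'x') == "abc")
assert(trimChar("abc", 'x') == "abc")
{code}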






[jira] [Commented] (SPARK-11909) Spark Standalone's master URL accepts URLs without port (assuming default 7077)

2015-11-22 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021669#comment-15021669
 ] 

Saisai Shao commented on SPARK-11909:
-

The master prints the master URL in the web UI and in its log. Since the master is a 
daemon process, it is not a good fit to print it to the console.

Also, as [~srowen] suggested, it is better for the user to specify the port number 
explicitly. The port also distinguishes whether you're submitting a Spark 
application over the binary protocol (7077) or REST (6066); if it could be omitted, 
it would be hard for Spark to decide which port you actually want to 
submit to.
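
For concreteness, the fallback the reporter proposes below would amount to roughly the following sketch (simplified; not Spark's actual Utils.extractHostPortFromSparkUrl):

{code}
import java.net.URI

object MasterUrl {
  val DefaultStandalonePort = 7077

  // Parse "spark://host[:port]" and fall back to 7077 when the port is omitted.
  def hostPort(masterUrl: String): (String, Int) = {
    val uri = new URI(masterUrl)
    require(uri.getScheme == "spark" && uri.getHost != null,
      s"Invalid master URL: $masterUrl")
    val port = if (uri.getPort == -1) DefaultStandalonePort else uri.getPort
    (uri.getHost, port)
  }
}

// MasterUrl.hostPort("spark://localhost")      // ("localhost", 7077)
// MasterUrl.hostPort("spark://localhost:6066") // ("localhost", 6066)
{code}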

> Spark Standalone's master URL accepts URLs without port (assuming default 
> 7077)
> ---
>
> Key: SPARK-11909
> URL: https://issues.apache.org/jira/browse/SPARK-11909
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> It's currently impossible to use a {{spark://localhost}} URL for Spark 
> Standalone's master. With this feature supported, there would be less to learn to 
> get started with the mode (and hence better user friendliness).
> I think a no-port master URL should be supported, assuming the default port 
> {{7077}}.
> {code}
> org.apache.spark.SparkException: Invalid master URL: spark://localhost
>   at 
> org.apache.spark.util.Utils$.extractHostPortFromSparkUrl(Utils.scala:2088)
>   at org.apache.spark.rpc.RpcAddress$.fromSparkURL(RpcAddress.scala:47)
>   at 
> org.apache.spark.deploy.client.AppClient$$anonfun$1.apply(AppClient.scala:48)
>   at 
> org.apache.spark.deploy.client.AppClient$$anonfun$1.apply(AppClient.scala:48)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at org.apache.spark.deploy.client.AppClient.(AppClient.scala:48)
>   at 
> org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.start(SparkDeploySchedulerBackend.scala:93)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
>   at org.apache.spark.SparkContext.(SparkContext.scala:530)
> {code}






[jira] [Updated] (SPARK-11604) ML 1.6 QA: API: Python API coverage

2015-11-22 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11604:

Description: 
For new public APIs added to MLlib, we need to check the generated HTML doc and 
compare the Scala & Python versions.  We need to track:
* Inconsistency: Do class/method/parameter names match?
* Docs: Is the Python doc missing or just a stub?  We want the Python doc to be 
as complete as the Scala doc.
* API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental.  These must be recorded and added in the 
Migration Guide for this release.
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, 
please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python, to be added in the next release cycle.  
Please use a *separate* JIRA (linked below) for this list of to-do items.

* Inconsistency:
** ml.classification SPARK-11815 SPARK-11820

* Docs:
** ml.classification SPARK-11875

  was:
For new public APIs added to MLlib, we need to check the generated HTML doc and 
compare the Scala & Python versions.  We need to track:
* Inconsistency: Do class/method/parameter names match?
* Docs: Is the Python doc missing or just a stub?  We want the Python doc to be 
as complete as the Scala doc.
* API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental.  These must be recorded and added in the 
Migration Guide for this release.
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, 
please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python, to be added in the next release cycle.  
Please use a *separate* JIRA (linked below) for this list of to-do items.

* Inconsistency:
** ml.classification 


> ML 1.6 QA: API: Python API coverage
> ---
>
> Key: SPARK-11604
> URL: https://issues.apache.org/jira/browse/SPARK-11604
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> For new public APIs added to MLlib, we need to check the generated HTML doc 
> and compare the Scala & Python versions.  We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> Please use a *separate* JIRA (linked below) for this list of to-do items.
> * Inconsistency:
> ** ml.classification SPARK-11815 SPARK-11820
> * Docs:
> ** ml.classification SPARK-11875






[jira] [Updated] (SPARK-11604) ML 1.6 QA: API: Python API coverage

2015-11-22 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11604:

Description: 
For new public APIs added to MLlib, we need to check the generated HTML doc and 
compare the Scala & Python versions.  We need to track:
* Inconsistency: Do class/method/parameter names match?
* Docs: Is the Python doc missing or just a stub?  We want the Python doc to be 
as complete as the Scala doc.
* API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental.  These must be recorded and added in the 
Migration Guide for this release.
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, 
please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python, to be added in the next release cycle.  
Please use a *separate* JIRA (linked below) for this list of to-do items.

List the found issues:
* Inconsistency:
** ml.classification SPARK-11815 SPARK-11820

* Docs:
** ml.classification SPARK-11875

  was:
For new public APIs added to MLlib, we need to check the generated HTML doc and 
compare the Scala & Python versions.  We need to track:
* Inconsistency: Do class/method/parameter names match?
* Docs: Is the Python doc missing or just a stub?  We want the Python doc to be 
as complete as the Scala doc.
* API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental.  These must be recorded and added in the 
Migration Guide for this release.
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, 
please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python, to be added in the next release cycle.  
Please use a *separate* JIRA (linked below) for this list of to-do items.

* Inconsistency:
** ml.classification SPARK-11815 SPARK-11820

* Docs:
** ml.classification SPARK-11875


> ML 1.6 QA: API: Python API coverage
> ---
>
> Key: SPARK-11604
> URL: https://issues.apache.org/jira/browse/SPARK-11604
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> For new public APIs added to MLlib, we need to check the generated HTML doc 
> and compare the Scala & Python versions.  We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> Please use a *separate* JIRA (linked below) for this list of to-do items.
> List the found issues:
> * Inconsistency:
> ** ml.classification SPARK-11815 SPARK-11820
> * Docs:
> ** ml.classification SPARK-11875






[jira] [Updated] (SPARK-11912) ml.feature.PCA minor refactor

2015-11-22 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11912:

Description: Like SPARK-11852, k is params and we should save it under 
metadata/ rather than both under data/ and metadata/. Refactor the constructor 
of ml.feature.PCAModel to take only pc but construct mllib.feature.PCAModel 
inside transform  (was: Like SPARK-11852, k is params and we should save it 
under metadata/ rather than both under data/ and metadata/. We construct 
mllib.feature.PCAModel inside transform.)

> ml.feature.PCA minor refactor
> -
>
> Key: SPARK-11912
> URL: https://issues.apache.org/jira/browse/SPARK-11912
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> Like SPARK-11852, k is params and we should save it under metadata/ rather 
> than both under data/ and metadata/. Refactor the constructor of 
> ml.feature.PCAModel to take only pc but construct mllib.feature.PCAModel 
> inside transform






[jira] [Updated] (SPARK-11912) ml.feature.PCA minor refactor

2015-11-22 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11912:

Description: Like SPARK-11852, k is params and we should save it under 
metadata/ rather than both under data/ and metadata/. Refactor the constructor 
of ml.feature.PCAModel to take only pc but construct mllib.feature.PCAModel 
inside transform.  (was: Like SPARK-11852, k is params and we should save it 
under metadata/ rather than both under data/ and metadata/. Refactor the 
constructor of ml.feature.PCAModel to take only pc but construct 
mllib.feature.PCAModel inside transform)

> ml.feature.PCA minor refactor
> -
>
> Key: SPARK-11912
> URL: https://issues.apache.org/jira/browse/SPARK-11912
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> Like SPARK-11852, k is params and we should save it under metadata/ rather 
> than both under data/ and metadata/. Refactor the constructor of 
> ml.feature.PCAModel to take only pc but construct mllib.feature.PCAModel 
> inside transform.






[jira] [Commented] (SPARK-11619) cannot use UDTF in DataFrame.selectExpr

2015-11-22 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021509#comment-15021509
 ] 

Wenchen Fan commented on SPARK-11619:
-

Actually, it's this line: 
https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L689

When we use `selectExpr`, we pass an `UnresolvedFunction` to `DataFrame.select` 
and fall into the last case. One workaround is to add special handling for UDTFs, as 
we did for `explode` (and `json_tuple` in 1.6), and wrap them with `MultiAlias`.

Another workaround is to use `expr`, for example 
`df.select(expr("explode(a)").as(Nil))`. I think `selectExpr` is no longer 
needed now that we have the `expr` function.
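
A minimal spark-shell sketch of that second workaround (assuming Spark 1.5/1.6 with `sqlContext` and its implicits in scope):

{code}
import org.apache.spark.sql.functions.expr
import sqlContext.implicits._

val df = Seq((Map("1" -> 1), 1)).toDF("a", "b")

// df.selectExpr("explode(a)")                // fails: expects multiple names for Explode
df.select(expr("explode(a)").as(Nil)).show()  // works: produces key/value columns from the map
{code}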

> cannot use UDTF in DataFrame.selectExpr
> ---
>
> Key: SPARK-11619
> URL: https://issues.apache.org/jira/browse/SPARK-11619
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Priority: Minor
>
> Currently if use UDTF like `explode`, `json_tuple` in `DataFrame.selectExpr`, 
> it will be parsed into `UnresolvedFunction` first, and then alias it with 
> `expr.prettyString`. However, UDTF may need MultiAlias so we will get error 
> if we run:
> {code}
> val df = Seq((Map("1" -> 1), 1)).toDF("a", "b")
> df.selectExpr("explode(a)").show()
> {code}
> [info]   org.apache.spark.sql.AnalysisException: Expect multiple names given 
> for org.apache.spark.sql.catalyst.expressions.Explode,
> [info] but only single name ''explode(a)' specified;






[jira] [Commented] (SPARK-11903) Deprecate make-distribution.sh --skip-java-test

2015-11-22 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021511#comment-15021511
 ] 

Patrick Wendell commented on SPARK-11903:
-

I think it's simply dead code. SKIP_JAVA_TEST related to a check we did 
regarding whether Java 6 was being used instead of Java 7. It doesn't have 
anything to do with unit tests. Spark now requires Java 7, so the test has been 
removed, but the parser still handles that variable. It was just an omission 
not deleted as part of SPARK-7733 
(https://github.com/apache/spark/commit/e84815dc333a69368a48e0152f02934980768a14)
 /cc [~srowen].

> Deprecate make-distribution.sh --skip-java-test
> ---
>
> Key: SPARK-11903
> URL: https://issues.apache.org/jira/browse/SPARK-11903
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Nicholas Chammas
>Priority: Minor
>
> The {{\-\-skip-java-test}} option to {{make-distribution.sh}} [does not 
> appear to be 
> used|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L72-L73],
>  and tests are [always 
> skipped|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L170].
>  Searching the Spark codebase for {{SKIP_JAVA_TEST}} yields no results other 
> than [this 
> one|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L72-L73].
> If this option is not needed, we should deprecate and eventually remove it.






[jira] [Commented] (SPARK-11906) Speculation Tasks Cause ProgressBar UI Overflow

2015-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021228#comment-15021228
 ] 

Apache Spark commented on SPARK-11906:
--

User 'saurfang' has created a pull request for this issue:
https://github.com/apache/spark/pull/9896

> Speculation Tasks Cause ProgressBar UI Overflow
> ---
>
> Key: SPARK-11906
> URL: https://issues.apache.org/jira/browse/SPARK-11906
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Sen Fang
>Priority: Trivial
>
> When there are speculative tasks in stage, the started tasks + completed 
> tasks can be greater than total number of tasks. It leads to the started 
> progress block to overflow to next line. Visually the light blue progress 
> block becomes no longer visible when it happens.
> The fix should be as trivial as to cap the number of started task by total - 
> completed task.
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L322
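
A one-line sketch of the proposed cap (illustrative only, not the actual UIUtils patch):

{code}
// Never show more "started" tasks than remain, so started + completed <= total
// even when speculative tasks push the raw started count past the remainder.
def startedToShow(started: Int, completed: Int, total: Int): Int =
  math.min(started, total - completed)

// e.g. startedToShow(started = 12, completed = 95, total = 100) == 5
{code}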






[jira] [Assigned] (SPARK-11906) Speculation Tasks Cause ProgressBar UI Overflow

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11906:


Assignee: (was: Apache Spark)

> Speculation Tasks Cause ProgressBar UI Overflow
> ---
>
> Key: SPARK-11906
> URL: https://issues.apache.org/jira/browse/SPARK-11906
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Sen Fang
>Priority: Trivial
>
> When there are speculative tasks in stage, the started tasks + completed 
> tasks can be greater than total number of tasks. It leads to the started 
> progress block to overflow to next line. Visually the light blue progress 
> block becomes no longer visible when it happens.
> The fix should be as trivial as to cap the number of started task by total - 
> completed task.
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L322






[jira] [Assigned] (SPARK-11906) Speculation Tasks Cause ProgressBar UI Overflow

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11906:


Assignee: Apache Spark

> Speculation Tasks Cause ProgressBar UI Overflow
> ---
>
> Key: SPARK-11906
> URL: https://issues.apache.org/jira/browse/SPARK-11906
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Sen Fang
>Assignee: Apache Spark
>Priority: Trivial
>
> When there are speculative tasks in stage, the started tasks + completed 
> tasks can be greater than total number of tasks. It leads to the started 
> progress block to overflow to next line. Visually the light blue progress 
> block becomes no longer visible when it happens.
> The fix should be as trivial as to cap the number of started task by total - 
> completed task.
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L322






[jira] [Commented] (SPARK-11730) Feature Importance for GBT

2015-11-22 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021301#comment-15021301
 ] 

Joseph K. Bradley commented on SPARK-11730:
---

I wrote that note since I did not have time to research what people do for 
GBTs.  I'd be Ok with matching sklearn's implementation, though it would be 
great if we could find academic work indicating a "right" way to handle GBTs.  
In particular, I am not sure if trees' contributions should be weighted 
differently (based on the learning process) or if they should just use the tree 
weights (resembling how prediction works).

> Feature Importance for GBT
> --
>
> Key: SPARK-11730
> URL: https://issues.apache.org/jira/browse/SPARK-11730
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Brian Webb
>
> Random Forests have feature importance, but GBT do not. It would be great if 
> we can add feature importance to GBT as well. Perhaps the code in Random 
> Forests can be refactored to apply to both types of ensembles.
> See https://issues.apache.org/jira/browse/SPARK-5133






[jira] [Commented] (SPARK-10129) math function: stddev_samp

2015-11-22 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021319#comment-15021319
 ] 

Yin Huai commented on SPARK-10129:
--

We have stddev_samp in agg functions. Should we resolve this? Or, it is stddev 
for a list of numbers?

> math function: stddev_samp
> --
>
> Key: SPARK-10129
> URL: https://issues.apache.org/jira/browse/SPARK-10129
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>
> Use the STDDEV_SAMP function to return the standard deviation of a sample 
> variance.
> http://www-01.ibm.com/support/knowledgecenter/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.bigsql.doc/doc/bsql_stdev_samp.html






[jira] [Comment Edited] (SPARK-10129) math function: stddev_samp

2015-11-22 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021319#comment-15021319
 ] 

Yin Huai edited comment on SPARK-10129 at 11/23/15 1:16 AM:


We have stddev_samp in agg functions. Should we resolve this? Or, it is stddev 
for a value of an array type?


was (Author: yhuai):
We have stddev_samp in agg functions. Should we resolve this? Or, it is stddev 
for a list of numbers?

> math function: stddev_samp
> --
>
> Key: SPARK-10129
> URL: https://issues.apache.org/jira/browse/SPARK-10129
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>
> Use the STDDEV_SAMP function to return the standard deviation of a sample 
> variance.
> http://www-01.ibm.com/support/knowledgecenter/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.bigsql.doc/doc/bsql_stdev_samp.html






[jira] [Commented] (SPARK-9506) DataFrames Postgresql JDBC unable to support most of the Postgresql's Data Type

2015-11-22 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021499#comment-15021499
 ] 

Wenchen Fan commented on SPARK-9506:


I think it's not a workaround, but the right thing to do. We already have a 
`PostgreDialect` and we can add more support for non-standard sql types.
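
For reference, extending a JDBC dialect roughly takes this shape (a sketch against the public JdbcDialect API; the type mappings chosen here are placeholders, not a tested change to PostgreDialect):

{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types._

// Map a few non-standard PostgreSQL types to Catalyst types instead of failing
// with "Unsupported type". The chosen target types are illustrative only.
object MyPostgresDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    if (sqlType == Types.OTHER && typeName == "json") Some(StringType)
    else if (typeName == "abstime") Some(TimestampType)
    else None  // defer to the built-in dialect and default mapping
  }
}

// JdbcDialects.registerDialect(MyPostgresDialect)
{code}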

> DataFrames Postgresql JDBC unable to support most of the Postgresql's Data 
> Type
> ---
>
> Key: SPARK-9506
> URL: https://issues.apache.org/jira/browse/SPARK-9506
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Reporter: Pangjiu
> Attachments: code.PNG, log.PNG, tables_structures.PNG
>
>
> Hi All,
> I have an issue using the PostgreSQL JDBC driver with sqlContext for PostgreSQL 
> data types such as abstime, character varying[], int2vector, json, etc.
> The exceptions are "Unsupported type 2003" and "Unsupported type ".
> Below is the code:
> Class.forName("org.postgresql.Driver").newInstance()
> val url = "jdbc:postgresql://localhost:5432/sample?user=posgres=xxx"
> val driver = "org.postgresql.Driver"
> val output = sqlContext.load("jdbc", Map(
>   "url" -> url,
>   "driver" -> driver,
>   "dbtable" -> "(SELECT `ID`, `NAME` FROM `agent`) AS tableA"
> ))
> Hope SQLContext can support all these data types.
> Thanks.






[jira] [Comment Edited] (SPARK-11903) Deprecate make-distribution.sh --skip-java-test

2015-11-22 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021511#comment-15021511
 ] 

Patrick Wendell edited comment on SPARK-11903 at 11/23/15 4:29 AM:
---

I think it's simply dead code that should be deleted. SKIP_JAVA_TEST was related to 
a check we did regarding whether Java 6 was being used instead of Java 7. It 
doesn't have anything to do with unit tests. Spark now requires Java 7, so the 
check has been removed, but the parser still handles that variable. It was simply 
an oversight that it was not deleted as part of SPARK-7733 
(https://github.com/apache/spark/commit/e84815dc333a69368a48e0152f02934980768a14)
 /cc [~srowen].


was (Author: pwendell):
I think it's simply dead code. SKIP_JAVA_TEST related to a check we did 
regarding whether Java 6 was being used instead of Java 7. It doesn't have 
anything to do with unit tests. Spark now requires Java 7, so the test has been 
removed, but the parser still handles that variable. It was just an omission 
not deleted as part of SPARK-7733 
(https://github.com/apache/spark/commit/e84815dc333a69368a48e0152f02934980768a14)
 /cc [~srowen].

> Deprecate make-distribution.sh --skip-java-test
> ---
>
> Key: SPARK-11903
> URL: https://issues.apache.org/jira/browse/SPARK-11903
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Nicholas Chammas
>Priority: Minor
>
> The {{\-\-skip-java-test}} option to {{make-distribution.sh}} [does not 
> appear to be 
> used|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L72-L73],
>  and tests are [always 
> skipped|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L170].
>  Searching the Spark codebase for {{SKIP_JAVA_TEST}} yields no results other 
> than [this 
> one|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L72-L73].
> If this option is not needed, we should deprecate and eventually remove it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11206) Support SQL UI on the history server

2015-11-22 Thread Carson Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021363#comment-15021363
 ] 

Carson Wang commented on SPARK-11206:
-

To support the SQL UI on the history server:
1. I added an onOtherEvent method to the SparkListener trait and post all 
SQL-related events to the same event bus.
2. Two SQL events, SparkListenerSQLExecutionStart and 
SparkListenerSQLExecutionEnd, are defined in the sql module.
3. The new SQL events are written to the event log using Jackson.
4. A new trait, SparkHistoryListenerFactory, is added to allow the history server 
to feed events to the SQL history listener. The SQL implementation is loaded at 
runtime using java.util.ServiceLoader.
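
A rough sketch of the pattern described above (names mirror the proposal but are 
simplified; they are not guaranteed to match the final Spark API):

{code}
import java.util.ServiceLoader
import scala.collection.JavaConverters._

trait SparkListenerEvent

case class SparkListenerSQLExecutionStart(executionId: Long, description: String,
    time: Long) extends SparkListenerEvent
case class SparkListenerSQLExecutionEnd(executionId: Long, time: Long)
  extends SparkListenerEvent

trait SparkListener {
  // New hook: non-core events (e.g. the SQL events above) are delivered here.
  def onOtherEvent(event: SparkListenerEvent): Unit = {}
}

// Factory the history server discovers at runtime, so core needs no
// compile-time dependency on the sql module.
trait SparkHistoryListenerFactory {
  def createListeners(): Seq[SparkListener]
}

object HistoryServerSketch {
  def loadSqlListeners(): Seq[SparkListener] =
    ServiceLoader.load(classOf[SparkHistoryListenerFactory])
      .asScala.toSeq.flatMap(_.createListeners())
}
{code}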

> Support SQL UI on the history server
> 
>
> Key: SPARK-11206
> URL: https://issues.apache.org/jira/browse/SPARK-11206
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Web UI
>Reporter: Carson Wang
>
> On the live web UI, there is a SQL tab which provides valuable information 
> for the SQL query. But once the workload is finished, we won't see the SQL 
> tab on the history server. It will be helpful if we support SQL UI on the 
> history server so we can analyze it even after its execution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11915) Fix flaky python test pyspark.sql.group

2015-11-22 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-11915:

Description: 
The Python test pyspark.sql.group can fail due to the order of items in the 
returned array. We should sort the aggregation results to make the test stable.


  was:
The python test pyspark.sql.group fails due to items' order in returned array. 
We should fix it.



> Fix flaky python test pyspark.sql.group
> ---
>
> Key: SPARK-11915
> URL: https://issues.apache.org/jira/browse/SPARK-11915
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Liang-Chi Hsieh
>
> The python test pyspark.sql.group will fail due to items' order in returned 
> array. We should sort the aggregation results to make the test stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11915) Fix flaky python test pyspark.sql.group

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11915:


Assignee: Apache Spark

> Fix flaky python test pyspark.sql.group
> ---
>
> Key: SPARK-11915
> URL: https://issues.apache.org/jira/browse/SPARK-11915
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> The python test pyspark.sql.group will fail due to items' order in returned 
> array. We should sort the aggregation results to make the test stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11915) Fix flaky python test pyspark.sql.group

2015-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021498#comment-15021498
 ] 

Apache Spark commented on SPARK-11915:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/9900

> Fix flaky python test pyspark.sql.group
> ---
>
> Key: SPARK-11915
> URL: https://issues.apache.org/jira/browse/SPARK-11915
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Liang-Chi Hsieh
>
> The python test pyspark.sql.group will fail due to items' order in returned 
> array. We should sort the aggregation results to make the test stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11915) Fix flaky python test pyspark.sql.group

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11915:


Assignee: (was: Apache Spark)

> Fix flaky python test pyspark.sql.group
> ---
>
> Key: SPARK-11915
> URL: https://issues.apache.org/jira/browse/SPARK-11915
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Liang-Chi Hsieh
>
> The python test pyspark.sql.group will fail due to items' order in returned 
> array. We should sort the aggregation results to make the test stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11861) Feature importances for decision trees

2015-11-22 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021297#comment-15021297
 ] 

Joseph K. Bradley commented on SPARK-11861:
---

Exposing the single-tree API for this sounds fine to me.  I hid it originally 
because I did not have the time to research whether people trusted importance 
values from single trees.  Do you know if other libraries provide this?
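
For context, a minimal sketch of the existing ensemble API whose output a 
single-tree featureImportances would mirror (the single-tree accessor itself is 
hypothetical until this issue is resolved):

{code}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.sql.DataFrame

// Sketch only: the ensemble importances that already exist; a single-tree
// model would expose the same kind of vector once this issue is resolved.
def showImportances(training: DataFrame): Unit = {
  val rf = new RandomForestClassifier()
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setNumTrees(20)
  val model = rf.fit(training)
  // Vector of per-feature importances, normalized to sum to 1.
  println(model.featureImportances)
}
{code}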

> Feature importances for decision trees
> --
>
> Key: SPARK-11861
> URL: https://issues.apache.org/jira/browse/SPARK-11861
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Feature importances should be added to decision trees leveraging the feature 
> importance implementation for Random Forests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11604) ML 1.6 QA: API: Python API coverage

2015-11-22 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11604:

Description: 
For new public APIs added to MLlib, we need to check the generated HTML doc and 
compare the Scala & Python versions.  We need to track:
* Inconsistency: Do class/method/parameter names match?
* Docs: Is the Python doc missing or just a stub?  We want the Python doc to be 
as complete as the Scala doc.
* API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental.  These must be recorded and added in the 
Migration Guide for this release.
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, 
please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python, to be added in the next release cycle.  
Please use a *separate* JIRA (linked below) for this list of to-do items.

* Inconsistency:
** ml.classification 

  was:
For new public APIs added to MLlib, we need to check the generated HTML doc and 
compare the Scala & Python versions.  We need to track:
* Inconsistency: Do class/method/parameter names match?
* Docs: Is the Python doc missing or just a stub?  We want the Python doc to be 
as complete as the Scala doc.
* API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental.  These must be recorded and added in the 
Migration Guide for this release.
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, 
please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python, to be added in the next release cycle.  
Please use a *separate* JIRA (linked below) for this list of to-do items.


> ML 1.6 QA: API: Python API coverage
> ---
>
> Key: SPARK-11604
> URL: https://issues.apache.org/jira/browse/SPARK-11604
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> For new public APIs added to MLlib, we need to check the generated HTML doc 
> and compare the Scala & Python versions.  We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> Please use a *separate* JIRA (linked below) for this list of to-do items.
> * Inconsistency:
> ** ml.classification 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11913) support typed aggregate for complex buffer schema

2015-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021454#comment-15021454
 ] 

Apache Spark commented on SPARK-11913:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/9898

> support typed aggregate for complex buffer schema
> -
>
> Key: SPARK-11913
> URL: https://issues.apache.org/jira/browse/SPARK-11913
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11913) support typed aggregate for complex buffer schema

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11913:


Assignee: Apache Spark

> support typed aggregate for complex buffer schema
> -
>
> Key: SPARK-11913
> URL: https://issues.apache.org/jira/browse/SPARK-11913
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11913) support typed aggregate for complex buffer schema

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11913:


Assignee: (was: Apache Spark)

> support typed aggregate for complex buffer schema
> -
>
> Key: SPARK-11913
> URL: https://issues.apache.org/jira/browse/SPARK-11913
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11600) Spark MLlib 1.6 QA umbrella

2015-11-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11600:
--
Description: 
This JIRA lists tasks for the next MLlib release's QA period.

h2. API

* Check binary API compatibility (SPARK-11601)
* Audit new public APIs (from the generated html doc)
** Scala (SPARK-11602)
** Java compatibility (SPARK-11605)
** Python coverage (SPARK-11604)
* Check Experimental, DeveloperApi tags (SPARK-11603)

h2. Algorithms and performance

*Performance*
* _List any other missing performance tests from spark-perf here_
* ALS.recommendAll (SPARK-7457)
* perf-tests in Python (SPARK-7539)
* perf-tests for transformers (SPARK-2838)
* MultilayerPerceptron (SPARK-11911)

h2. Documentation and example code

* For new algorithms, create JIRAs for updating the user guide (SPARK-11606)
* For major components, create JIRAs for example code (SPARK-9670)
* Update Programming Guide for 1.6 (towards end of QA) (SPARK-11608)
* Update website (SPARK-11607)
* Merge duplicate content under examples/ (SPARK-11685)

  was:
This JIRA lists tasks for the next MLlib release's QA period.

h2. API

* Check binary API compatibility (SPARK-11601)
* Audit new public APIs (from the generated html doc)
** Scala (SPARK-11602)
** Java compatibility (SPARK-11605)
** Python coverage (SPARK-11604)
* Check Experimental, DeveloperApi tags (SPARK-11603)

h2. Algorithms and performance

*Performance*
* _List any other missing performance tests from spark-perf here_
* ALS.recommendAll (SPARK-7457)
* perf-tests in Python (SPARK-7539)
* perf-tests for transformers (SPARK-2838)

h2. Documentation and example code

* For new algorithms, create JIRAs for updating the user guide (SPARK-11606)
* For major components, create JIRAs for example code (SPARK-9670)
* Update Programming Guide for 1.6 (towards end of QA) (SPARK-11608)
* Update website (SPARK-11607)
* Merge duplicate content under examples/ (SPARK-11685)


> Spark MLlib 1.6 QA umbrella
> ---
>
> Key: SPARK-11600
> URL: https://issues.apache.org/jira/browse/SPARK-11600
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This JIRA lists tasks for the next MLlib release's QA period.
> h2. API
> * Check binary API compatibility (SPARK-11601)
> * Audit new public APIs (from the generated html doc)
> ** Scala (SPARK-11602)
> ** Java compatibility (SPARK-11605)
> ** Python coverage (SPARK-11604)
> * Check Experimental, DeveloperApi tags (SPARK-11603)
> h2. Algorithms and performance
> *Performance*
> * _List any other missing performance tests from spark-perf here_
> * ALS.recommendAll (SPARK-7457)
> * perf-tests in Python (SPARK-7539)
> * perf-tests for transformers (SPARK-2838)
> * MultilayerPerceptron (SPARK-11911)
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide (SPARK-11606)
> * For major components, create JIRAs for example code (SPARK-9670)
> * Update Programming Guide for 1.6 (towards end of QA) (SPARK-11608)
> * Update website (SPARK-11607)
> * Merge duplicate content under examples/ (SPARK-11685)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11911) spark-perf test for MultilayerPerceptron

2015-11-22 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-11911:
-

 Summary: spark-perf test for MultilayerPerceptron
 Key: SPARK-11911
 URL: https://issues.apache.org/jira/browse/SPARK-11911
 Project: Spark
  Issue Type: Test
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor


Create a test in spark-perf for MultilayerPerceptron



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11913) support typed aggregate for complex buffer schema

2015-11-22 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-11913:
---

 Summary: support typed aggregate for complex buffer schema
 Key: SPARK-11913
 URL: https://issues.apache.org/jira/browse/SPARK-11913
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11915) Fix flaky python test pyspark.sql.group

2015-11-22 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-11915:
---

 Summary: Fix flaky python test pyspark.sql.group
 Key: SPARK-11915
 URL: https://issues.apache.org/jira/browse/SPARK-11915
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Reporter: Liang-Chi Hsieh


The python test pyspark.sql.group fails due to items' order in returned array. 
We should fix it.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11912) ml.feature.PCA minor refactor

2015-11-22 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-11912:
---

 Summary: ml.feature.PCA minor refactor
 Key: SPARK-11912
 URL: https://issues.apache.org/jira/browse/SPARK-11912
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Yanbo Liang
Priority: Minor


Like SPARK-11852, k is a param and we should save it under metadata/ rather than 
under both data/ and metadata/. We construct mllib.feature.PCAModel inside 
transform.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11914) [SQL] Support coalesce and repartition in Dataset APIs

2015-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021478#comment-15021478
 ] 

Apache Spark commented on SPARK-11914:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/9899

> [SQL] Support coalesce and repartition in Dataset APIs
> --
>
> Key: SPARK-11914
> URL: https://issues.apache.org/jira/browse/SPARK-11914
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>
> repartition: Returns a new [[Dataset]] that has exactly `numPartitions` 
> partitions.
> coalesce: Returns a new [[Dataset]] that has exactly `numPartitions` 
> partitions. Similar to coalesce defined on an [[RDD]], this operation results 
> in a narrow dependency, e.g. if you go from 1000 partitions to 100 
> partitions, there will not be a shuffle, instead each of the 100 new 
> partitions will claim 10 of the current partitions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11914) [SQL] Support coalesce and repartition in Dataset APIs

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11914:


Assignee: (was: Apache Spark)

> [SQL] Support coalesce and repartition in Dataset APIs
> --
>
> Key: SPARK-11914
> URL: https://issues.apache.org/jira/browse/SPARK-11914
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>
> repartition: Returns a new [[Dataset]] that has exactly `numPartitions` 
> partitions.
> coalesce: Returns a new [[Dataset]] that has exactly `numPartitions` 
> partitions. Similar to coalesce defined on an [[RDD]], this operation results 
> in a narrow dependency, e.g. if you go from 1000 partitions to 100 
> partitions, there will not be a shuffle, instead each of the 100 new 
> partitions will claim 10 of the current partitions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11914) [SQL] Support coalesce and repartition in Dataset APIs

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11914:


Assignee: Apache Spark

> [SQL] Support coalesce and repartition in Dataset APIs
> --
>
> Key: SPARK-11914
> URL: https://issues.apache.org/jira/browse/SPARK-11914
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> repartition: Returns a new [[Dataset]] that has exactly `numPartitions` 
> partitions.
> coalesce: Returns a new [[Dataset]] that has exactly `numPartitions` 
> partitions. Similar to coalesce defined on an [[RDD]], this operation results 
> in a narrow dependency, e.g. if you go from 1000 partitions to 100 
> partitions, there will not be a shuffle, instead each of the 100 new 
> partitions will claim 10 of the current partitions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11914) [SQL] Support coalesce and repartition in Dataset APIs

2015-11-22 Thread Xiao Li (JIRA)
Xiao Li created SPARK-11914:
---

 Summary: [SQL] Support coalesce and repartition in Dataset APIs
 Key: SPARK-11914
 URL: https://issues.apache.org/jira/browse/SPARK-11914
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.0
Reporter: Xiao Li


repartition: Returns a new [[Dataset]] that has exactly `numPartitions` 
partitions.

coalesce: Returns a new [[Dataset]] that has exactly `numPartitions` 
partitions. Similar to coalesce defined on an [[RDD]], this operation results 
in a narrow dependency; e.g. if you go from 1000 partitions to 100 partitions, 
there will not be a shuffle; instead, each of the 100 new partitions will claim 
10 of the current partitions.
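
A usage sketch under the assumption that the Dataset methods mirror the existing 
DataFrame/RDD semantics (method availability depends on the linked pull request):

{code}
import org.apache.spark.sql.SQLContext

// Usage sketch only; assumes repartition/coalesce behave as proposed above.
def partitioningExample(sqlContext: SQLContext): Unit = {
  import sqlContext.implicits._
  val ds = sqlContext.range(0, 1000).as[Long]

  val wide = ds.repartition(100)  // full shuffle to exactly 100 partitions
  val narrow = wide.coalesce(10)  // narrow dependency: no shuffle, 100 -> 10

  println(wide.rdd.partitions.length)    // expected: 100
  println(narrow.rdd.partitions.length)  // expected: 10
}
{code}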



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9506) DataFrames Postgresql JDBC unable to support most of the Postgresql's Data Type

2015-11-22 Thread Marius Van Niekerk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021500#comment-15021500
 ] 

Marius Van Niekerk commented on SPARK-9506:
---

Quite a few additional types are supported in 1.6.

> DataFrames Postgresql JDBC unable to support most of the Postgresql's Data 
> Type
> ---
>
> Key: SPARK-9506
> URL: https://issues.apache.org/jira/browse/SPARK-9506
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Reporter: Pangjiu
> Attachments: code.PNG, log.PNG, tables_structures.PNG
>
>
> Hi All,
> I have an issue using the PostgreSQL JDBC driver with sqlContext for PostgreSQL 
> data types such as abstime, character varying[], int2vector, json, etc.
> The exceptions are "Unsupported type 2003" and "Unsupported type ".
> Below is the code:
> Class.forName("org.postgresql.Driver").newInstance()
> val url = "jdbc:postgresql://localhost:5432/sample?user=posgres=xxx"
> val driver = "org.postgresql.Driver"
> val output = { sqlContext.load("jdbc", Map 
>   (
>   "url" -> url,
>   "driver" -> driver,
>   "dbtable" -> "(SELECT `ID`, `NAME` FROM 
> `agent`) AS tableA "
>   )
>   )
> }
> Hope SQL Context can support all the data types.
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11907) Allowing errors as values in DataFrames (like 'Either Left/Right')

2015-11-22 Thread Tycho Grouwstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tycho Grouwstra updated SPARK-11907:

Description: 
I like Spark, but one thing I find funny about it is that it is picky about 
circumstantial errors. For one, given the following:

{code}
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rows = (1,"a") :: (2,"b") :: (3,"c") :: (0,"d") :: Nil
val df = sqlContext.createDataFrame(sc.parallelize(rows)).toDF("num","let")
val div = udf[Double, Integer](10 / _)
df.withColumn("div", div(col("num"))).show()
{code}

... the job fails with a `java.lang.ArithmeticException: / by zero`.

The example is trivial, but my point is: if one thing goes wrong and the rest goes 
right, why throw out the baby with the bathwater when you could show both what 
went wrong and what went right?

Instead, I would propose allowing raised Exceptions to be used as resulting values, 
not unlike how one might store 'bad' results using Either Left/Right 
constructions in Scala/Haskell (which I suppose would not currently work in 
DFs, lacking serializability), or cells containing errors in MS Excel.

As a solution, I would propose a DataFrame subclass (?) using a variant of 
NullableColumnBuilder, e.g. ErrorableColumnBuilder (/ SafeColumnBuilder?).
NullableColumnBuilder currently explains its workings as follows:

{code}
/**
 * A stackable trait used for building byte buffer for a column containing null 
values.  Memory
 * layout of the final byte buffer is:
 * {{{
 *.--- Null count N (4 bytes)
 *|   .--- Null positions (4 x N bytes, empty if null count is 
zero)
 *|   | .- Non-null elements
 *V   V V
 *   +---+-+-+
 *   |   | ... | ... ... |
 *   +---+-+-+
 * }}}
 */
{code}

This might be extended by adding a further section storing Throwables (or null) 
for the bad values in question (alt: store count/positions separately from null 
ones so null values would not need to be stored). 

Don't get me wrong, there is nothing wrong with throwing exceptions (or catching 
them for that matter). Rather, I see use cases both for "do it right or bust" and 
for the explorative "show me what happens if I try this operation on these values" 
-- not unlike how languages such as Ruby/Elixir might distinguish unsafe methods 
marked with a bang ('!') from their safe variants that should not throw global 
exceptions.

I'm sort of new here but would be glad to get some opinions on this idea.


  was:
I like Spark, but one thing I find funny about it is that it is picky about 
circumstantial errors. For one, given the following:

[code]
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rows = (1,"a") :: (2,"b") :: (3,"c") :: (0,"d") :: Nil
val df = sqlContext.createDataFrame(sc.parallelize(rows)).toDF("num","let")
val div = udf[Double, Integer](10 / _)
df.withColumn("div", div(col("num"))).show()
[/code]

... the job fails with a `java.lang.ArithmeticException: / by zero`.

The example is trivial, but my point is, if one thing goes wrong, the rest goes 
right, why throw out the baby with the bathwater when you could both show what 
went wrong as well as went right?

Instead, I would propose allowing to use raised Exceptions as resulting values, 
not unlike how one might store 'bad' results using Either Left/Right 
constructions in Scala/Haskell (which I suppose would not currently work in 
DFs, lacking serializability), or cells containing errors in MS Excel.

As a solution, I would propose a DataFrame subclass (?) using a variant of 
NullableColumnBuilder, e.g. ErrorableColumnBuilder (/ SafeColumnBuilder?).
NullableColumnBuilder currently explains its workings as follows:

[code]
/**
 * A stackable trait used for building byte buffer for a column containing null 
values.  Memory
 * layout of the final byte buffer is:
 * {{{
 *.--- Null count N (4 bytes)
 *|   .--- Null positions (4 x N bytes, empty if null count is 
zero)
 *|   | .- Non-null elements
 *V   V V
 *   +---+-+-+
 *   |   | ... | ... ... |
 *   +---+-+-+
 * }}}
 */
[/code]

This might be extended by adding a further section storing Throwables (or null) 
for the bad values in question (alt: store count/positions separately from null 
ones so null values would not need to be stored). 

Don't get me wrong, there is nothing with throwing exceptions (or catching them 
for that matter). Rather, I see a use cases for both "do it right or bust" vs. 
the explorative "show me what happens if I try this operation on these values" 
-- not unlike how languages as Ruby/Elixir might distinguish unsafe methods 
using a bang ('!') from their safe variants that should not throw global 

[jira] [Created] (SPARK-11907) Allowing errors as values in DataFrames (like 'Either Left/Right')

2015-11-22 Thread Tycho Grouwstra (JIRA)
Tycho Grouwstra created SPARK-11907:
---

 Summary: Allowing errors as values in DataFrames (like 'Either 
Left/Right')
 Key: SPARK-11907
 URL: https://issues.apache.org/jira/browse/SPARK-11907
 Project: Spark
  Issue Type: Wish
  Components: SQL
Reporter: Tycho Grouwstra


I like Spark, but one thing I find funny about it is that it is picky about 
circumstantial errors. For one, given the following:

```
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rows = (1,"a") :: (2,"b") :: (3,"c") :: (0,"d") :: Nil
val df = sqlContext.createDataFrame(sc.parallelize(rows)).toDF("num","let")
val div = udf[Double, Integer](10 / _)
df.withColumn("div", div(col("num"))).show()
```

... the job fails with a `java.lang.ArithmeticException: / by zero`.

The example is trivial, but my point is: if one thing goes wrong and the rest goes 
right, why throw out the baby with the bathwater when you could show both what 
went wrong and what went right?

Instead, I would propose allowing raised Exceptions to be used as resulting values, 
not unlike how one might store 'bad' results using Either Left/Right 
constructions in Scala/Haskell (which I suppose would not currently work in 
DFs, lacking serializability), or cells containing errors in MS Excel.

As a solution, I would propose a DataFrame subclass (?) using a variant of 
NullableColumnBuilder, e.g. ErrorableColumnBuilder (/ SafeColumnBuilder?).
NullableColumnBuilder currently explains its workings as follows:

```
/**
 * A stackable trait used for building byte buffer for a column containing null 
values.  Memory
 * layout of the final byte buffer is:
 * {{{
 *.--- Null count N (4 bytes)
 *|   .--- Null positions (4 x N bytes, empty if null count is 
zero)
 *|   | .- Non-null elements
 *V   V V
 *   +---+-+-+
 *   |   | ... | ... ... |
 *   +---+-+-+
 * }}}
 */
```

This might be extended by adding a further section storing Throwables (or null) 
for the bad values in question (alt: store count/positions separately from null 
ones so null values would not need to be stored). 

Don't get me wrong, there is nothing wrong with throwing exceptions (or catching 
them for that matter). Rather, I see use cases both for "do it right or bust" and 
for the explorative "show me what happens if I try this operation on these values" 
-- not unlike how languages such as Ruby/Elixir might distinguish unsafe methods 
marked with a bang ('!') from their safe variants that should not throw global 
exceptions.

I'm sort of new here but would be glad to get some opinions on this idea.
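
As a point of comparison, here is a minimal sketch of what can already be 
approximated at the UDF level today by carrying the failure as a value. This does 
not implement the proposed ErrorableColumnBuilder; the function name and the 
(value, error) encoding are illustrative assumptions.

{code}
import scala.util.{Failure, Success, Try}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{col, udf}

// Workaround-style sketch: each row keeps either a result or its exception
// text, instead of the whole job failing on the first bad value.
def divisionWithErrors(sqlContext: SQLContext): Unit = {
  import sqlContext.implicits._
  val df = Seq((1, "a"), (2, "b"), (3, "c"), (0, "d")).toDF("num", "let")

  val safeDiv = udf { (n: Int) =>
    Try(10 / n) match {
      case Success(v) => (Some(v.toDouble), None: Option[String])
      case Failure(e) => (None: Option[Double], Some(e.toString))
    }
  }

  df.withColumn("div", safeDiv(col("num"))).show()
}
{code}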




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11907) Allowing errors as values in DataFrames (like 'Either Left/Right')

2015-11-22 Thread Tycho Grouwstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tycho Grouwstra updated SPARK-11907:

Description: 
I like Spark, but one thing I find funny about it is that it is picky about 
circumstantial errors. For one, given the following:

[code]
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rows = (1,"a") :: (2,"b") :: (3,"c") :: (0,"d") :: Nil
val df = sqlContext.createDataFrame(sc.parallelize(rows)).toDF("num","let")
val div = udf[Double, Integer](10 / _)
df.withColumn("div", div(col("num"))).show()
[/code]

... the job fails with a `java.lang.ArithmeticException: / by zero`.

The example is trivial, but my point is, if one thing goes wrong, the rest goes 
right, why throw out the baby with the bathwater when you could both show what 
went wrong as well as went right?

Instead, I would propose allowing to use raised Exceptions as resulting values, 
not unlike how one might store 'bad' results using Either Left/Right 
constructions in Scala/Haskell (which I suppose would not currently work in 
DFs, lacking serializability), or cells containing errors in MS Excel.

As a solution, I would propose a DataFrame subclass (?) using a variant of 
NullableColumnBuilder, e.g. ErrorableColumnBuilder (/ SafeColumnBuilder?).
NullableColumnBuilder currently explains its workings as follows:

[code]
/**
 * A stackable trait used for building byte buffer for a column containing null 
values.  Memory
 * layout of the final byte buffer is:
 * {{{
 *.--- Null count N (4 bytes)
 *|   .--- Null positions (4 x N bytes, empty if null count is 
zero)
 *|   | .- Non-null elements
 *V   V V
 *   +---+-+-+
 *   |   | ... | ... ... |
 *   +---+-+-+
 * }}}
 */
[/code]

This might be extended by adding a further section storing Throwables (or null) 
for the bad values in question (alt: store count/positions separately from null 
ones so null values would not need to be stored). 

Don't get me wrong, there is nothing with throwing exceptions (or catching them 
for that matter). Rather, I see a use cases for both "do it right or bust" vs. 
the explorative "show me what happens if I try this operation on these values" 
-- not unlike how languages as Ruby/Elixir might distinguish unsafe methods 
using a bang ('!') from their safe variants that should not throw global 
exceptions.

I'm sort of new here but would be glad to get some opinions on this idea.


  was:
I like Spark, but one thing I find funny about it is that it is picky about 
circumstantial errors. For one, given the following:

```
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rows = (1,"a") :: (2,"b") :: (3,"c") :: (0,"d") :: Nil
val df = sqlContext.createDataFrame(sc.parallelize(rows)).toDF("num","let")
val div = udf[Double, Integer](10 / _)
df.withColumn("div", div(col("num"))).show()
```

... the job fails with a `java.lang.ArithmeticException: / by zero`.

The example is trivial, but my point is, if one thing goes wrong, the rest goes 
right, why throw out the baby with the bathwater when you could both show what 
went wrong as well as went right?

Instead, I would propose allowing to use raised Exceptions as resulting values, 
not unlike how one might store 'bad' results using Either Left/Right 
constructions in Scala/Haskell (which I suppose would not currently work in 
DFs, lacking serializability), or cells containing errors in MS Excel.

As a solution, I would propose a DataFrame subclass (?) using a variant of 
NullableColumnBuilder, e.g. ErrorableColumnBuilder (/ SafeColumnBuilder?).
NullableColumnBuilder currently explains its workings as follows:

```
/**
 * A stackable trait used for building byte buffer for a column containing null 
values.  Memory
 * layout of the final byte buffer is:
 * {{{
 *.--- Null count N (4 bytes)
 *|   .--- Null positions (4 x N bytes, empty if null count is 
zero)
 *|   | .- Non-null elements
 *V   V V
 *   +---+-+-+
 *   |   | ... | ... ... |
 *   +---+-+-+
 * }}}
 */
```

This might be extended by adding a further section storing Throwables (or null) 
for the bad values in question (alt: store count/positions separately from null 
ones so null values would not need to be stored). 

Don't get me wrong, there is nothing with throwing exceptions (or catching them 
for that matter). Rather, I see a use cases for both "do it right or bust" vs. 
the explorative "show me what happens if I try this operation on these values" 
-- not unlike how languages as Ruby/Elixir might distinguish unsafe methods 
using a bang ('!') from their safe variants that should not throw global 
exceptions.

I'm sort 

[jira] [Updated] (SPARK-11716) UDFRegistration Drops Input Type Information

2015-11-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11716:
--
Assignee: Yin Huai

> UDFRegistration Drops Input Type Information
> 
>
> Key: SPARK-11716
> URL: https://issues.apache.org/jira/browse/SPARK-11716
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Artjom Metro
>Assignee: Yin Huai
>  Labels: sql, udf
> Fix For: 1.6.0
>
>
> The UserDefinedFunction returned by the UDFRegistration does not contain the 
> input type information, although that information is available.
> To fix the issue, the last line of every register function would have to be 
> changed to "UserDefinedFunction(func, dataType, inputType)". Or is there a 
> specific reason this was not done?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11910) Streaming programming guide references wrong dependency version

2015-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021059#comment-15021059
 ] 

Apache Spark commented on SPARK-11910:
--

User 'lresende' has created a pull request for this issue:
https://github.com/apache/spark/pull/9892

> Streaming programming guide references wrong dependency version
> ---
>
> Key: SPARK-11910
> URL: https://issues.apache.org/jira/browse/SPARK-11910
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Streaming
>Affects Versions: 1.6.0
>Reporter: Luciano Resende
>Priority: Minor
>
> SPARK-11245 have upgraded twitter dependency to 4.0.4



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11910) Streaming programming guide references wrong dependency version

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11910:


Assignee: Apache Spark

> Streaming programming guide references wrong dependency version
> ---
>
> Key: SPARK-11910
> URL: https://issues.apache.org/jira/browse/SPARK-11910
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Streaming
>Affects Versions: 1.6.0
>Reporter: Luciano Resende
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK-11245 have upgraded twitter dependency to 4.0.4



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10521) Utilize Docker to test DB2 JDBC Dialect support

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10521:


Assignee: (was: Apache Spark)

> Utilize Docker to test DB2 JDBC Dialect support
> ---
>
> Key: SPARK-10521
> URL: https://issues.apache.org/jira/browse/SPARK-10521
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Luciano Resende
>
> There was a discussion in SPARK-10170 around using a docker image to execute 
> the DB2 JDBC dialect tests. I will use this jira to work on providing the 
> basic image together with the test integration. We can then extend the 
> testing coverage as needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10521) Utilize Docker to test DB2 JDBC Dialect support

2015-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021061#comment-15021061
 ] 

Apache Spark commented on SPARK-10521:
--

User 'lresende' has created a pull request for this issue:
https://github.com/apache/spark/pull/9893

> Utilize Docker to test DB2 JDBC Dialect support
> ---
>
> Key: SPARK-10521
> URL: https://issues.apache.org/jira/browse/SPARK-10521
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Luciano Resende
>
> There was a discussion in SPARK-10170 around using a docker image to execute 
> the DB2 JDBC dialect tests. I will use this jira to work on providing the 
> basic image together with the test integration. We can then extend the 
> testing coverage as needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10521) Utilize Docker to test DB2 JDBC Dialect support

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10521:


Assignee: Apache Spark

> Utilize Docker to test DB2 JDBC Dialect support
> ---
>
> Key: SPARK-10521
> URL: https://issues.apache.org/jira/browse/SPARK-10521
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Luciano Resende
>Assignee: Apache Spark
>
> There was a discussion in SPARK-10170 around using a docker image to execute 
> the DB2 JDBC dialect tests. I will use this jira to work on providing the 
> basic image together with the test integration. We can then extend the 
> testing coverage as needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11910) Streaming programming guide references wrong dependency version

2015-11-22 Thread Luciano Resende (JIRA)
Luciano Resende created SPARK-11910:
---

 Summary: Streaming programming guide references wrong dependency 
version
 Key: SPARK-11910
 URL: https://issues.apache.org/jira/browse/SPARK-11910
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Streaming
Affects Versions: 1.6.0
Reporter: Luciano Resende
Priority: Minor


SPARK-11245 upgraded the Twitter dependency to 4.0.4, but the streaming 
programming guide still references the old version.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11910) Streaming programming guide references wrong dependency version

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11910:


Assignee: (was: Apache Spark)

> Streaming programming guide references wrong dependency version
> ---
>
> Key: SPARK-11910
> URL: https://issues.apache.org/jira/browse/SPARK-11910
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Streaming
>Affects Versions: 1.6.0
>Reporter: Luciano Resende
>Priority: Minor
>
> SPARK-11245 have upgraded twitter dependency to 4.0.4



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11908) Add NullType support to RowEncoder

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11908:


Assignee: Apache Spark

> Add NullType support to RowEncoder
> --
>
> Key: SPARK-11908
> URL: https://issues.apache.org/jira/browse/SPARK-11908
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> We should add NullType support to RowEncoder.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11909) Spark Standalone's master URL accepts URLs without port (assuming default 7077)

2015-11-22 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-11909:
---

 Summary: Spark Standalone's master URL accepts URLs without port 
(assuming default 7077)
 Key: SPARK-11909
 URL: https://issues.apache.org/jira/browse/SPARK-11909
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Jacek Laskowski
Priority: Trivial


It's currently impossible to use a {{spark://localhost}} URL for Spark 
Standalone's master. If this were supported, there would be less to know to get 
started with the mode (and hence better user friendliness).

I think a no-port master URL should be supported, assuming the default port 
{{7077}}.

{code}
org.apache.spark.SparkException: Invalid master URL: spark://localhost
at 
org.apache.spark.util.Utils$.extractHostPortFromSparkUrl(Utils.scala:2088)
at org.apache.spark.rpc.RpcAddress$.fromSparkURL(RpcAddress.scala:47)
at 
org.apache.spark.deploy.client.AppClient$$anonfun$1.apply(AppClient.scala:48)
at 
org.apache.spark.deploy.client.AppClient$$anonfun$1.apply(AppClient.scala:48)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.deploy.client.AppClient.(AppClient.scala:48)
at 
org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.start(SparkDeploySchedulerBackend.scala:93)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at org.apache.spark.SparkContext.(SparkContext.scala:530)
{code}
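
A minimal sketch of the proposed behaviour (illustrative only; this is not the 
actual Utils.extractHostPortFromSparkUrl implementation):

{code}
import java.net.URI

// Accept spark://host as well as spark://host:port, falling back to 7077
// when the port is omitted.
def extractHostPort(sparkUrl: String, defaultPort: Int = 7077): (String, Int) = {
  val uri = new URI(sparkUrl)
  require(uri.getScheme == "spark" && uri.getHost != null,
    s"Invalid master URL: $sparkUrl")
  val port = if (uri.getPort == -1) defaultPort else uri.getPort
  (uri.getHost, port)
}

// extractHostPort("spark://localhost")      => ("localhost", 7077)
// extractHostPort("spark://localhost:7077") => ("localhost", 7077)
{code}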



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11909) Spark Standalone's master URL accepts URLs without port (assuming default 7077)

2015-11-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020976#comment-15020976
 ] 

Sean Owen commented on SPARK-11909:
---

I disagree. The default is not a well-known port like 80 for HTTP. It makes 
sense to avoid confusion by explicitly stating the port, as with launching the 
master.

> Spark Standalone's master URL accepts URLs without port (assuming default 
> 7077)
> ---
>
> Key: SPARK-11909
> URL: https://issues.apache.org/jira/browse/SPARK-11909
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> It's currently impossible to use {{spark://localhost}} URL for Spark 
> Standalone's master. With the feature supported, it'd be less to know to get 
> started with the mode (and hence improve user friendliness).
> I think no-port master URL should be supported and assume the default port 
> {{7077}}.
> {code}
> org.apache.spark.SparkException: Invalid master URL: spark://localhost
>   at 
> org.apache.spark.util.Utils$.extractHostPortFromSparkUrl(Utils.scala:2088)
>   at org.apache.spark.rpc.RpcAddress$.fromSparkURL(RpcAddress.scala:47)
>   at 
> org.apache.spark.deploy.client.AppClient$$anonfun$1.apply(AppClient.scala:48)
>   at 
> org.apache.spark.deploy.client.AppClient$$anonfun$1.apply(AppClient.scala:48)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at org.apache.spark.deploy.client.AppClient.(AppClient.scala:48)
>   at 
> org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.start(SparkDeploySchedulerBackend.scala:93)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
>   at org.apache.spark.SparkContext.(SparkContext.scala:530)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11065) IOException thrown at job submit shutdown

2015-11-22 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020982#comment-15020982
 ] 

Jean-Baptiste Onofré commented on SPARK-11065:
--

It's not really a problem, but IMHO it's a bit annoying and can confuse users 
(they may think there is a real problem).

Let me dig a bit to find the cause and submit a PR.

NB: it happens only with 1.6.0-SNAPSHOT; 1.5.x is fine.

> IOException thrown at job submit shutdown
> -
>
> Key: SPARK-11065
> URL: https://issues.apache.org/jira/browse/SPARK-11065
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.0
>Reporter: Jean-Baptiste Onofré
>Priority: Minor
>
> When submitting a job (for instance the JavaWordCount example), even if the job 
> works fine, we can see the following at the end of execution:
> {code}
> checkForCorruptJournalFiles="true": 1
> 15/10/12 16:31:12 INFO SparkUI: Stopped Spark web UI at 
> http://192.168.134.10:4040
> 15/10/12 16:31:12 INFO DAGScheduler: Stopping DAGScheduler
> 15/10/12 16:31:12 INFO SparkDeploySchedulerBackend: Shutting down all 
> executors
> 15/10/12 16:31:12 INFO SparkDeploySchedulerBackend: Asking each executor to 
> shut down
> 15/10/12 16:31:12 INFO MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 15/10/12 16:31:12 INFO MemoryStore: MemoryStore cleared
> 15/10/12 16:31:12 INFO BlockManager: BlockManager stopped
> 15/10/12 16:31:12 INFO BlockManagerMaster: BlockManagerMaster stopped
> 15/10/12 16:31:12 INFO 
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 15/10/12 16:31:12 ERROR TransportResponseHandler: Still have 1 requests 
> outstanding when connection from localhost/127.0.0.1:7077 is closed
> 15/10/12 16:31:12 ERROR NettyRpcEnv: Exception when sending 
> RequestMessage(192.168.134.10:40548,NettyRpcEndpointRef(spark://Master@localhost:7077),UnregisterApplication(app-20151012163109-),false)
> java.io.IOException: Connection from localhost/127.0.0.1:7077 closed
> at 
> org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:104)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:91)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
> at 
> io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
> at 
> io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
> at 
> io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:739)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:659)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> 15/10/12 16:31:12 INFO RemoteActorRefProvider$RemotingTerminator: Shutting 
> down remote daemon.
> 15/10/12 16:31:12 INFO RemoteActorRefProvider$RemotingTerminator: Remote 
> daemon shut down; proceeding with flushing remote transports.
> 15/10/12 16:31:12 INFO SparkContext: Successfully stopped SparkContext
> 15/10/12 16:31:12 INFO ShutdownHookManager: Shutdown hook called
> 15/10/12 16:31:12 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-81bc4324-1268-4e54-bdd2-f7a2a36dafd4
> {code}
> I am going to investigate this and will submit a PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (SPARK-11908) Add NullType support to RowEncoder

2015-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020935#comment-15020935
 ] 

Apache Spark commented on SPARK-11908:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/9891

> Add NullType support to RowEncoder
> --
>
> Key: SPARK-11908
> URL: https://issues.apache.org/jira/browse/SPARK-11908
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> We should add NullType support to RowEncoder.
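
For context, a minimal sketch (an illustration only, not the actual RowEncoder change) of where a {{NullType}} field shows up in a schema that RowEncoder has to handle, e.g. a column produced from a bare null literal:

{code}
// Hedged sketch: shows a schema containing a NullType field; it does not
// implement the RowEncoder change itself.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, NullType, StructField, StructType}

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("placeholder", NullType, nullable = true)))

val row = Row(1, null)  // a NullType column can only ever hold null
{code}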



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11908) Add NullType support to RowEncoder

2015-11-22 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-11908:
---

 Summary: Add NullType support to RowEncoder
 Key: SPARK-11908
 URL: https://issues.apache.org/jira/browse/SPARK-11908
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh


We should add NullType support to RowEncoder.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11908) Add NullType support to RowEncoder

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11908:


Assignee: (was: Apache Spark)

> Add NullType support to RowEncoder
> --
>
> Key: SPARK-11908
> URL: https://issues.apache.org/jira/browse/SPARK-11908
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> We should add NullType support to RowEncoder.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11065) IOException thrown at job submit shutdown

2015-11-22 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020981#comment-15020981
 ] 

Maciej Bryński commented on SPARK-11065:


[~srowen]
OK. But this issue is new in 1.6.0.

> IOException thrown at job submit shutdown
> -
>
> Key: SPARK-11065
> URL: https://issues.apache.org/jira/browse/SPARK-11065
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.0
>Reporter: Jean-Baptiste Onofré
>Priority: Minor
>
> When submitting a job (for instance the JavaWordCount example), even if the job 
> works fine, we can see the following at the end of execution:
> {code}
> checkForCorruptJournalFiles="true": 1
> 15/10/12 16:31:12 INFO SparkUI: Stopped Spark web UI at 
> http://192.168.134.10:4040
> 15/10/12 16:31:12 INFO DAGScheduler: Stopping DAGScheduler
> 15/10/12 16:31:12 INFO SparkDeploySchedulerBackend: Shutting down all 
> executors
> 15/10/12 16:31:12 INFO SparkDeploySchedulerBackend: Asking each executor to 
> shut down
> 15/10/12 16:31:12 INFO MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 15/10/12 16:31:12 INFO MemoryStore: MemoryStore cleared
> 15/10/12 16:31:12 INFO BlockManager: BlockManager stopped
> 15/10/12 16:31:12 INFO BlockManagerMaster: BlockManagerMaster stopped
> 15/10/12 16:31:12 INFO 
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 15/10/12 16:31:12 ERROR TransportResponseHandler: Still have 1 requests 
> outstanding when connection from localhost/127.0.0.1:7077 is closed
> 15/10/12 16:31:12 ERROR NettyRpcEnv: Exception when sending 
> RequestMessage(192.168.134.10:40548,NettyRpcEndpointRef(spark://Master@localhost:7077),UnregisterApplication(app-20151012163109-),false)
> java.io.IOException: Connection from localhost/127.0.0.1:7077 closed
> at 
> org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:104)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:91)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
> at 
> io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
> at 
> io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
> at 
> io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:739)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:659)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> 15/10/12 16:31:12 INFO RemoteActorRefProvider$RemotingTerminator: Shutting 
> down remote daemon.
> 15/10/12 16:31:12 INFO RemoteActorRefProvider$RemotingTerminator: Remote 
> daemon shut down; proceeding with flushing remote transports.
> 15/10/12 16:31:12 INFO SparkContext: Successfully stopped SparkContext
> 15/10/12 16:31:12 INFO ShutdownHookManager: Shutdown hook called
> 15/10/12 16:31:12 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-81bc4324-1268-4e54-bdd2-f7a2a36dafd4
> {code}
> I am going to investigate this and will submit a PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-11826) Subtract BlockMatrix

2015-11-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020979#comment-15020979
 ] 

Sean Owen commented on SPARK-11826:
---

OK, given the existence of add(), this probably makes some sense for 
completeness. It's minor, so best to keep the implementation light. Can you 
implement add() and subtract() in terms of one common function that takes an 
associative operation on matrices in Breeze?
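
A minimal sketch of that idea, using plain Breeze matrices and made-up names (this is not the actual BlockMatrix code): both operations delegate to one helper that takes the element-wise binary operation.

{code}
// Hedged sketch with illustrative names; not Spark's BlockMatrix implementation.
import breeze.linalg.DenseMatrix

// One common combinator that takes the element-wise operation...
def combine(a: DenseMatrix[Double], b: DenseMatrix[Double])
           (op: (DenseMatrix[Double], DenseMatrix[Double]) => DenseMatrix[Double]): DenseMatrix[Double] = {
  require(a.rows == b.rows && a.cols == b.cols, "Matrices must have the same dimensions")
  op(a, b)
}

// ...so add() and subtract() become one-liners sharing the shape check.
def add(a: DenseMatrix[Double], b: DenseMatrix[Double]): DenseMatrix[Double] = combine(a, b)(_ + _)
def subtract(a: DenseMatrix[Double], b: DenseMatrix[Double]): DenseMatrix[Double] = combine(a, b)(_ - _)
{code}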

> Subtract BlockMatrix
> 
>
> Key: SPARK-11826
> URL: https://issues.apache.org/jira/browse/SPARK-11826
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Ehsan Mohyedin Kermani
>Priority: Minor
>
> It'd be more convenient to have a subtract method for BlockMatrices.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11065) IOException thrown at job submit shutdown

2015-11-22 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020939#comment-15020939
 ] 

Maciej Bryński commented on SPARK-11065:


I have the same error.
The job runs successfully, but this output is misleading.
{code}
15/11/22 11:51:47 ERROR TransportResponseHandler: Still have 1 requests 
outstanding when connection from XXX:7077 is closed
15/11/22 11:51:48 WARN NettyRpcEnv: Exception when sending 
RequestMessage(178.33.61.44:39524,NettyRpcEndpointRef(spark://Master@XXX:7077),UnregisterApplication(app-20151122110204-),false)
java.io.IOException: Connection from XXX:7077 closed
at 
org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:116)
at 
org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:94)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
at 
io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
at 
io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
at 
io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:739)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:659)
at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
{code}



> IOException thrown at job submit shutdown
> -
>
> Key: SPARK-11065
> URL: https://issues.apache.org/jira/browse/SPARK-11065
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.0
>Reporter: Jean-Baptiste Onofré
>Priority: Minor
>
> When submitting a job (for instance the JavaWordCount example), even if the job 
> works fine, we can see the following at the end of execution:
> {code}
> checkForCorruptJournalFiles="true": 1
> 15/10/12 16:31:12 INFO SparkUI: Stopped Spark web UI at 
> http://192.168.134.10:4040
> 15/10/12 16:31:12 INFO DAGScheduler: Stopping DAGScheduler
> 15/10/12 16:31:12 INFO SparkDeploySchedulerBackend: Shutting down all 
> executors
> 15/10/12 16:31:12 INFO SparkDeploySchedulerBackend: Asking each executor to 
> shut down
> 15/10/12 16:31:12 INFO MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 15/10/12 16:31:12 INFO MemoryStore: MemoryStore cleared
> 15/10/12 16:31:12 INFO BlockManager: BlockManager stopped
> 15/10/12 16:31:12 INFO BlockManagerMaster: BlockManagerMaster stopped
> 15/10/12 16:31:12 INFO 
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 15/10/12 16:31:12 ERROR TransportResponseHandler: Still have 1 requests 
> outstanding when connection from localhost/127.0.0.1:7077 is closed
> 15/10/12 16:31:12 ERROR NettyRpcEnv: Exception when sending 
> RequestMessage(192.168.134.10:40548,NettyRpcEndpointRef(spark://Master@localhost:7077),UnregisterApplication(app-20151012163109-),false)
> java.io.IOException: Connection from localhost/127.0.0.1:7077 closed
> at 
> org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:104)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:91)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
> at 
> 

[jira] [Commented] (SPARK-11065) IOException thrown at job submit shutdown

2015-11-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020977#comment-15020977
 ] 

Sean Owen commented on SPARK-11065:
---

Unless it's causing a problem, I'd ignore it. Shutdown is inherently somewhat 
asynchronous, and some components may complain if they lose a connection to 
another. In that case, the error should perhaps be downgraded to a warning.
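
A hedged sketch of that kind of change, with made-up helper names rather than Spark's actual NettyRpcEnv/AppClient code: when the application is already shutting down, a closed connection is expected and can be reported at WARN level instead of ERROR.

{code}
// Illustrative only; unregister and shuttingDown are hypothetical names.
import java.io.IOException

def safeUnregister(shuttingDown: Boolean)(unregister: () => Unit): Unit =
  try {
    unregister()
  } catch {
    // During shutdown a lost master connection is expected; don't log it as an error.
    case e: IOException if shuttingDown =>
      println(s"WARN: master connection already closed during shutdown: ${e.getMessage}")
  }
{code}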

> IOException thrown at job submit shutdown
> -
>
> Key: SPARK-11065
> URL: https://issues.apache.org/jira/browse/SPARK-11065
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.0
>Reporter: Jean-Baptiste Onofré
>Priority: Minor
>
> When submitting a job (for instance the JavaWordCount example), even if the job 
> works fine, we can see the following at the end of execution:
> {code}
> checkForCorruptJournalFiles="true": 1
> 15/10/12 16:31:12 INFO SparkUI: Stopped Spark web UI at 
> http://192.168.134.10:4040
> 15/10/12 16:31:12 INFO DAGScheduler: Stopping DAGScheduler
> 15/10/12 16:31:12 INFO SparkDeploySchedulerBackend: Shutting down all 
> executors
> 15/10/12 16:31:12 INFO SparkDeploySchedulerBackend: Asking each executor to 
> shut down
> 15/10/12 16:31:12 INFO MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 15/10/12 16:31:12 INFO MemoryStore: MemoryStore cleared
> 15/10/12 16:31:12 INFO BlockManager: BlockManager stopped
> 15/10/12 16:31:12 INFO BlockManagerMaster: BlockManagerMaster stopped
> 15/10/12 16:31:12 INFO 
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 15/10/12 16:31:12 ERROR TransportResponseHandler: Still have 1 requests 
> outstanding when connection from localhost/127.0.0.1:7077 is closed
> 15/10/12 16:31:12 ERROR NettyRpcEnv: Exception when sending 
> RequestMessage(192.168.134.10:40548,NettyRpcEndpointRef(spark://Master@localhost:7077),UnregisterApplication(app-20151012163109-),false)
> java.io.IOException: Connection from localhost/127.0.0.1:7077 closed
> at 
> org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:104)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:91)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
> at 
> io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
> at 
> io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
> at 
> io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:739)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:659)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> 15/10/12 16:31:12 INFO RemoteActorRefProvider$RemotingTerminator: Shutting 
> down remote daemon.
> 15/10/12 16:31:12 INFO RemoteActorRefProvider$RemotingTerminator: Remote 
> daemon shut down; proceeding with flushing remote transports.
> 15/10/12 16:31:12 INFO SparkContext: Successfully stopped SparkContext
> 15/10/12 16:31:12 INFO ShutdownHookManager: Shutdown hook called
> 15/10/12 16:31:12 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-81bc4324-1268-4e54-bdd2-f7a2a36dafd4
> {code}
> I am going to investigate this and will submit a PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-11906) Speculation Tasks Cause ProgressBar UI Overflow

2015-11-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020990#comment-15020990
 ] 

Sean Owen commented on SPARK-11906:
---

Yes, can you open a PR? It sounds like you have already identified the problem.

> Speculation Tasks Cause ProgressBar UI Overflow
> ---
>
> Key: SPARK-11906
> URL: https://issues.apache.org/jira/browse/SPARK-11906
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Sen Fang
>Priority: Trivial
>
> When there are speculative tasks in a stage, the number of started tasks plus 
> completed tasks can be greater than the total number of tasks. This causes the 
> started progress block to overflow to the next line. Visually, the light blue 
> progress block is no longer visible when this happens.
> The fix should be as trivial as capping the number of started tasks at the 
> total minus the number of completed tasks.
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L322
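
For reference, a hedged sketch of the cap described above (illustrative numbers, not the actual UIUtils code): the width of the started segment is limited so that started plus completed never exceeds the total.

{code}
// Illustrative values; the real counts come from the stage's task counters.
val total = 100
val completed = 90
val started = 25  // speculative tasks can push started + completed past total
val startedCapped = math.min(started, total - completed)  // 10: the bar stays within one line
{code}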



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11909) Spark Standalone's master URL accepts URLs without port (assuming default 7077)

2015-11-22 Thread Jacek Laskowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021153#comment-15021153
 ] 

Jacek Laskowski commented on SPARK-11909:
-

_"The default is not a well-known port like 80 for HTTP"_ - that's exactly the 
reason why I filed the issue. Since it's not well-known it's hard to remember 
it and hence not very easy for people new to Spark. I experienced the mental 
"pain" today when I started Spark Standalone and had to remember the number to 
create SparkContext properly. Less to remember => less confusion => more happy 
users.

> Spark Standalone's master URL accepts URLs without port (assuming default 
> 7077)
> ---
>
> Key: SPARK-11909
> URL: https://issues.apache.org/jira/browse/SPARK-11909
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> It's currently impossible to use a {{spark://localhost}} URL for Spark 
> Standalone's master. With this feature supported, there would be less to know 
> to get started with the mode (and hence better user friendliness).
> I think a no-port master URL should be supported and should assume the default 
> port {{7077}}.
> {code}
> org.apache.spark.SparkException: Invalid master URL: spark://localhost
>   at 
> org.apache.spark.util.Utils$.extractHostPortFromSparkUrl(Utils.scala:2088)
>   at org.apache.spark.rpc.RpcAddress$.fromSparkURL(RpcAddress.scala:47)
>   at 
> org.apache.spark.deploy.client.AppClient$$anonfun$1.apply(AppClient.scala:48)
>   at 
> org.apache.spark.deploy.client.AppClient$$anonfun$1.apply(AppClient.scala:48)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at org.apache.spark.deploy.client.AppClient.(AppClient.scala:48)
>   at 
> org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.start(SparkDeploySchedulerBackend.scala:93)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
>   at org.apache.spark.SparkContext.(SparkContext.scala:530)
> {code}
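
A minimal sketch of how a missing port could fall back to {{7077}} (a hypothetical helper for illustration, not Spark's {{Utils.extractHostPortFromSparkUrl}}):

{code}
import java.net.URI

// Hypothetical helper: parse spark://host[:port], defaulting to 7077 when no port is given.
def hostPortWithDefault(sparkUrl: String, defaultPort: Int = 7077): (String, Int) = {
  val uri = new URI(sparkUrl)
  require(uri.getScheme == "spark" && uri.getHost != null, s"Invalid master URL: $sparkUrl")
  val port = if (uri.getPort == -1) defaultPort else uri.getPort  // URI reports -1 for a missing port
  (uri.getHost, port)
}

// hostPortWithDefault("spark://localhost")      == ("localhost", 7077)
// hostPortWithDefault("spark://localhost:7078") == ("localhost", 7078)
{code}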



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11909) Spark Standalone's master URL accepts URLs without port (assuming default 7077)

2015-11-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021167#comment-15021167
 ] 

Sean Owen commented on SPARK-11909:
---

I think that cuts the other way. You're helping people not think about what 
port the master they're talking to is running on, which is probably more 
confusing than explicitly stating the port, especially if you accidentally talk 
to the wrong one somehow.

> Spark Standalone's master URL accepts URLs without port (assuming default 
> 7077)
> ---
>
> Key: SPARK-11909
> URL: https://issues.apache.org/jira/browse/SPARK-11909
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> It's currently impossible to use a {{spark://localhost}} URL for Spark 
> Standalone's master. With this feature supported, there would be less to know 
> to get started with the mode (and hence better user friendliness).
> I think a no-port master URL should be supported and should assume the default 
> port {{7077}}.
> {code}
> org.apache.spark.SparkException: Invalid master URL: spark://localhost
>   at 
> org.apache.spark.util.Utils$.extractHostPortFromSparkUrl(Utils.scala:2088)
>   at org.apache.spark.rpc.RpcAddress$.fromSparkURL(RpcAddress.scala:47)
>   at 
> org.apache.spark.deploy.client.AppClient$$anonfun$1.apply(AppClient.scala:48)
>   at 
> org.apache.spark.deploy.client.AppClient$$anonfun$1.apply(AppClient.scala:48)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at org.apache.spark.deploy.client.AppClient.(AppClient.scala:48)
>   at 
> org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.start(SparkDeploySchedulerBackend.scala:93)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
>   at org.apache.spark.SparkContext.(SparkContext.scala:530)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11783) When deployed against remote Hive metastore, HiveContext.executionHive points to wrong metastore

2015-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021082#comment-15021082
 ] 

Apache Spark commented on SPARK-11783:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/9895

> When deployed against remote Hive metastore, HiveContext.executionHive points 
> to wrong metastore
> 
>
> Key: SPARK-11783
> URL: https://issues.apache.org/jira/browse/SPARK-11783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0, 1.7.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Critical
>
> When using a remote metastore, the execution Hive client is somehow initialized 
> to point to the actual remote metastore instead of the dummy local Derby 
> metastore.
> To reproduce this issue:
> # Configure {{conf/hive-site.xml}} to point to a remote Hive 1.2.1 
> metastore.
> # Set {{hive.metastore.uris}} to {{thrift://localhost:9083}}.
> # Start the metastore service using {{$HIVE_HOME/bin/hive --service metastore}}.
> # Start the Thrift server with remote debugging options.
> # Attach the debugger to the Thrift server driver process; we can verify that 
> {{executionHive}} points to the remote metastore rather than the local 
> execution Derby metastore.
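
For steps 1 and 2 above, the relevant {{conf/hive-site.xml}} entry would look roughly like this (a sketch assuming the metastore from the description is listening on localhost:9083):

{code:xml}
<!-- conf/hive-site.xml: point Spark's Hive support at the remote metastore -->
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
</configuration>
{code}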



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11847) Model export/import for spark.ml: LDA

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11847:


Assignee: yuhao yang  (was: Apache Spark)

> Model export/import for spark.ml: LDA
> -
>
> Key: SPARK-11847
> URL: https://issues.apache.org/jira/browse/SPARK-11847
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: yuhao yang
>
> Add read/write support to LDA, similar to ALS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11783) When deployed against remote Hive metastore, HiveContext.executionHive points to wrong metastore

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11783:


Assignee: Cheng Lian  (was: Apache Spark)

> When deployed against remote Hive metastore, HiveContext.executionHive points 
> to wrong metastore
> 
>
> Key: SPARK-11783
> URL: https://issues.apache.org/jira/browse/SPARK-11783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0, 1.7.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Critical
>
> When using a remote metastore, the execution Hive client is somehow initialized 
> to point to the actual remote metastore instead of the dummy local Derby 
> metastore.
> To reproduce this issue:
> # Configure {{conf/hive-site.xml}} to point to a remote Hive 1.2.1 
> metastore.
> # Set {{hive.metastore.uris}} to {{thrift://localhost:9083}}.
> # Start the metastore service using {{$HIVE_HOME/bin/hive --service metastore}}.
> # Start the Thrift server with remote debugging options.
> # Attach the debugger to the Thrift server driver process; we can verify that 
> {{executionHive}} points to the remote metastore rather than the local 
> execution Derby metastore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11847) Model export/import for spark.ml: LDA

2015-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021083#comment-15021083
 ] 

Apache Spark commented on SPARK-11847:
--

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/9894

> Model export/import for spark.ml: LDA
> -
>
> Key: SPARK-11847
> URL: https://issues.apache.org/jira/browse/SPARK-11847
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: yuhao yang
>
> Add read/write support to LDA, similar to ALS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11783) When deployed against remote Hive metastore, HiveContext.executionHive points to wrong metastore

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11783:


Assignee: Apache Spark  (was: Cheng Lian)

> When deployed against remote Hive metastore, HiveContext.executionHive points 
> to wrong metastore
> 
>
> Key: SPARK-11783
> URL: https://issues.apache.org/jira/browse/SPARK-11783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0, 1.7.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>Priority: Critical
>
> When using a remote metastore, the execution Hive client is somehow initialized 
> to point to the actual remote metastore instead of the dummy local Derby 
> metastore.
> To reproduce this issue:
> # Configure {{conf/hive-site.xml}} to point to a remote Hive 1.2.1 
> metastore.
> # Set {{hive.metastore.uris}} to {{thrift://localhost:9083}}.
> # Start the metastore service using {{$HIVE_HOME/bin/hive --service metastore}}.
> # Start the Thrift server with remote debugging options.
> # Attach the debugger to the Thrift server driver process; we can verify that 
> {{executionHive}} points to the remote metastore rather than the local 
> execution Derby metastore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11847) Model export/import for spark.ml: LDA

2015-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11847:


Assignee: Apache Spark  (was: yuhao yang)

> Model export/import for spark.ml: LDA
> -
>
> Key: SPARK-11847
> URL: https://issues.apache.org/jira/browse/SPARK-11847
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> Add read/write support to LDA, similar to ALS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4514) SparkContext localProperties does not inherit property updates across thread reuse

2015-11-22 Thread Richard W. Eggert II (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021185#comment-15021185
 ] 

Richard W. Eggert II commented on SPARK-4514:
-

The unit test attached to this issue fails in master, but passes in 
https://github.com/apache/spark/pull/9264

> SparkContext localProperties does not inherit property updates across thread 
> reuse
> --
>
> Key: SPARK-4514
> URL: https://issues.apache.org/jira/browse/SPARK-4514
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.1.1, 1.2.0
>Reporter: Erik Erlandson
>Assignee: Josh Rosen
>Priority: Critical
>
> The current job group id of a Spark context is stored in the 
> {{localProperties}} member value.   This data structure is designed to be 
> thread local, and its settings are not preserved when {{ComplexFutureAction}} 
> instantiates a new {{Future}}.  
> One consequence of this is that {{takeAsync()}} does not behave in the same 
> way as other async actions, e.g. {{countAsync()}}.  For example, this test 
> (if copied into StatusTrackerSuite.scala) will fail because 
> {{"my-job-group2"}} is not propagated to the Future which actually 
> instantiates the job:
> {code:java}
>   test("getJobIdsForGroup() with takeAsync()") {
> sc = new SparkContext("local", "test", new SparkConf(false))
> sc.setJobGroup("my-job-group2", "description")
> sc.statusTracker.getJobIdsForGroup("my-job-group2") should be (Seq.empty)
> val firstJobFuture = sc.parallelize(1 to 1000, 1).takeAsync(1)
> val firstJobId = eventually(timeout(10 seconds)) {
>   firstJobFuture.jobIds.head
> }
> eventually(timeout(10 seconds)) {
>   sc.statusTracker.getJobIdsForGroup("my-job-group2") should be 
> (Seq(firstJobId))
> }
>   }
> {code}
> It also impacts current PR for SPARK-1021, which involves additional uses of 
> {{ComplexFutureAction}}.
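
To make the mechanism concrete, here is a small, self-contained sketch of the underlying thread-local problem (plain Scala futures with illustrative names, not Spark's ComplexFutureAction): a value set in a ThreadLocal on the calling thread is not visible from a Future running on a pool thread unless it is captured and re-installed explicitly.

{code}
// Hedged sketch only; the property name is borrowed for illustration.
import java.util.Properties
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object LocalPropsDemo extends App {
  val localProps = new ThreadLocal[Properties] {
    override def initialValue(): Properties = new Properties()
  }
  localProps.get().setProperty("spark.jobGroup.id", "my-job-group2")

  // Not captured: the pool thread running the Future gets its own, empty Properties.
  val lost = Future { Option(localProps.get().getProperty("spark.jobGroup.id")) }
  println(Await.result(lost, 10.seconds))  // None: the group id did not propagate

  // Captured on the calling thread and re-installed inside the Future.
  val captured = localProps.get()
  val kept = Future {
    localProps.set(captured)
    Option(localProps.get().getProperty("spark.jobGroup.id"))
  }
  println(Await.result(kept, 10.seconds))  // Some(my-job-group2)
}
{code}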



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11909) Spark Standalone's master URL accepts URLs without port (assuming default 7077)

2015-11-22 Thread Jacek Laskowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021186#comment-15021186
 ] 

Jacek Laskowski commented on SPARK-11909:
-

What about a WARN message showing the port used to connect to a Spark Standalone 
master, for users like me who want less to remember and type? It'd be a nice time 
saver. At the _very_ least it would spare the "recommendation" at 
http://spark.apache.org/docs/latest/spark-standalone.html#starting-a-cluster-manually
 which is actually false (as the master doesn't print out the URL to the 
console once started):

_Once started, the master will print out a spark://HOST:PORT URL for itself, 
which you can use to connect workers to it, or pass as the “master” argument to 
SparkContext. You can also find this URL on the master’s web UI, which is 
http://localhost:8080 by default._

> Spark Standalone's master URL accepts URLs without port (assuming default 
> 7077)
> ---
>
> Key: SPARK-11909
> URL: https://issues.apache.org/jira/browse/SPARK-11909
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> It's currently impossible to use a {{spark://localhost}} URL for Spark 
> Standalone's master. With this feature supported, there would be less to know 
> to get started with the mode (and hence better user friendliness).
> I think a no-port master URL should be supported and should assume the default 
> port {{7077}}.
> {code}
> org.apache.spark.SparkException: Invalid master URL: spark://localhost
>   at 
> org.apache.spark.util.Utils$.extractHostPortFromSparkUrl(Utils.scala:2088)
>   at org.apache.spark.rpc.RpcAddress$.fromSparkURL(RpcAddress.scala:47)
>   at 
> org.apache.spark.deploy.client.AppClient$$anonfun$1.apply(AppClient.scala:48)
>   at 
> org.apache.spark.deploy.client.AppClient$$anonfun$1.apply(AppClient.scala:48)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at org.apache.spark.deploy.client.AppClient.(AppClient.scala:48)
>   at 
> org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.start(SparkDeploySchedulerBackend.scala:93)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
>   at org.apache.spark.SparkContext.(SparkContext.scala:530)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11909) Spark Standalone's master URL accepts URLs without port (assuming default 7077)

2015-11-22 Thread Jacek Laskowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021186#comment-15021186
 ] 

Jacek Laskowski edited comment on SPARK-11909 at 11/22/15 8:51 PM:
---

What about a WARN message showing the port used to connect to a Spark Standalone 
master, for users like me who want less to remember and type? It'd be a nice time 
saver. At the _very_ least it would spare the "recommendation" at 
http://spark.apache.org/docs/latest/spark-standalone.html#starting-a-cluster-manually
 which is actually false (as the master doesn't print out the URL to the 
console once started):

{quote}
Once started, the master will print out a spark://HOST:PORT URL for itself, 
which you can use to connect workers to it, or pass as the “master” argument to 
SparkContext. You can also find this URL on the master’s web UI, which is 
http://localhost:8080 by default.
{quote}


was (Author: jlaskowski):
What about a WARN message showing the port used to connect to a Spark Standalone 
master, for users like me who want less to remember and type? It'd be a nice time 
saver. At the _very_ least it would spare the "recommendation" at 
http://spark.apache.org/docs/latest/spark-standalone.html#starting-a-cluster-manually
 which is actually false (as the master doesn't print out the URL to the 
console once started):

_Once started, the master will print out a spark://HOST:PORT URL for itself, 
which you can use to connect workers to it, or pass as the “master” argument to 
SparkContext. You can also find this URL on the master’s web UI, which is 
http://localhost:8080 by default._

> Spark Standalone's master URL accepts URLs without port (assuming default 
> 7077)
> ---
>
> Key: SPARK-11909
> URL: https://issues.apache.org/jira/browse/SPARK-11909
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> It's currently impossible to use a {{spark://localhost}} URL for Spark 
> Standalone's master. With this feature supported, there would be less to know 
> to get started with the mode (and hence better user friendliness).
> I think a no-port master URL should be supported and should assume the default 
> port {{7077}}.
> {code}
> org.apache.spark.SparkException: Invalid master URL: spark://localhost
>   at 
> org.apache.spark.util.Utils$.extractHostPortFromSparkUrl(Utils.scala:2088)
>   at org.apache.spark.rpc.RpcAddress$.fromSparkURL(RpcAddress.scala:47)
>   at 
> org.apache.spark.deploy.client.AppClient$$anonfun$1.apply(AppClient.scala:48)
>   at 
> org.apache.spark.deploy.client.AppClient$$anonfun$1.apply(AppClient.scala:48)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at org.apache.spark.deploy.client.AppClient.(AppClient.scala:48)
>   at 
> org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.start(SparkDeploySchedulerBackend.scala:93)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
>   at org.apache.spark.SparkContext.(SparkContext.scala:530)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4514) SparkContext localProperties does not inherit property updates across thread reuse

2015-11-22 Thread Richard W. Eggert II (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021185#comment-15021185
 ] 

Richard W. Eggert II edited comment on SPARK-4514 at 11/22/15 8:54 PM:
---

The unit test attached to this issue fails in master, but passes in 
https://github.com/apache/spark/pull/9264 , which is intended to fix SPARK-9026.


was (Author: reggert1980):
The unit test attached to this issue fails in master, but passes in 
https://github.com/apache/spark/pull/9264

> SparkContext localProperties does not inherit property updates across thread 
> reuse
> --
>
> Key: SPARK-4514
> URL: https://issues.apache.org/jira/browse/SPARK-4514
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.1.1, 1.2.0
>Reporter: Erik Erlandson
>Assignee: Josh Rosen
>Priority: Critical
>
> The current job group id of a Spark context is stored in the 
> {{localProperties}} member value.   This data structure is designed to be 
> thread local, and its settings are not preserved when {{ComplexFutureAction}} 
> instantiates a new {{Future}}.  
> One consequence of this is that {{takeAsync()}} does not behave in the same 
> way as other async actions, e.g. {{countAsync()}}.  For example, this test 
> (if copied into StatusTrackerSuite.scala) will fail because 
> {{"my-job-group2"}} is not propagated to the Future which actually 
> instantiates the job:
> {code:java}
>   test("getJobIdsForGroup() with takeAsync()") {
> sc = new SparkContext("local", "test", new SparkConf(false))
> sc.setJobGroup("my-job-group2", "description")
> sc.statusTracker.getJobIdsForGroup("my-job-group2") should be (Seq.empty)
> val firstJobFuture = sc.parallelize(1 to 1000, 1).takeAsync(1)
> val firstJobId = eventually(timeout(10 seconds)) {
>   firstJobFuture.jobIds.head
> }
> eventually(timeout(10 seconds)) {
>   sc.statusTracker.getJobIdsForGroup("my-job-group2") should be 
> (Seq(firstJobId))
> }
>   }
> {code}
> It also impacts current PR for SPARK-1021, which involves additional uses of 
> {{ComplexFutureAction}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4514) SparkContext localProperties does not inherit property updates across thread reuse

2015-11-22 Thread Richard W. Eggert II (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021187#comment-15021187
 ] 

Richard W. Eggert II commented on SPARK-4514:
-

This test, however, still fails:

{code}
 test("getJobIdsForGroup() with takeAsync() across multiple partitions") {
sc = new SparkContext("local", "test", new SparkConf(false))
sc.setJobGroup("my-job-group2", "description")
sc.statusTracker.getJobIdsForGroup("my-job-group2") shouldBe empty
val firstJobFuture = sc.parallelize(1 to 1000, 2).takeAsync(999)
val firstJobId = eventually(timeout(10 seconds)) {
  firstJobFuture.jobIds.head
}
eventually(timeout(10 seconds)) {
  sc.statusTracker.getJobIdsForGroup("my-job-group2") should have size 2
}
  }
{code}

> SparkContext localProperties does not inherit property updates across thread 
> reuse
> --
>
> Key: SPARK-4514
> URL: https://issues.apache.org/jira/browse/SPARK-4514
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.1.1, 1.2.0
>Reporter: Erik Erlandson
>Assignee: Josh Rosen
>Priority: Critical
>
> The current job group id of a Spark context is stored in the 
> {{localProperties}} member value.   This data structure is designed to be 
> thread local, and its settings are not preserved when {{ComplexFutureAction}} 
> instantiates a new {{Future}}.  
> One consequence of this is that {{takeAsync()}} does not behave in the same 
> way as other async actions, e.g. {{countAsync()}}.  For example, this test 
> (if copied into StatusTrackerSuite.scala) will fail because 
> {{"my-job-group2"}} is not propagated to the Future which actually 
> instantiates the job:
> {code:java}
>   test("getJobIdsForGroup() with takeAsync()") {
> sc = new SparkContext("local", "test", new SparkConf(false))
> sc.setJobGroup("my-job-group2", "description")
> sc.statusTracker.getJobIdsForGroup("my-job-group2") should be (Seq.empty)
> val firstJobFuture = sc.parallelize(1 to 1000, 1).takeAsync(1)
> val firstJobId = eventually(timeout(10 seconds)) {
>   firstJobFuture.jobIds.head
> }
> eventually(timeout(10 seconds)) {
>   sc.statusTracker.getJobIdsForGroup("my-job-group2") should be 
> (Seq(firstJobId))
> }
>   }
> {code}
> It also impacts current PR for SPARK-1021, which involves additional uses of 
> {{ComplexFutureAction}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4514) SparkContext localProperties does not inherit property updates across thread reuse

2015-11-22 Thread Richard W. Eggert II (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021194#comment-15021194
 ] 

Richard W. Eggert II commented on SPARK-4514:
-

I implemented a two-line fix in that PR, and this test now passes as well.

> SparkContext localProperties does not inherit property updates across thread 
> reuse
> --
>
> Key: SPARK-4514
> URL: https://issues.apache.org/jira/browse/SPARK-4514
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.1.1, 1.2.0
>Reporter: Erik Erlandson
>Assignee: Josh Rosen
>Priority: Critical
>
> The current job group id of a Spark context is stored in the 
> {{localProperties}} member value.   This data structure is designed to be 
> thread local, and its settings are not preserved when {{ComplexFutureAction}} 
> instantiates a new {{Future}}.  
> One consequence of this is that {{takeAsync()}} does not behave in the same 
> way as other async actions, e.g. {{countAsync()}}.  For example, this test 
> (if copied into StatusTrackerSuite.scala) will fail because 
> {{"my-job-group2"}} is not propagated to the Future which actually 
> instantiates the job:
> {code:java}
>   test("getJobIdsForGroup() with takeAsync()") {
> sc = new SparkContext("local", "test", new SparkConf(false))
> sc.setJobGroup("my-job-group2", "description")
> sc.statusTracker.getJobIdsForGroup("my-job-group2") should be (Seq.empty)
> val firstJobFuture = sc.parallelize(1 to 1000, 1).takeAsync(1)
> val firstJobId = eventually(timeout(10 seconds)) {
>   firstJobFuture.jobIds.head
> }
> eventually(timeout(10 seconds)) {
>   sc.statusTracker.getJobIdsForGroup("my-job-group2") should be 
> (Seq(firstJobId))
> }
>   }
> {code}
> It also impacts current PR for SPARK-1021, which involves additional uses of 
> {{ComplexFutureAction}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


