[jira] [Resolved] (SPARK-10786) SparkSQLCLIDriver should take the whole statement to generate the CommandProcessor

2015-11-02 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-10786.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8895
[https://github.com/apache/spark/pull/8895]

> SparkSQLCLIDriver should take the whole statement to generate the 
> CommandProcessor
> --
>
> Key: SPARK-10786
> URL: https://issues.apache.org/jira/browse/SPARK-10786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: SaintBacchus
>Priority: Minor
> Fix For: 1.6.0
>
>
> In the current implementation of SparkSQLCLIDriver.scala: 
> *val proc: CommandProcessor = CommandProcessorFactory.get(Array(tokens(0)), 
> hconf)*
> *CommandProcessorFactory* only takes the first token of the statement, which 
> makes it hard to distinguish the statement *delete jar xxx* from *delete from 
> xxx*.
> So it may be better to pass the whole statement into the 
> *CommandProcessorFactory*.
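
A minimal sketch of the proposed change, assuming the array-taking {{CommandProcessorFactory.get}} overload already used in the snippet above; {{statement}} stands for the full command string the driver is handling and {{hconf}} is the same HiveConf as in the snippet:

{code}
import org.apache.hadoop.hive.ql.processors.{CommandProcessor, CommandProcessorFactory}

// Sketch only: hand the factory every token so it can tell
// "delete jar xxx" (a Hive command) apart from "delete from xxx" (SQL).
val tokens: Array[String] = statement.trim.split("""\s+""")
val proc: CommandProcessor = CommandProcessorFactory.get(tokens, hconf)
{code}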



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11449) PortableDataStream should be a factory

2015-11-02 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-11449:
--
Summary: PortableDataStream should be a factory  (was: Improve 
documentation on close behavior of PortableDataStream)

> PortableDataStream should be a factory
> --
>
> Key: SPARK-11449
> URL: https://issues.apache.org/jira/browse/SPARK-11449
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Herman van Hovell
>Priority: Minor
>
> {{PortableDataStream}}'s close behavior caught me by surprise the other day. 
> I assumed incorrectly that closing the inputstream it provides would also 
> close the {{PortableDataStream}}. This leads to quite a confusing situation 
> when you try to reuse the {{PortableDataStream}}: the state of the 
> {{PortableDataStream}} indicates that it is open, whereas the underlying 
> inputstream is actually closed.
> I'd like either to improve the documentation, or add an {{InputStream}} 
> wrapper that closes the {{PortableDataStream}} when you close the 
> {{InputStream}}. Any thoughts?
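
A rough sketch of the wrapper idea, assuming a plain {{FilterInputStream}}; the class name is made up for illustration and does not exist in Spark:

{code}
import java.io.{FilterInputStream, InputStream}
import org.apache.spark.input.PortableDataStream

// Illustration only: closing the wrapped stream also closes the
// PortableDataStream that produced it, keeping the two states in sync.
class ClosingPortableStream(pds: PortableDataStream, in: InputStream)
  extends FilterInputStream(in) {

  override def close(): Unit = {
    try super.close()
    finally pds.close()
  }
}
{code}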



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2992) The transforms formerly known as non-lazy

2015-11-02 Thread Jacek Laskowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski updated SPARK-2992:
---
Description: 
An umbrella for a grab-bag of tickets involving lazy implementations of 
transformations formerly thought to be non-lazy.


  was:
An umbrella for a grab-bag of tickets involving lazy implementations of 
transforms formerly thought to be non-lazy.



> The transforms formerly known as non-lazy
> -
>
> Key: SPARK-2992
> URL: https://issues.apache.org/jira/browse/SPARK-2992
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Erik Erlandson
>
> An umbrella for a grab-bag of tickets involving lazy implementations of 
> transformations formerly thought to be non-lazy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9836) Provide R-like summary statistics for ordinary least squares via normal equation solver

2015-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9836:
---

Assignee: Yanbo Liang  (was: Apache Spark)

> Provide R-like summary statistics for ordinary least squares via normal 
> equation solver
> ---
>
> Key: SPARK-9836
> URL: https://issues.apache.org/jira/browse/SPARK-9836
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> In R, model fitting comes with summary statistics. We can provide most of 
> those via the normal equation solver (SPARK-9834). If some statistics require 
> additional passes over the dataset, we can expose an option to let users select 
> desired statistics before model fitting. 
> {code}
> > summary(model)
> Call:
> glm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
> Deviance Residuals: 
>      Min       1Q   Median       3Q      Max  
> -1.30711  -0.25713  -0.05325   0.19542   1.41253  
> Coefficients:
>   Estimate Std. Error t value Pr(>|t|)
> (Intercept) 2.2514 0.3698   6.089 9.57e-09 ***
> Sepal.Width 0.8036 0.1063   7.557 4.19e-12 ***
> Speciesversicolor   1.4587 0.1121  13.012  < 2e-16 ***
> Speciesvirginica1.9468 0.1000  19.465  < 2e-16 ***
> ---
> Signif. codes:  
> 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> (Dispersion parameter for gaussian family taken to be 0.1918059)
> Null deviance: 102.168  on 149  degrees of freedom
> Residual deviance:  28.004  on 146  degrees of freedom
> AIC: 183.94
> Number of Fisher Scoring iterations: 2
> {code}
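
For orientation only, a usage sketch of how such statistics might be surfaced on the fitted model; the solver parameter follows SPARK-9834, the accessors shown are assumptions rather than a committed API, and {{training}} is an existing DataFrame:

{code}
import org.apache.spark.ml.regression.LinearRegression

// Sketch under the assumption that the normal-equation solver attaches a
// training summary to the fitted model; accessor names are illustrative.
val lr = new LinearRegression().setSolver("normal")
val model = lr.fit(training)
val summary = model.summary
println(summary.objectiveHistory.mkString(", "))
{code}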



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9836) Provide R-like summary statistics for ordinary least squares via normal equation solver

2015-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9836:
---

Assignee: Apache Spark  (was: Yanbo Liang)

> Provide R-like summary statistics for ordinary least squares via normal 
> equation solver
> ---
>
> Key: SPARK-9836
> URL: https://issues.apache.org/jira/browse/SPARK-9836
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> In R, model fitting comes with summary statistics. We can provide most of 
> those via the normal equation solver (SPARK-9834). If some statistics require 
> additional passes over the dataset, we can expose an option to let users select 
> desired statistics before model fitting. 
> {code}
> > summary(model)
> Call:
> glm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
> Deviance Residuals: 
>      Min       1Q   Median       3Q      Max  
> -1.30711  -0.25713  -0.05325   0.19542   1.41253  
> Coefficients:
>   Estimate Std. Error t value Pr(>|t|)
> (Intercept) 2.2514 0.3698   6.089 9.57e-09 ***
> Sepal.Width 0.8036 0.1063   7.557 4.19e-12 ***
> Speciesversicolor   1.4587 0.1121  13.012  < 2e-16 ***
> Speciesvirginica1.9468 0.1000  19.465  < 2e-16 ***
> ---
> Signif. codes:  
> 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> (Dispersion parameter for gaussian family taken to be 0.1918059)
> Null deviance: 102.168  on 149  degrees of freedom
> Residual deviance:  28.004  on 146  degrees of freedom
> AIC: 183.94
> Number of Fisher Scoring iterations: 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9836) Provide R-like summary statistics for ordinary least squares via normal equation solver

2015-11-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985258#comment-14985258
 ] 

Apache Spark commented on SPARK-9836:
-

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/9413

> Provide R-like summary statistics for ordinary least squares via normal 
> equation solver
> ---
>
> Key: SPARK-9836
> URL: https://issues.apache.org/jira/browse/SPARK-9836
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> In R, model fitting comes with summary statistics. We can provide most of 
> those via the normal equation solver (SPARK-9834). If some statistics require 
> additional passes over the dataset, we can expose an option to let users select 
> desired statistics before model fitting. 
> {code}
> > summary(model)
> Call:
> glm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
> Deviance Residuals: 
>      Min       1Q   Median       3Q      Max  
> -1.30711  -0.25713  -0.05325   0.19542   1.41253  
> Coefficients:
>   Estimate Std. Error t value Pr(>|t|)
> (Intercept) 2.2514 0.3698   6.089 9.57e-09 ***
> Sepal.Width 0.8036 0.1063   7.557 4.19e-12 ***
> Speciesversicolor   1.4587 0.1121  13.012  < 2e-16 ***
> Speciesvirginica1.9468 0.1000  19.465  < 2e-16 ***
> ---
> Signif. codes:  
> 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> (Dispersion parameter for gaussian family taken to be 0.1918059)
> Null deviance: 102.168  on 149  degrees of freedom
> Residual deviance:  28.004  on 146  degrees of freedom
> AIC: 183.94
> Number of Fisher Scoring iterations: 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11450) Add support for UnsafeRow to Expand

2015-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11450:


Assignee: Apache Spark

> Add support for UnsafeRow to Expand
> ---
>
> Key: SPARK-11450
> URL: https://issues.apache.org/jira/browse/SPARK-11450
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>Priority: Minor
>
> Add support for UnsafeRow in the Expand operator. This should be trivial to 
> accomplish, and this saves us from doing a ConvertToSafe step.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11371) Make "mean" an alias for "avg" operator

2015-11-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985433#comment-14985433
 ] 

Sean Owen commented on SPARK-11371:
---

I don't feel strongly about it, and it's not really my area. My instinct is to 
avoid supporting, in Spark's SQL dialect, keywords that are generally not 
supported in other SQL dialects, if possible. 

> Make "mean" an alias for "avg" operator
> ---
>
> Key: SPARK-11371
> URL: https://issues.apache.org/jira/browse/SPARK-11371
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Ted Yu
>Priority: Minor
> Attachments: spark-11371-v1.patch
>
>
> From Reynold in the thread 'Exception when using some aggregate operators'  
> (http://search-hadoop.com/m/q3RTt0xFr22nXB4/):
> I don't think these are bugs. The SQL standard for average is "avg", not 
> "mean". Similarly, a distinct count is supposed to be written as 
> "count(distinct col)", not "countDistinct(col)".
> We can, however, make "mean" an alias for "avg" to improve compatibility 
> between DataFrame and SQL.
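
A short sketch of the intended equivalence, assuming an existing DataFrame {{df}} with a numeric column {{value}} registered as table {{t}}; the DataFrame side already works today, and the SQL-side alias is what this issue proposes:

{code}
import org.apache.spark.sql.functions.{avg, mean}

// DataFrame API: both functions already exist and give the same result.
df.agg(avg("value")).show()
df.agg(mean("value")).show()

// SQL: "avg" is the standard form; "mean" is the proposed alias.
sqlContext.sql("SELECT avg(value) FROM t").show()
sqlContext.sql("SELECT mean(value) FROM t").show()   // would work after this change
{code}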



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11371) Make "mean" an alias for "avg" operator

2015-11-02 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985424#comment-14985424
 ] 

Ted Yu commented on SPARK-11371:


[~sowen]:
Do you think it is worth adding the alias?

Thanks

> Make "mean" an alias for "avg" operator
> ---
>
> Key: SPARK-11371
> URL: https://issues.apache.org/jira/browse/SPARK-11371
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Ted Yu
>Priority: Minor
> Attachments: spark-11371-v1.patch
>
>
> From Reynold in the thread 'Exception when using some aggregate operators'  
> (http://search-hadoop.com/m/q3RTt0xFr22nXB4/):
> I don't think these are bugs. The SQL standard for average is "avg", not 
> "mean". Similarly, a distinct count is supposed to be written as 
> "count(distinct col)", not "countDistinct(col)".
> We can, however, make "mean" an alias for "avg" to improve compatibility 
> between DataFrame and SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10978) Allow PrunedFilterScan to eliminate predicates from further evaluation

2015-11-02 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-10978:
--

Assignee: Cheng Lian

> Allow PrunedFilterScan to eliminate predicates from further evaluation
> --
>
> Key: SPARK-10978
> URL: https://issues.apache.org/jira/browse/SPARK-10978
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0, 1.5.0
>Reporter: Russell Alexander Spitzer
>Assignee: Cheng Lian
>Priority: Critical
>
> Currently PrunedFilterScan allows implementors to push down predicates to an 
> underlying datasource. This is done solely as an optimization as the 
> predicate will be reapplied on the Spark side as well. This allows for 
> bloom-filter like operations but ends up doing a redundant scan for those 
> sources which can do accurate pushdowns.
> In addition it makes it difficult for underlying sources to accept queries 
> which reference non-existent columns to provide ancillary functionality. In our 
> case we allow a solr query to be passed in via a non-existent solr_query column. 
> Since this column is not returned when Spark does a filter on "solr_query", 
> nothing passes. 
> Suggestion on the ML from [~marmbrus] 
> {quote}
> We have to try and maintain binary compatibility here, so probably the 
> easiest thing to do here would be to add a method to the class.  Perhaps 
> something like:
> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
> By default, this could return all filters so behavior would remain the same, 
> but specific implementations could override it.  There is still a chance that 
> this would conflict with existing methods, but hopefully that would not be a 
> problem in practice.
> {quote}
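
A sketch of how a source such as the Solr connector might implement the proposed hook, assuming it is added to the data source API; the predicate matching here is illustrative only:

{code}
import org.apache.spark.sql.sources.{EqualTo, Filter}

// Return only the filters this source cannot evaluate accurately itself;
// Spark would re-evaluate just those. Filters on the synthetic "solr_query"
// column are claimed as fully handled by the source.
def unhandledFilters(filters: Array[Filter]): Array[Filter] =
  filters.filterNot {
    case EqualTo("solr_query", _) => true
    case _                        => false
  }
{code}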



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11458) add word count example for Dataset

2015-11-02 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-11458:
---

 Summary: add word count example for Dataset
 Key: SPARK-11458
 URL: https://issues.apache.org/jira/browse/SPARK-11458
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
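
A minimal sketch of the kind of example being requested, assuming the experimental 1.6 Dataset API; the input path and lower-casing are arbitrary choices:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("DatasetWordCount"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Read lines as a Dataset[String], split into words, and count per word.
val lines = sqlContext.read.text("README.md").as[String]
val words = lines.flatMap(_.split(" ")).filter(_.nonEmpty)
val counts = words.groupBy(_.toLowerCase).count()
counts.collect().foreach { case (word, n) => println(s"$word: $n") }
{code}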






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11458) add word count example for Dataset

2015-11-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985324#comment-14985324
 ] 

Apache Spark commented on SPARK-11458:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/9415

> add word count example for Dataset
> --
>
> Key: SPARK-11458
> URL: https://issues.apache.org/jira/browse/SPARK-11458
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11458) add word count example for Dataset

2015-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11458:


Assignee: (was: Apache Spark)

> add word count example for Dataset
> --
>
> Key: SPARK-11458
> URL: https://issues.apache.org/jira/browse/SPARK-11458
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11458) add word count example for Dataset

2015-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11458:


Assignee: Apache Spark

> add word count example for Dataset
> --
>
> Key: SPARK-11458
> URL: https://issues.apache.org/jira/browse/SPARK-11458
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9357) Remove JoinedRow

2015-11-02 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985264#comment-14985264
 ] 

Herman van Hovell commented on SPARK-9357:
--

Is this ticket still relevant?

> Remove JoinedRow
> 
>
> Key: SPARK-9357
> URL: https://issues.apache.org/jira/browse/SPARK-9357
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Reporter: Reynold Xin
>
> JoinedRow was introduced to join two rows together, in aggregation (join key 
> and value), joins (left, right), window functions, etc.
> It aims to reduce the amount of data copied, but incurs branches when the row 
> is actually read. Given all the fields will be read almost all the time 
> (otherwise they get pruned out by the optimizer), the branch predictor cannot do 
> anything about those branches.
> I think a better way is just to remove this thing, and materialize the row 
> data directly.
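
A conceptual sketch of the branching cost described above (simplified, not Spark's actual JoinedRow):

{code}
// Every field access must branch to decide which underlying row owns the index.
class JoinedRowSketch(left: Array[Any], right: Array[Any]) {
  def get(i: Int): Any =
    if (i < left.length) left(i)            // branch taken on every read
    else right(i - left.length)
  def numFields: Int = left.length + right.length
}
{code}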



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6373) Add SSL/TLS for the Netty based BlockTransferService

2015-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6373:
---

Assignee: Apache Spark

> Add SSL/TLS for the Netty based BlockTransferService 
> -
>
> Key: SPARK-6373
> URL: https://issues.apache.org/jira/browse/SPARK-6373
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Shuffle
>Affects Versions: 1.2.1
>Reporter: Jeffrey Turpin
>Assignee: Apache Spark
>
> Add the ability to allow for secure communications (SSL/TLS) for the Netty 
> based BlockTransferService and the ExternalShuffleClient. This ticket will 
> hopefully start the conversation around potential designs... Below is a 
> reference to a WIP prototype which implements this functionality 
> (prototype)... I have attempted to disrupt as little code as possible and 
> tried to follow the current code structure (for the most part) in the areas I 
> modified. I also studied how Hadoop achieves encrypted shuffle 
> (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html)
> https://github.com/turp1twin/spark/commit/024b559f27945eb63068d1badf7f82e4e7c3621c



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6373) Add SSL/TLS for the Netty based BlockTransferService

2015-11-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985412#comment-14985412
 ] 

Apache Spark commented on SPARK-6373:
-

User 'turp1twin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9416

> Add SSL/TLS for the Netty based BlockTransferService 
> -
>
> Key: SPARK-6373
> URL: https://issues.apache.org/jira/browse/SPARK-6373
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Shuffle
>Affects Versions: 1.2.1
>Reporter: Jeffrey Turpin
>
> Add the ability to allow for secure communications (SSL/TLS) for the Netty 
> based BlockTransferService and the ExternalShuffleClient. This ticket will 
> hopefully start the conversation around potential designs... Below is a 
> reference to a WIP prototype which implements this functionality 
> (prototype)... I have attempted to disrupt as little code as possible and 
> tried to follow the current code structure (for the most part) in the areas I 
> modified. I also studied how Hadoop achieves encrypted shuffle 
> (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html)
> https://github.com/turp1twin/spark/commit/024b559f27945eb63068d1badf7f82e4e7c3621c



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6373) Add SSL/TLS for the Netty based BlockTransferService

2015-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6373:
---

Assignee: (was: Apache Spark)

> Add SSL/TLS for the Netty based BlockTransferService 
> -
>
> Key: SPARK-6373
> URL: https://issues.apache.org/jira/browse/SPARK-6373
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Shuffle
>Affects Versions: 1.2.1
>Reporter: Jeffrey Turpin
>
> Add the ability to allow for secure communications (SSL/TLS) for the Netty 
> based BlockTransferService and the ExternalShuffleClient. This ticket will 
> hopefully start the conversation around potential designs... Below is a 
> reference to a WIP prototype which implements this functionality 
> (prototype)... I have attempted to disrupt as little code as possible and 
> tried to follow the current code structure (for the most part) in the areas I 
> modified. I also studied how Hadoop achieves encrypted shuffle 
> (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html)
> https://github.com/turp1twin/spark/commit/024b559f27945eb63068d1badf7f82e4e7c3621c



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11450) Add support for UnsafeRow to Expand

2015-11-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985262#comment-14985262
 ] 

Apache Spark commented on SPARK-11450:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/9414

> Add support for UnsafeRow to Expand
> ---
>
> Key: SPARK-11450
> URL: https://issues.apache.org/jira/browse/SPARK-11450
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Priority: Minor
>
> Add support for UnsafeRow in the Expand operator. This should be trivial to 
> accomplish, and this saves us from doing a ConvertToSafe step.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11450) Add support for UnsafeRow to Expand

2015-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11450:


Assignee: (was: Apache Spark)

> Add support for UnsafeRow to Expand
> ---
>
> Key: SPARK-11450
> URL: https://issues.apache.org/jira/browse/SPARK-11450
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Priority: Minor
>
> Add support for UnsafeRow in the Expand operator. This should be trivial to 
> accomplish, and this saves us from doing a ConvertToSafe step.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11449) PortableDataStream should be a factory

2015-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11449:


Assignee: Apache Spark

> PortableDataStream should be a factory
> --
>
> Key: SPARK-11449
> URL: https://issues.apache.org/jira/browse/SPARK-11449
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>Priority: Minor
>
> {{PortableDataStream}}'s close behavior caught me by surprise the other day. 
> I assumed incorrectly that closing the inputstream it provides would also 
> close the {{PortableDataStream}}. This leads to quite a confusing situation 
> when you try to reuse the {{PortableDataStream}}: the state of the 
> {{PortableDataStream}} indicates that it is open, whereas the underlying 
> inputstream is actually closed.
> I'd like either to improve the documentation, or add an {{InputStream}} 
> wrapper that closes the {{PortableDataStream}} when you close the 
> {{InputStream}}. Any thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11449) PortableDataStream should be a factory

2015-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11449:


Assignee: (was: Apache Spark)

> PortableDataStream should be a factory
> --
>
> Key: SPARK-11449
> URL: https://issues.apache.org/jira/browse/SPARK-11449
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Herman van Hovell
>Priority: Minor
>
> {{PortableDataStream}}'s close behavior caught me by surprise the other day. 
> I assumed incorrectly that closing the inputstream it provides would also 
> close the {{PortableDataStream}}. This leads to quite a confusing situation 
> when you try to reuse the {{PortableDataStream}}: the state of the 
> {{PortableDataStream}} indicates that it is open, whereas the underlying 
> inputstream is actually closed.
> I'd like either to improve the documentation, or add an {{InputStream}} 
> wrapper that closes the {{PortableDataStream}} when you close the 
> {{InputStream}}. Any thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11447) Null comparison requires type information but type extraction fails for complex types

2015-11-02 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985569#comment-14985569
 ] 

kevin yu commented on SPARK-11447:
--

Hello Kapil:

When you say it doesn't work, do you mean that you got an exception? 

I tried Spark 1.5, and this Scala code works for me. 
Can you verify which Spark version you are running? I saw you put 1.5.1 there. 

//DOESN'T WORK
val filteredDF = df.filter(df("column") <=> (new Column(Literal(null))))

I ran this in my spark shell at the latest version:

scala> val filteredDF = df.filter(df("column") <=> (new Column(Literal(null))))
filteredDF: org.apache.spark.sql.DataFrame = [column: string]

thanks.
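
As a side note, not a fix for the typing question itself, plain null checks can be written without constructing a null literal at all, using the {{df}} from the description below:

{code}
// Column API null predicates; no literal, so no type information is needed.
val nullRows    = df.filter(df("column").isNull)
val nonNullRows = df.filter(df("column").isNotNull)
{code}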



> Null comparison requires type information but type extraction fails for 
> complex types
> -
>
> Key: SPARK-11447
> URL: https://issues.apache.org/jira/browse/SPARK-11447
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Kapil Singh
>
> While comparing a Column to a null literal, comparison works only if type of 
> null literal matches type of the Column it's being compared to. Example scala 
> code (can be run from spark shell):
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.catalyst.expressions._
> val inputRowsData = Seq(Seq("abc"),Seq(null),Seq("xyz"))
> val inputRows = for(seq <- inputRowsData) yield Row.fromSeq(seq)
> val dfSchema = StructType(Seq(StructField("column", StringType, true)))
> val df = sqlContext.createDataFrame(sc.makeRDD(inputRows), dfSchema)
> //DOESN'T WORK
> val filteredDF = df.filter(df("column") <=> (new Column(Literal(null))))
> //WORKS
> val filteredDF = df.filter(df("column") <=> (new Column(Literal.create(null, 
> SparkleFunctions.dataType(df("column"))))))
> Why should type information be required for a null comparison? If it's 
> required, it's not always possible to extract type information from complex  
> types (e.g. StructType). Following scala code (can be run from spark shell), 
> throws org.apache.spark.sql.catalyst.analysis.UnresolvedException:
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.catalyst.expressions._
> val inputRowsData = Seq(Seq(Row.fromSeq(Seq("abc", 
> "def"))),Seq(Row.fromSeq(Seq(null, "123"))),Seq(Row.fromSeq(Seq("ghi", 
> "jkl"
> val inputRows = for(seq <- inputRowsData) yield Row.fromSeq(seq)
> val dfSchema = StructType(Seq(StructField("column", 
> StructType(Seq(StructField("p1", StringType, true), StructField("p2", 
> StringType, true))), true)))
> val filteredDF = df.filter(df("column")("p1") <=> (new 
> Column(Literal.create(null, SparkleFunctions.dataType(df("column")("p1"))))))
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> dataType on unresolved object, tree: column#0[p1]
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedExtractValue.dataType(unresolved.scala:243)
>   at 
> org.apache.spark.sql.ArithmeticFunctions$class.dataType(ArithmeticFunctions.scala:76)
>   at 
> org.apache.spark.sql.SparkleFunctions$.dataType(SparkleFunctions.scala:14)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:38)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:43)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:47)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:59)
>   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:61)
>   at $iwC$$iwC$$iwC.<init>(<console>:63)
>   at $iwC$$iwC.<init>(<console>:65)
>   at $iwC.<init>(<console>:67)
>   at <init>(<console>:69)
>   at .<init>(<console>:73)
>   at .<clinit>()
>   at .<init>(<console>:7)
>   at .<clinit>()
>   at $print(<console>)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>   at 
> 

[jira] [Commented] (SPARK-11457) Yarn AM proxy filter configuration should be reloaded when recovered from checkpoint

2015-11-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985187#comment-14985187
 ] 

Apache Spark commented on SPARK-11457:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/9412

> Yarn AM proxy filter configuration should be reloaded when recovered from 
> checkpoint
> 
>
> Key: SPARK-11457
> URL: https://issues.apache.org/jira/browse/SPARK-11457
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, YARN
>Affects Versions: 1.5.1
>Reporter: Saisai Shao
>
> Currently Yarn AM proxy filter configuration is recovered from checkpoint 
> file when Spark Streaming application is restarted, which will lead to some 
> unwanted behaviors:
> 1. Wrong RM address if RM is redeployed from failure.
> 2. Wrong proxyBase, since app id is updated, old app id for proxyBase is 
> wrong.
> So instead of recovering from checkpoint file, these configurations should be 
> reloaded each time the app starts.
> This problem only exists in Yarn cluster mode, for Yarn client mode, these 
> configurations will be updated with RPC message {{AddWebUIFilter}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11457) Yarn AM proxy filter configuration should be reloaded when recovered from checkpoint

2015-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11457:


Assignee: Apache Spark

> Yarn AM proxy filter configuration should be reloaded when recovered from 
> checkpoint
> 
>
> Key: SPARK-11457
> URL: https://issues.apache.org/jira/browse/SPARK-11457
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, YARN
>Affects Versions: 1.5.1
>Reporter: Saisai Shao
>Assignee: Apache Spark
>
> Currently Yarn AM proxy filter configuration is recovered from checkpoint 
> file when Spark Streaming application is restarted, which will lead to some 
> unwanted behaviors:
> 1. Wrong RM address if RM is redeployed from failure.
> 2. Wrong proxyBase, since app id is updated, old app id for proxyBase is 
> wrong.
> So instead of recovering from checkpoint file, these configurations should be 
> reloaded each time the app starts.
> This problem only exists in Yarn cluster mode, for Yarn client mode, these 
> configurations will be updated with RPC message {{AddWebUIFilter}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11457) Yarn AM proxy filter configuration should be reloaded when recovered from checkpoint

2015-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11457:


Assignee: (was: Apache Spark)

> Yarn AM proxy filter configuration should be reloaded when recovered from 
> checkpoint
> 
>
> Key: SPARK-11457
> URL: https://issues.apache.org/jira/browse/SPARK-11457
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, YARN
>Affects Versions: 1.5.1
>Reporter: Saisai Shao
>
> Currently Yarn AM proxy filter configuration is recovered from checkpoint 
> file when Spark Streaming application is restarted, which will lead to some 
> unwanted behaviors:
> 1. Wrong RM address if RM is redeployed from failure.
> 2. Wrong proxyBase, since app id is updated, old app id for proxyBase is 
> wrong.
> So instead of recovering from checkpoint file, these configurations should be 
> reloaded each time the app starts.
> This problem only exists in Yarn cluster mode, for Yarn client mode, these 
> configurations will be updated with RPC message {{AddWebUIFilter}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10997) Netty-based RPC env should support a "client-only" mode.

2015-11-02 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-10997.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 1.6.0

> Netty-based RPC env should support a "client-only" mode.
> 
>
> Key: SPARK-10997
> URL: https://issues.apache.org/jira/browse/SPARK-10997
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.6.0
>
>
> The new netty RPC still behaves too much like akka; it requires both client 
> (e.g. an executor) and server (e.g. the driver) to listen for incoming 
> connections.
> That is not necessary, since sockets are full-duplex and RPCs should be able 
> to flow either way on any connection. Also, because the semantics of the 
> netty-based RPC don't exactly match akka, you get weird issues like 
> SPARK-10987.
> Supporting a client-only mode also reduces the number of ports Spark apps 
> need to use.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11311) spark cannot describe temporary functions

2015-11-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11311:
--
Assignee: Adrian Wang

> spark cannot describe temporary functions
> -
>
> Key: SPARK-11311
> URL: https://issues.apache.org/jira/browse/SPARK-11311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Adrian Wang
>Assignee: Adrian Wang
> Fix For: 1.6.0
>
>
> create temporary function aa as ;
> describe function aa;
> Will return 'Unable to find function aa', which is not right.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9817) Improve the container placement strategy by considering the localities of pending container requests

2015-11-02 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-9817.
---
   Resolution: Fixed
 Assignee: Saisai Shao
Fix Version/s: 1.6.0

> Improve the container placement strategy by considering the localities of 
> pending container requests
> 
>
> Key: SPARK-9817
> URL: https://issues.apache.org/jira/browse/SPARK-9817
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 1.6.0
>
>
> The current implementation does not consider the localities of pending container 
> requests, since the required locality preferences of tasks shift from time 
> to time. It is better to discard outdated container requests and recalculate 
> them with the container placement strategy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11437) createDataFrame shouldn't .take() when provided schema

2015-11-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-11437.

   Resolution: Fixed
Fix Version/s: 1.6.1

Issue resolved by pull request 9392
[https://github.com/apache/spark/pull/9392]

> createDataFrame shouldn't .take() when provided schema
> --
>
> Key: SPARK-11437
> URL: https://issues.apache.org/jira/browse/SPARK-11437
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jason White
> Fix For: 1.6.1
>
>
> When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls 
> `.take(10)` to verify the first 10 rows of the RDD match the provided schema. 
> Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue 
> affected cases where a schema was not provided.
> Verifying the first 10 rows is of limited utility and causes the DAG to be 
> executed non-lazily. If necessary, I believe this verification should be done 
> lazily on all rows. However, since the caller is providing a schema to 
> follow, I think it's acceptable to simply fail if the schema is incorrect.
> https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11437) createDataFrame shouldn't .take() when provided schema

2015-11-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-11437:
---
Fix Version/s: (was: 1.6.1)
   1.6.0

> createDataFrame shouldn't .take() when provided schema
> --
>
> Key: SPARK-11437
> URL: https://issues.apache.org/jira/browse/SPARK-11437
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jason White
> Fix For: 1.6.0
>
>
> When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls 
> `.take(10)` to verify the first 10 rows of the RDD match the provided schema. 
> Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue 
> affected cases where a schema was not provided.
> Verifying the first 10 rows is of limited utility and causes the DAG to be 
> executed non-lazily. If necessary, I believe this verification should be done 
> lazily on all rows. However, since the caller is providing a schema to 
> follow, I think it's acceptable to simply fail if the schema is incorrect.
> https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7425) spark.ml Predictor should support other numeric types for label

2015-11-02 Thread Stefano Baghino (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985755#comment-14985755
 ] 

Stefano Baghino commented on SPARK-7425:


I would like to contribute to this issue. Is there an active PR I can somehow 
contribute commits to?

> spark.ml Predictor should support other numeric types for label
> ---
>
> Key: SPARK-7425
> URL: https://issues.apache.org/jira/browse/SPARK-7425
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>  Labels: starter
>
> Currently, the Predictor abstraction expects the input labelCol type to be 
> DoubleType, but we should support other numeric types.  This will involve 
> updating the PredictorParams.validateAndTransformSchema method.
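
A sketch of the kind of relaxation implied, assuming a hypothetical helper that casts any numeric label column to DoubleType before fitting; the helper name is made up for illustration:

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.DoubleType

// Hypothetical helper, not Spark code: accept any numeric label column by
// casting it to DoubleType before handing the data to a Predictor.
def withDoubleLabel(df: DataFrame, labelCol: String): DataFrame =
  df.withColumn(labelCol, df(labelCol).cast(DoubleType))
{code}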



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11333) Add the receiver's executor information to UI

2015-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11333:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Add the receiver's executor information to UI
> -
>
> Key: SPARK-11333
> URL: https://issues.apache.org/jira/browse/SPARK-11333
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> SPARK-11212 added the receiver's executor information internally. We can 
> expose it to ReceiverInfo and UI since it's helpful when there are multiple 
> executors running in the same host.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11333) Add the receiver's executor information to UI

2015-11-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985756#comment-14985756
 ] 

Apache Spark commented on SPARK-11333:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/9418

> Add the receiver's executor information to UI
> -
>
> Key: SPARK-11333
> URL: https://issues.apache.org/jira/browse/SPARK-11333
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> SPARK-11212 added the receiver's executor information internally. We can 
> expose it to ReceiverInfo and UI since it's helpful when there are multiple 
> executors running in the same host.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11333) Add the receiver's executor information to UI

2015-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11333:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Add the receiver's executor information to UI
> -
>
> Key: SPARK-11333
> URL: https://issues.apache.org/jira/browse/SPARK-11333
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> SPARK-11212 added the receiver's executor information internally. We can 
> expose it to ReceiverInfo and UI since it's helpful when there are multiple 
> executors running in the same host.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11371) Make "mean" an alias for "avg" operator

2015-11-02 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985651#comment-14985651
 ] 

Yin Huai commented on SPARK-11371:
--

Since the DataFrame API provides it as a function, I think it is fine to add it 
to the function registry. 

> Make "mean" an alias for "avg" operator
> ---
>
> Key: SPARK-11371
> URL: https://issues.apache.org/jira/browse/SPARK-11371
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Ted Yu
>Priority: Minor
> Attachments: spark-11371-v1.patch
>
>
> From Reynold in the thread 'Exception when using some aggregate operators'  
> (http://search-hadoop.com/m/q3RTt0xFr22nXB4/):
> I don't think these are bugs. The SQL standard for average is "avg", not 
> "mean". Similarly, a distinct count is supposed to be written as 
> "count(distinct col)", not "countDistinct(col)".
> We can, however, make "mean" an alias for "avg" to improve compatibility 
> between DataFrame and SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11198) Support record de-aggregation in KinesisReceiver

2015-11-02 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985701#comment-14985701
 ] 

Burak Yavuz commented on SPARK-11198:
-

Just tested this. It works during regular operation, but doesn't de-aggregate 
during recovery. The PR I added should de-aggregate in recovery.

> Support record de-aggregation in KinesisReceiver
> 
>
> Key: SPARK-11198
> URL: https://issues.apache.org/jira/browse/SPARK-11198
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Xiangrui Meng
>Assignee: Burak Yavuz
>
> We need to check/implement the support for record de-aggregation and 
> subsequence number. This is the documentation from AWS:
> http://docs.aws.amazon.com/kinesis/latest/dev/kinesis-kpl-consumer-deaggregation.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11398) misleading dialect conf at the start of spark-sql

2015-11-02 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-11398:
-
Description: 
1. def dialectClassName in HiveContext is unnecessary. 
In HiveContext, if conf.dialect == "hiveql", getSQLDialect() will return new 
HiveQLDialect(this);
else it will use super.getSQLDialect(). Then in super.getSQLDialect(), it calls 
dialectClassName, which is overridden in HiveContext and still returns 
super.dialectClassName.
So we'll never reach the code "classOf[HiveQLDialect].getCanonicalName" of def 
dialectClassName in HiveContext.

2. When we start bin/spark-sql, the default context is HiveContext, and the 
corresponding dialect is hiveql.
However, if we type "set spark.sql.dialect;", the result is "sql", which is 
inconsistent with the actual dialect and is misleading. For example, we can use 
sql like "create table" which is only allowed in hiveql, but this dialect conf 
shows it's "sql".
Although this problem will not cause any execution error, it's misleading to 
spark sql users. Therefore I think we should fix it.
In this pr, instead of overriding def dialect in conf of HiveContext, I set the 
SQLConf.DIALECT directly as "hiveql", such that result of "set 
spark.sql.dialect;" will be "hiveql", not "sql". After the change, we can still 
use "sql" as the dialect in HiveContext through "set spark.sql.dialect=sql". 
Then the conf.dialect in HiveContext will become sql. Because in SQLConf, def 
dialect = getConf(), and now the dialect in "settings" becomes "sql".

  was:
When we start bin/spark-sql, the default context is HiveContext, and the 
corresponding dialect is hiveql.
However, if we type "set spark.sql.dialect;", the result is "sql", which is 
inconsistent with the actual dialect and is misleading. For example, we can 
create tables which is only allowed in hiveql, but this dialect conf shows it's 
"sql".
Although this problem will not cause any execution error, it's misleading to 
spark sql users. Therefore I think we should fix it.


> misleading dialect conf at the start of spark-sql
> -
>
> Key: SPARK-11398
> URL: https://issues.apache.org/jira/browse/SPARK-11398
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Zhenhua Wang
>Priority: Minor
>
> 1. def dialectClassName in HiveContext is unnecessary. 
> In HiveContext, if conf.dialect == "hiveql", getSQLDialect() will return new 
> HiveQLDialect(this);
> else it will use super.getSQLDialect(). Then in super.getSQLDialect(), it 
> calls dialectClassName, which is overridden in HiveContext and still returns 
> super.dialectClassName.
> So we'll never reach the code "classOf[HiveQLDialect].getCanonicalName" of 
> def dialectClassName in HiveContext.
> 2. When we start bin/spark-sql, the default context is HiveContext, and the 
> corresponding dialect is hiveql.
> However, if we type "set spark.sql.dialect;", the result is "sql", which is 
> inconsistent with the actual dialect and is misleading. For example, we can 
> use sql like "create table" which is only allowed in hiveql, but this dialect 
> conf shows it's "sql".
> Although this problem will not cause any execution error, it's misleading to 
> spark sql users. Therefore I think we should fix it.
> In this pr, instead of overriding def dialect in conf of HiveContext, I set 
> the SQLConf.DIALECT directly as "hiveql", such that result of "set 
> spark.sql.dialect;" will be "hiveql", not "sql". After the change, we can 
> still use "sql" as the dialect in HiveContext through "set 
> spark.sql.dialect=sql". Then the conf.dialect in HiveContext will become sql. 
> Because in SQLConf, def dialect = getConf(), and now the dialect in 
> "settings" becomes "sql".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11398) unnecessary def dialectClassName in HiveContext, and misleading dialect conf at the start of spark-sql

2015-11-02 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-11398:
-
Summary: unnecessary def dialectClassName in HiveContext, and misleading 
dialect conf at the start of spark-sql  (was: misleading dialect conf at the 
start of spark-sql)

> unnecessary def dialectClassName in HiveContext, and misleading dialect conf 
> at the start of spark-sql
> --
>
> Key: SPARK-11398
> URL: https://issues.apache.org/jira/browse/SPARK-11398
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Zhenhua Wang
>Priority: Minor
>
> 1. def dialectClassName in HiveContext is unnecessary. 
> In HiveContext, if conf.dialect == "hiveql", getSQLDialect() will return new 
> HiveQLDialect(this);
> else it will use super.getSQLDialect(). Then in super.getSQLDialect(), it 
> calls dialectClassName, which is overridden in HiveContext and still returns 
> super.dialectClassName.
> So we'll never reach the code "classOf[HiveQLDialect].getCanonicalName" of 
> def dialectClassName in HiveContext.
> 2. When we start bin/spark-sql, the default context is HiveContext, and the 
> corresponding dialect is hiveql.
> However, if we type "set spark.sql.dialect;", the result is "sql", which is 
> inconsistent with the actual dialect and is misleading. For example, we can 
> use sql like "create table" which is only allowed in hiveql, but this dialect 
> conf shows it's "sql".
> Although this problem will not cause any execution error, it's misleading to 
> spark sql users. Therefore I think we should fix it.
> In this pr, instead of overriding def dialect in conf of HiveContext, I set 
> the SQLConf.DIALECT directly as "hiveql", such that result of "set 
> spark.sql.dialect;" will be "hiveql", not "sql". After the change, we can 
> still use "sql" as the dialect in HiveContext through "set 
> spark.sql.dialect=sql". Then the conf.dialect in HiveContext will become sql. 
> Because in SQLConf, def dialect = getConf(), and now the dialect in 
> "settings" becomes "sql".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11398) unnecessary def dialectClassName in HiveContext, and misleading dialect conf at the start of spark-sql

2015-11-02 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-11398:
-
Description: 
1. def dialectClassName in HiveContext is unnecessary. 

In HiveContext, if conf.dialect == "hiveql", getSQLDialect() will return new 
HiveQLDialect(this);
else it will use super.getSQLDialect(). Then in super.getSQLDialect(), it calls 
dialectClassName, which is overridden in HiveContext and still returns 
super.dialectClassName.

So we'll never reach the code "classOf[HiveQLDialect].getCanonicalName" of def 
dialectClassName in HiveContext.


2. When we start bin/spark-sql, the default context is HiveContext, and the 
corresponding dialect is hiveql.
However, if we type "set spark.sql.dialect;", the result is "sql", which is 
inconsistent with the actual dialect and is misleading. For example, we can run 
SQL such as "create table", which is only allowed in hiveql, but this dialect 
conf shows it's "sql".

Although this problem will not cause any execution error, it's misleading to 
Spark SQL users. Therefore I think we should fix it.

In this PR, instead of overriding def dialect in the conf of HiveContext, I set the 
SQLConf.DIALECT directly as "hiveql", such that the result of "set 
spark.sql.dialect;" will be "hiveql", not "sql". After the change, we can still 
use "sql" as the dialect in HiveContext through "set spark.sql.dialect=sql". 
Then the conf.dialect in HiveContext will become sql. Because in SQLConf, def 
dialect = getConf(), and now the dialect in "settings" becomes "sql".

  was:
1. def dialectClassName in HiveContext is unnecessary. 

In HiveContext, if conf.dialect == "hiveql", getSQLDialect() will return new 
HiveQLDialect(this);
else it will use super.getSQLDialect(). Then in super.getSQLDialect(), it calls 
dialectClassName, which is overridden in HiveContext and still returns 
super.dialectClassName.

So we'll never reach the code "classOf[HiveQLDialect].getCanonicalName" of def 
dialectClassName in HiveContext.

2. When we start bin/spark-sql, the default context is HiveContext, and the 
corresponding dialect is hiveql.
However, if we type "set spark.sql.dialect;", the result is "sql", which is 
inconsistent with the actual dialect and is misleading. For example, we can run 
SQL such as "create table", which is only allowed in hiveql, but this dialect 
conf shows it's "sql".

Although this problem will not cause any execution error, it's misleading to 
Spark SQL users. Therefore I think we should fix it.

In this PR, instead of overriding def dialect in the conf of HiveContext, I set the 
SQLConf.DIALECT directly as "hiveql", such that the result of "set 
spark.sql.dialect;" will be "hiveql", not "sql". After the change, we can still 
use "sql" as the dialect in HiveContext through "set spark.sql.dialect=sql". 
Then the conf.dialect in HiveContext will become sql. Because in SQLConf, def 
dialect = getConf(), and now the dialect in "settings" becomes "sql".


> unnecessary def dialectClassName in HiveContext, and misleading dialect conf 
> at the start of spark-sql
> --
>
> Key: SPARK-11398
> URL: https://issues.apache.org/jira/browse/SPARK-11398
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Zhenhua Wang
>Priority: Minor
>
> 1. def dialectClassName in HiveContext is unnecessary. 
> In HiveContext, if conf.dialect == "hiveql", getSQLDialect() will return new 
> HiveQLDialect(this);
> else it will use super.getSQLDialect(). Then in super.getSQLDialect(), it 
> calls dialectClassName, which is overridden in HiveContext and still returns 
> super.dialectClassName.
> So we'll never reach the code "classOf[HiveQLDialect].getCanonicalName" of 
> def dialectClassName in HiveContext.
> 2. When we start bin/spark-sql, the default context is HiveContext, and the 
> corresponding dialect is hiveql.
> However, if we type "set spark.sql.dialect;", the result is "sql", which is 
> inconsistent with the actual dialect and is misleading. For example, we can 
> run SQL such as "create table", which is only allowed in hiveql, but this 
> dialect conf shows it's "sql".
> Although this problem will not cause any execution error, it's misleading to 
> Spark SQL users. Therefore I think we should fix it.
> In this PR, instead of overriding def dialect in the conf of HiveContext, I set 
> the SQLConf.DIALECT directly as "hiveql", such that the result of "set 
> spark.sql.dialect;" will be "hiveql", not "sql". After the change, we can 
> still use "sql" as the dialect in HiveContext through "set 
> spark.sql.dialect=sql". Then the conf.dialect in HiveContext will become sql. 
> Because in SQLConf, def dialect = getConf(), and now the dialect in 
> "settings" becomes "sql".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (SPARK-11398) unnecessary def dialectClassName in HiveContext, and misleading dialect conf at the start of spark-sql

2015-11-02 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-11398:
-
Description: 
1. def dialectClassName in HiveContext is unnecessary. 

In HiveContext, if conf.dialect == "hiveql", getSQLDialect() will return new 
HiveQLDialect(this);
else it will use super.getSQLDialect(). Then in super.getSQLDialect(), it calls 
dialectClassName, which is overridden in HiveContext and still returns 
super.dialectClassName.

So we'll never reach the code "classOf[HiveQLDialect].getCanonicalName" of def 
dialectClassName in HiveContext.

2. When we start bin/spark-sql, the default context is HiveContext, and the 
corresponding dialect is hiveql.
However, if we type "set spark.sql.dialect;", the result is "sql", which is 
inconsistent with the actual dialect and is misleading. For example, we can run 
SQL such as "create table", which is only allowed in hiveql, but this dialect 
conf shows it's "sql".

Although this problem will not cause any execution error, it's misleading to 
Spark SQL users. Therefore I think we should fix it.

In this PR, instead of overriding def dialect in the conf of HiveContext, I set the 
SQLConf.DIALECT directly as "hiveql", such that the result of "set 
spark.sql.dialect;" will be "hiveql", not "sql". After the change, we can still 
use "sql" as the dialect in HiveContext through "set spark.sql.dialect=sql". 
Then the conf.dialect in HiveContext will become sql. Because in SQLConf, def 
dialect = getConf(), and now the dialect in "settings" becomes "sql".

  was:
1. def dialectClassName in HiveContext is unnecessary. 
In HiveContext, if conf.dialect == "hiveql", getSQLDialect() will return new 
HiveQLDialect(this);
else it will use super.getSQLDialect(). Then in super.getSQLDialect(), it calls 
dialectClassName, which is overridden in HiveContext and still returns 
super.dialectClassName.
So we'll never reach the code "classOf[HiveQLDialect].getCanonicalName" of def 
dialectClassName in HiveContext.

2. When we start bin/spark-sql, the default context is HiveContext, and the 
corresponding dialect is hiveql.
However, if we type "set spark.sql.dialect;", the result is "sql", which is 
inconsistent with the actual dialect and is misleading. For example, we can run 
SQL such as "create table", which is only allowed in hiveql, but this dialect 
conf shows it's "sql".
Although this problem will not cause any execution error, it's misleading to 
Spark SQL users. Therefore I think we should fix it.
In this PR, instead of overriding def dialect in the conf of HiveContext, I set the 
SQLConf.DIALECT directly as "hiveql", such that the result of "set 
spark.sql.dialect;" will be "hiveql", not "sql". After the change, we can still 
use "sql" as the dialect in HiveContext through "set spark.sql.dialect=sql". 
Then the conf.dialect in HiveContext will become sql. Because in SQLConf, def 
dialect = getConf(), and now the dialect in "settings" becomes "sql".


> unnecessary def dialectClassName in HiveContext, and misleading dialect conf 
> at the start of spark-sql
> --
>
> Key: SPARK-11398
> URL: https://issues.apache.org/jira/browse/SPARK-11398
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Zhenhua Wang
>Priority: Minor
>
> 1. def dialectClassName in HiveContext is unnecessary. 
> In HiveContext, if conf.dialect == "hiveql", getSQLDialect() will return new 
> HiveQLDialect(this);
> else it will use super.getSQLDialect(). Then in super.getSQLDialect(), it 
> calls dialectClassName, which is overridden in HiveContext and still returns 
> super.dialectClassName.
> So we'll never reach the code "classOf[HiveQLDialect].getCanonicalName" of 
> def dialectClassName in HiveContext.
> 2. When we start bin/spark-sql, the default context is HiveContext, and the 
> corresponding dialect is hiveql.
> However, if we type "set spark.sql.dialect;", the result is "sql", which is 
> inconsistent with the actual dialect and is misleading. For example, we can 
> run SQL such as "create table", which is only allowed in hiveql, but this 
> dialect conf shows it's "sql".
> Although this problem will not cause any execution error, it's misleading to 
> Spark SQL users. Therefore I think we should fix it.
> In this PR, instead of overriding def dialect in the conf of HiveContext, I set 
> the SQLConf.DIALECT directly as "hiveql", such that the result of "set 
> spark.sql.dialect;" will be "hiveql", not "sql". After the change, we can 
> still use "sql" as the dialect in HiveContext through "set 
> spark.sql.dialect=sql". Then the conf.dialect in HiveContext will become sql. 
> Because in SQLConf, def dialect = getConf(), and now the dialect in 
> "settings" becomes "sql".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (SPARK-9034) Reflect field names defined in GenericUDTF

2015-11-02 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-9034:

Assignee: Navis

> Reflect field names defined in GenericUDTF
> --
>
> Key: SPARK-9034
> URL: https://issues.apache.org/jira/browse/SPARK-9034
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Takeshi Yamamuro
>Assignee: Navis
> Fix For: 1.6.0
>
>
> Hive GenericUDTF#initialize() defines field names in the returned schema; 
> however, the current HiveGenericUDTF drops these names.
> We might need to reflect these in a logical plan tree.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11275) [SQL] Incorrect results when using rollup/cube

2015-11-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986764#comment-14986764
 ] 

Apache Spark commented on SPARK-11275:
--

User 'aray' has created a pull request for this issue:
https://github.com/apache/spark/pull/9429

> [SQL] Incorrect results when using rollup/cube 
> ---
>
> Key: SPARK-11275
> URL: https://issues.apache.org/jira/browse/SPARK-11275
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0, 1.5.1
>Reporter: Xiao Li
>
> Spark SQL is unable to generate a correct result for the following query 
> using rollup. 
> "select a, b, sum(a + b) as sumAB, GROUPING__ID from mytable group by a, 
> b with rollup"
> Spark SQL generates a wrong result:
> [2,4,6,3]
> [2,null,null,1]
> [1,null,null,1]
> [null,null,null,0]
> [1,2,3,3]
> The table mytable is super simple, containing two rows and two columns:
> testData = Seq((1, 2), (2, 4)).toDF("a", "b")
> After turning off codegen, the query plan is like 
> == Parsed Logical Plan ==
> 'Rollup ['a,'b], 
> [unresolvedalias('a),unresolvedalias('b),unresolvedalias('sum(('a + 'b)) AS 
> sumAB#20),unresolvedalias('GROUPING__ID)]
>  'UnresolvedRelation `mytable`, None
> == Analyzed Logical Plan ==
> a: int, b: int, sumAB: bigint, GROUPING__ID: int
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   Subquery mytable
>Project [_1#0 AS a#2,_2#1 AS b#3]
> LocalRelation [_1#0,_2#1], [[1,2],[2,4]]
> == Optimized Logical Plan ==
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   LocalRelation [a#2,b#3], [[1,2],[2,4]]
> == Physical Plan ==
> Aggregate false, [a#2,b#3,grouping__id#23], [a#2,b#3,sum(PartialSum#24L) AS 
> sumAB#20L,grouping__id#23]
>  Exchange hashpartitioning(a#2,b#3,grouping__id#23,5)
>   Aggregate true, [a#2,b#3,grouping__id#23], 
> [a#2,b#3,grouping__id#23,sum(cast((a#2 + b#3) as bigint)) AS PartialSum#24L]
>Expand [List(null, null, 0),List(a#2, null, 1),List(a#2, b#3, 3)], 
> [a#2,b#3,grouping__id#23]
> LocalTableScan [a#2,b#3], [[1,2],[2,4]]
> Below are my observations:
> 1. Generation of GROUPING__ID looks OK. 
> 2. The problem still exists no matter whether CODEGEN is turned on or off.
> 3. Rollup still works in a simple query when the group-by has only one 
> column. For example, "select b, sum(a), GROUPING__ID from mytable group by b 
> with rollup"
> 4. The test cases in "HiveDataFrameAnalyticsSuite" are misleading. 
> Unfortunately, they hide the bugs. Although those tests passed, they only 
> compare the results of SQL and DataFrame, which cannot capture a 
> regression when both return the same wrong results.
> 5. The same problem also exists in cube. I have not started the investigation 
> in cube, but I believe the root causes should be the same. 
> 6. It looks like all the logical plans are correct.
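
For reference, a minimal reproduction sketch for a spark-shell (assuming sqlContext is a HiveContext, since WITH ROLLUP and GROUPING__ID are HiveQL features); the expected sums in the comment are computed by hand from the two rows, so they only indicate what a correct result should contain:

{code}
// Minimal reproduction sketch; run in spark-shell with a HiveContext as sqlContext.
import sqlContext.implicits._

val testData = Seq((1, 2), (2, 4)).toDF("a", "b")
testData.registerTempTable("mytable")

sqlContext.sql(
  "select a, b, sum(a + b) as sumAB, GROUPING__ID from mytable group by a, b with rollup"
).show()

// Hand-computed sums a correct rollup should contain:
//   (1,2) -> 3, (2,4) -> 6, (1,null) -> 3, (2,null) -> 6, (null,null) -> 9
// whereas the buggy output quoted above reports null sums for the partial groupings.
{code}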



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11399) Include_example should support labels to cut out different parts in one example code

2015-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11399:


Assignee: (was: Apache Spark)

> Include_example should support labels to cut out different parts in one 
> example code
> 
>
> Key: SPARK-11399
> URL: https://issues.apache.org/jira/browse/SPARK-11399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>
> There are many small examples that do not each need a separate example 
> file. Take the MLlib data types page – mllib-data-types.md – as an example: 
> code examples like creating vectors and matrices are trivial. We can 
> merge them into one single vector/matrix creation example and then use labels 
> to distinguish them, such as {% include_example .scala 
> vector_creation %}.
> The "label way" is also useful in the dialog-style code example: 
> http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression.
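
As a rough illustration of the idea, one Scala example file could carry several labelled snippets for include_example to cut out; the marker syntax below is only a guess at what such labels might look like, not the implemented mechanism:

{code}
// Hypothetical labelled-snippet markers inside a single MLlib data types example file.
import org.apache.spark.mllib.linalg.{Matrices, Vectors}

object DataTypesExample {
  def main(args: Array[String]): Unit = {
    // $example on:vector_creation$
    val dense  = Vectors.dense(1.0, 0.0, 3.0)
    val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
    // $example off:vector_creation$

    // $example on:matrix_creation$
    val dm = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
    // $example off:matrix_creation$

    println(s"$dense $sparse $dm")
  }
}
{code}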



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11399) Include_example should support labels to cut out different parts in one example code

2015-11-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986834#comment-14986834
 ] 

Apache Spark commented on SPARK-11399:
--

User 'yinxusen' has created a pull request for this issue:
https://github.com/apache/spark/pull/9430

> Include_example should support labels to cut out different parts in one 
> example code
> 
>
> Key: SPARK-11399
> URL: https://issues.apache.org/jira/browse/SPARK-11399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>
> There are many small examples that do not each need a separate example 
> file. Take the MLlib data types page – mllib-data-types.md – as an example: 
> code examples like creating vectors and matrices are trivial. We can 
> merge them into one single vector/matrix creation example and then use labels 
> to distinguish them, such as {% include_example .scala 
> vector_creation %}.
> The "label way" is also useful in the dialog-style code example: 
> http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11399) Include_example should support labels to cut out different parts in one example code

2015-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11399:


Assignee: Apache Spark

> Include_example should support labels to cut out different parts in one 
> example code
> 
>
> Key: SPARK-11399
> URL: https://issues.apache.org/jira/browse/SPARK-11399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Apache Spark
>
> There are many small examples that do not each need a separate example 
> file. Take the MLlib data types page – mllib-data-types.md – as an example: 
> code examples like creating vectors and matrices are trivial. We can 
> merge them into one single vector/matrix creation example and then use labels 
> to distinguish them, such as {% include_example .scala 
> vector_creation %}.
> The "label way" is also useful in the dialog-style code example: 
> http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11456) Remove deprecated junit.framework in Java tests

2015-11-02 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-11456.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

> Remove deprecated junit.framework in Java tests
> --
>
> Key: SPARK-11456
> URL: https://issues.apache.org/jira/browse/SPARK-11456
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 1.5.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Trivial
> Fix For: 1.6.0
>
>
> Compiling tests generates a lot of warnings due to use of the old 
> {{junit.framework}} classes instead of {{org.junit}}. (And the tests in 
> question could use some touch-ups as well, like correctly putting expected 
> before actual.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11437) createDataFrame shouldn't .take() when provided schema

2015-11-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11437:
--
Assignee: Jason White

> createDataFrame shouldn't .take() when provided schema
> --
>
> Key: SPARK-11437
> URL: https://issues.apache.org/jira/browse/SPARK-11437
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jason White
>Assignee: Jason White
> Fix For: 1.6.0
>
>
> When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls 
> `.take(10)` to verify the first 10 rows of the RDD match the provided schema. 
> Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue 
> affected cases where a schema was not provided.
> Verifying the first 10 rows is of limited utility and causes the DAG to be 
> executed non-lazily. If necessary, I believe this verification should be done 
> lazily on all rows. However, since the caller is providing a schema to 
> follow, I think it's acceptable to simply fail if the schema is incorrect.
> https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7425) spark.ml Predictor should support other numeric types for label

2015-11-02 Thread Stefano Baghino (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985884#comment-14985884
 ] 

Stefano Baghino edited comment on SPARK-7425 at 11/2/15 7:55 PM:
-

I'm checking the code and I'm not sure this solution would work: changing that 
line would break the compatibility with many calls to that function that use 
VectorUDT for the comparison. Do you agree?


was (Author: stefanobaghino):
I'm checking the code and I'm not sure this solution would work: changing that 
line would break the compatibility with many calls to that function that user 
VectorUDT for the comparison. Do you agree?

> spark.ml Predictor should support other numeric types for label
> ---
>
> Key: SPARK-7425
> URL: https://issues.apache.org/jira/browse/SPARK-7425
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>  Labels: starter
>
> Currently, the Predictor abstraction expects the input labelCol type to be 
> DoubleType, but we should support other numeric types.  This will involve 
> updating the PredictorParams.validateAndTransformSchema method.
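
A rough sketch of the kind of relaxation being discussed (the helper name and body below are illustrative, not the actual PredictorParams code): accept any NumericType for the label column instead of requiring DoubleType.

{code}
// Illustrative schema check that accepts any numeric label column (hypothetical helper).
import org.apache.spark.sql.types.{NumericType, StructType}

def validateLabelColumn(schema: StructType, labelCol: String): Unit = {
  val dataType = schema(labelCol).dataType
  require(dataType.isInstanceOf[NumericType],
    s"Label column $labelCol must be of a numeric type but was $dataType.")
}
{code}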



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3474) The env variable SPARK_MASTER_IP does not work

2015-11-02 Thread Konstantin Boudnik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985938#comment-14985938
 ] 

Konstantin Boudnik commented on SPARK-3474:
---

Actually, it seems there's still something to this problem. As of 1.5.1 I found 
that starting master like this:
{code}
. /usr/lib/spark/bin/load-spark-env.sh
/usr/lib/spark/bin/spark-class org.apache.spark.deploy.master.Master
{code}

binds the Master to some interface IP address instead of the hostname. Using 
{{--host}} explicitly solves the issue. Without it, pointing a Worker at 
spark://hostname:port is futile, as the two never associate with each other: the 
Master is bound to the IP address, not to the hostname. 

> The env variable SPARK_MASTER_IP does not work
> --
>
> Key: SPARK-3474
> URL: https://issues.apache.org/jira/browse/SPARK-3474
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.1
>Reporter: Chunjun Xiao
>
> There's some inconsistency regarding the env variable used to specify the 
> Spark master host.
> In the Spark source code (MasterArguments.scala), the env variable is 
> "SPARK_MASTER_HOST", while in the shell scripts (e.g., spark-env.sh, 
> start-master.sh), it's named "SPARK_MASTER_IP".
> This will introduce an issue in some cases, e.g., if the Spark master is started 
> via "service spark-master start", which is built based on the latest Bigtop 
> (refer to bigtop/spark-master.svc).
> In this case, "SPARK_MASTER_IP" will have no effect.
> I suggest we change SPARK_MASTER_IP in the shell script to SPARK_MASTER_HOST.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11275) [SQL] Incorrect results when using rollup/cube

2015-11-02 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-11275:

Summary: [SQL] Incorrect results when using rollup/cube   (was: [SQL] 
Regression in rollup/cube )

> [SQL] Incorrect results when using rollup/cube 
> ---
>
> Key: SPARK-11275
> URL: https://issues.apache.org/jira/browse/SPARK-11275
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0, 1.5.1
>Reporter: Xiao Li
>
> Spark SQL is unable to generate a correct result for the following query 
> using rollup. 
> "select a, b, sum(a + b) as sumAB, GROUPING__ID from mytable group by a, 
> b with rollup"
> Spark SQL generates a wrong result:
> [2,4,6,3]
> [2,null,null,1]
> [1,null,null,1]
> [null,null,null,0]
> [1,2,3,3]
> The table mytable is super simple, containing two rows and two columns:
> testData = Seq((1, 2), (2, 4)).toDF("a", "b")
> After turning off codegen, the query plan is like 
> == Parsed Logical Plan ==
> 'Rollup ['a,'b], 
> [unresolvedalias('a),unresolvedalias('b),unresolvedalias('sum(('a + 'b)) AS 
> sumAB#20),unresolvedalias('GROUPING__ID)]
>  'UnresolvedRelation `mytable`, None
> == Analyzed Logical Plan ==
> a: int, b: int, sumAB: bigint, GROUPING__ID: int
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   Subquery mytable
>Project [_1#0 AS a#2,_2#1 AS b#3]
> LocalRelation [_1#0,_2#1], [[1,2],[2,4]]
> == Optimized Logical Plan ==
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   LocalRelation [a#2,b#3], [[1,2],[2,4]]
> == Physical Plan ==
> Aggregate false, [a#2,b#3,grouping__id#23], [a#2,b#3,sum(PartialSum#24L) AS 
> sumAB#20L,grouping__id#23]
>  Exchange hashpartitioning(a#2,b#3,grouping__id#23,5)
>   Aggregate true, [a#2,b#3,grouping__id#23], 
> [a#2,b#3,grouping__id#23,sum(cast((a#2 + b#3) as bigint)) AS PartialSum#24L]
>Expand [List(null, null, 0),List(a#2, null, 1),List(a#2, b#3, 3)], 
> [a#2,b#3,grouping__id#23]
> LocalTableScan [a#2,b#3], [[1,2],[2,4]]
> Below are my observations:
> 1. Generation of GROUPING__ID looks OK. 
> 2. The problem still exists no matter whether CODEGEN is turned on or off.
> 3. Rollup still works in a simple query when the group-by has only one 
> column. For example, "select b, sum(a), GROUPING__ID from mytable group by b 
> with rollup"
> 4. The test cases in "HiveDataFrameAnalyticsSuite" are misleading. 
> Unfortunately, they hide the bugs. Although those tests passed, they only 
> compare the results of SQL and DataFrame, which cannot capture a 
> regression when both return the same wrong results.
> 5. The same problem also exists in cube. I have not started the investigation 
> in cube, but I believe the root causes should be the same. 
> 6. It looks like all the logical plans are correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11275) [SQL] Incorrect results when using rollup/cube

2015-11-02 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-11275:

Affects Version/s: 1.3.0
   1.4.0

> [SQL] Incorrect results when using rollup/cube 
> ---
>
> Key: SPARK-11275
> URL: https://issues.apache.org/jira/browse/SPARK-11275
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0, 1.5.1
>Reporter: Xiao Li
>
> Spark SQL is unable to generate a correct result for the following query 
> using rollup. 
> "select a, b, sum(a + b) as sumAB, GROUPING__ID from mytable group by a, 
> b with rollup"
> Spark SQL generates a wrong result:
> [2,4,6,3]
> [2,null,null,1]
> [1,null,null,1]
> [null,null,null,0]
> [1,2,3,3]
> The table mytable is super simple, containing two rows and two columns:
> testData = Seq((1, 2), (2, 4)).toDF("a", "b")
> After turning off codegen, the query plan is like 
> == Parsed Logical Plan ==
> 'Rollup ['a,'b], 
> [unresolvedalias('a),unresolvedalias('b),unresolvedalias('sum(('a + 'b)) AS 
> sumAB#20),unresolvedalias('GROUPING__ID)]
>  'UnresolvedRelation `mytable`, None
> == Analyzed Logical Plan ==
> a: int, b: int, sumAB: bigint, GROUPING__ID: int
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   Subquery mytable
>Project [_1#0 AS a#2,_2#1 AS b#3]
> LocalRelation [_1#0,_2#1], [[1,2],[2,4]]
> == Optimized Logical Plan ==
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   LocalRelation [a#2,b#3], [[1,2],[2,4]]
> == Physical Plan ==
> Aggregate false, [a#2,b#3,grouping__id#23], [a#2,b#3,sum(PartialSum#24L) AS 
> sumAB#20L,grouping__id#23]
>  Exchange hashpartitioning(a#2,b#3,grouping__id#23,5)
>   Aggregate true, [a#2,b#3,grouping__id#23], 
> [a#2,b#3,grouping__id#23,sum(cast((a#2 + b#3) as bigint)) AS PartialSum#24L]
>Expand [List(null, null, 0),List(a#2, null, 1),List(a#2, b#3, 3)], 
> [a#2,b#3,grouping__id#23]
> LocalTableScan [a#2,b#3], [[1,2],[2,4]]
> Below are my observations:
> 1. Generation of GROUPING__ID looks OK. 
> 2. The problem still exists no matter whether CODEGEN is turned on or off.
> 3. Rollup still works in a simple query when the group-by has only one 
> column. For example, "select b, sum(a), GROUPING__ID from mytable group by b 
> with rollup"
> 4. The test cases in "HiveDataFrameAnalyticsSuite" are misleading. 
> Unfortunately, they hide the bugs. Although those tests passed, they only 
> compare the results of SQL and DataFrame, which cannot capture a 
> regression when both return the same wrong results.
> 5. The same problem also exists in cube. I have not started the investigation 
> in cube, but I believe the root causes should be the same. 
> 6. It looks like all the logical plans are correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11250) Generate different alias for columns with same name during join

2015-11-02 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985860#comment-14985860
 ] 

Narine Kokhlikyan commented on SPARK-11250:
---

Hi [~davies], [~rxin], [~shivaram]

I have some questions regarding the joins:

1. For creating aliases we would need suffixes. In R, these are an input argument 
of merge. We can of course have default values for the suffixes, but what do you 
think about accepting them as an input argument, similar to R?

2. Let's say that we have the following two dataframes:
scala> df
res49: org.apache.spark.sql.DataFrame = [rating: int, income: double, age: int]

scala> df2
res50: org.apache.spark.sql.DataFrame = [rating: int, income: double, age: int]

If I do joins like this: df.join(df2) or df.join(df2, df("rating") === 
df2("rating"))
the resulting dataframe has the following structure:
res58: org.apache.spark.sql.DataFrame = [rating: int, income: double, age: int, 
rating: int, income: double, age: int]

As a result, we could have something like this: 
org.apache.spark.sql.DataFrame = [rating_x: int, income_x: double, age_x: int, 
rating_y: int, income_y: double, age_y: int]

or just show it the way R does:
org.apache.spark.sql.DataFrame = [rating: int, income: double, age: int]

3. Also R adds the suffixes only for the columns which are not in the join 
expression:
for example: df <- merge(iris,iris, by=c("Species"))
the df has the following structure:

colnames(df)
[1] "Species""Sepal.Length.x" "Sepal.Width.x"  "Petal.Length.x" 
"Petal.Width.x"  "Sepal.Length.y" "Sepal.Width.y" 
[8] "Petal.Length.y" "Petal.Width.y" 

Do you have any preferences ?

Thanks,
Narine
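
One possible shape for the suffix idea in points 1 and 2 above, sketched against the public DataFrame API; the helper below is hypothetical and only illustrates the renaming, not a proposed final signature.

{code}
// Hypothetical helper that renames clashing columns with "_x"/"_y" style suffixes
// before joining, mirroring R's merge(..., suffixes = c(".x", ".y")).
import org.apache.spark.sql.DataFrame

def renameWithSuffixes(left: DataFrame, right: DataFrame,
    suffixes: (String, String) = ("_x", "_y")): (DataFrame, DataFrame) = {
  val common = left.columns.toSet.intersect(right.columns.toSet)
  val newLeft  = common.foldLeft(left)((df, c) => df.withColumnRenamed(c, c + suffixes._1))
  val newRight = common.foldLeft(right)((df, c) => df.withColumnRenamed(c, c + suffixes._2))
  (newLeft, newRight)
}
{code}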

> Generate different alias for columns with same name during join
> ---
>
> Key: SPARK-11250
> URL: https://issues.apache.org/jira/browse/SPARK-11250
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Narine Kokhlikyan
>
> It's confusing to see columns with the same name after joining, and hard to 
> access them; we could generate different aliases for them in the joined DataFrame.
> see https://github.com/apache/spark/pull/9012/files#r42696855 as example



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11275) [SQL] Regression in rollup/cube

2015-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11275:


Assignee: Apache Spark

> [SQL] Regression in rollup/cube 
> 
>
> Key: SPARK-11275
> URL: https://issues.apache.org/jira/browse/SPARK-11275
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Spark SQL is unable to generate a correct result for the following query 
> using rollup. 
> "select a, b, sum(a + b) as sumAB, GROUPING__ID from mytable group by a, 
> b with rollup"
> Spark SQL generates a wrong result:
> [2,4,6,3]
> [2,null,null,1]
> [1,null,null,1]
> [null,null,null,0]
> [1,2,3,3]
> The table mytable is super simple, containing two rows and two columns:
> testData = Seq((1, 2), (2, 4)).toDF("a", "b")
> After turning off codegen, the query plan is like 
> == Parsed Logical Plan ==
> 'Rollup ['a,'b], 
> [unresolvedalias('a),unresolvedalias('b),unresolvedalias('sum(('a + 'b)) AS 
> sumAB#20),unresolvedalias('GROUPING__ID)]
>  'UnresolvedRelation `mytable`, None
> == Analyzed Logical Plan ==
> a: int, b: int, sumAB: bigint, GROUPING__ID: int
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   Subquery mytable
>Project [_1#0 AS a#2,_2#1 AS b#3]
> LocalRelation [_1#0,_2#1], [[1,2],[2,4]]
> == Optimized Logical Plan ==
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   LocalRelation [a#2,b#3], [[1,2],[2,4]]
> == Physical Plan ==
> Aggregate false, [a#2,b#3,grouping__id#23], [a#2,b#3,sum(PartialSum#24L) AS 
> sumAB#20L,grouping__id#23]
>  Exchange hashpartitioning(a#2,b#3,grouping__id#23,5)
>   Aggregate true, [a#2,b#3,grouping__id#23], 
> [a#2,b#3,grouping__id#23,sum(cast((a#2 + b#3) as bigint)) AS PartialSum#24L]
>Expand [List(null, null, 0),List(a#2, null, 1),List(a#2, b#3, 3)], 
> [a#2,b#3,grouping__id#23]
> LocalTableScan [a#2,b#3], [[1,2],[2,4]]
> Below are my observations:
> 1. Generation of GROUPING__ID looks OK. 
> 2. The problem still exists no matter whether CODEGEN is turned on or off.
> 3. Rollup still works in a simple query when the group-by has only one 
> column. For example, "select b, sum(a), GROUPING__ID from mytable group by b 
> with rollup"
> 4. The test cases in "HiveDataFrameAnalyticsSuite" are misleading. 
> Unfortunately, they hide the bugs. Although those tests passed, they only 
> compare the results of SQL and DataFrame, which cannot capture a 
> regression when both return the same wrong results.
> 5. The same problem also exists in cube. I have not started the investigation 
> in cube, but I believe the root causes should be the same. 
> 6. It looks like all the logical plans are correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11275) [SQL] Regression in rollup/cube

2015-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11275:


Assignee: (was: Apache Spark)

> [SQL] Regression in rollup/cube 
> 
>
> Key: SPARK-11275
> URL: https://issues.apache.org/jira/browse/SPARK-11275
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Xiao Li
>
> Spark SQL is unable to generate a correct result for the following query 
> using rollup. 
> "select a, b, sum(a + b) as sumAB, GROUPING__ID from mytable group by a, 
> b with rollup"
> Spark SQL generates a wrong result:
> [2,4,6,3]
> [2,null,null,1]
> [1,null,null,1]
> [null,null,null,0]
> [1,2,3,3]
> The table mytable is super simple, containing two rows and two columns:
> testData = Seq((1, 2), (2, 4)).toDF("a", "b")
> After turning off codegen, the query plan is like 
> == Parsed Logical Plan ==
> 'Rollup ['a,'b], 
> [unresolvedalias('a),unresolvedalias('b),unresolvedalias('sum(('a + 'b)) AS 
> sumAB#20),unresolvedalias('GROUPING__ID)]
>  'UnresolvedRelation `mytable`, None
> == Analyzed Logical Plan ==
> a: int, b: int, sumAB: bigint, GROUPING__ID: int
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   Subquery mytable
>Project [_1#0 AS a#2,_2#1 AS b#3]
> LocalRelation [_1#0,_2#1], [[1,2],[2,4]]
> == Optimized Logical Plan ==
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   LocalRelation [a#2,b#3], [[1,2],[2,4]]
> == Physical Plan ==
> Aggregate false, [a#2,b#3,grouping__id#23], [a#2,b#3,sum(PartialSum#24L) AS 
> sumAB#20L,grouping__id#23]
>  Exchange hashpartitioning(a#2,b#3,grouping__id#23,5)
>   Aggregate true, [a#2,b#3,grouping__id#23], 
> [a#2,b#3,grouping__id#23,sum(cast((a#2 + b#3) as bigint)) AS PartialSum#24L]
>Expand [List(null, null, 0),List(a#2, null, 1),List(a#2, b#3, 3)], 
> [a#2,b#3,grouping__id#23]
> LocalTableScan [a#2,b#3], [[1,2],[2,4]]
> Below are my observations:
> 1. Generation of GROUPING__ID looks OK. 
> 2. The problem still exists no matter whether CODEGEN is turned on or off.
> 3. Rollup still works in a simple query when the group-by has only one 
> column. For example, "select b, sum(a), GROUPING__ID from mytable group by b 
> with rollup"
> 4. The test cases in "HiveDataFrameAnalyticsSuite" are misleading. 
> Unfortunately, they hide the bugs. Although those tests passed, they only 
> compare the results of SQL and DataFrame, which cannot capture a 
> regression when both return the same wrong results.
> 5. The same problem also exists in cube. I have not started the investigation 
> in cube, but I believe the root causes should be the same. 
> 6. It looks like all the logical plans are correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11275) [SQL] Regression in rollup/cube

2015-11-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985935#comment-14985935
 ] 

Apache Spark commented on SPARK-11275:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/9419

> [SQL] Regression in rollup/cube 
> 
>
> Key: SPARK-11275
> URL: https://issues.apache.org/jira/browse/SPARK-11275
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Xiao Li
>
> Spark SQL is unable to generate a correct result for the following query 
> using rollup. 
> "select a, b, sum(a + b) as sumAB, GROUPING__ID from mytable group by a, 
> b with rollup"
> Spark SQL generates a wrong result:
> [2,4,6,3]
> [2,null,null,1]
> [1,null,null,1]
> [null,null,null,0]
> [1,2,3,3]
> The table mytable is super simple, containing two rows and two columns:
> testData = Seq((1, 2), (2, 4)).toDF("a", "b")
> After turning off codegen, the query plan is like 
> == Parsed Logical Plan ==
> 'Rollup ['a,'b], 
> [unresolvedalias('a),unresolvedalias('b),unresolvedalias('sum(('a + 'b)) AS 
> sumAB#20),unresolvedalias('GROUPING__ID)]
>  'UnresolvedRelation `mytable`, None
> == Analyzed Logical Plan ==
> a: int, b: int, sumAB: bigint, GROUPING__ID: int
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   Subquery mytable
>Project [_1#0 AS a#2,_2#1 AS b#3]
> LocalRelation [_1#0,_2#1], [[1,2],[2,4]]
> == Optimized Logical Plan ==
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   LocalRelation [a#2,b#3], [[1,2],[2,4]]
> == Physical Plan ==
> Aggregate false, [a#2,b#3,grouping__id#23], [a#2,b#3,sum(PartialSum#24L) AS 
> sumAB#20L,grouping__id#23]
>  Exchange hashpartitioning(a#2,b#3,grouping__id#23,5)
>   Aggregate true, [a#2,b#3,grouping__id#23], 
> [a#2,b#3,grouping__id#23,sum(cast((a#2 + b#3) as bigint)) AS PartialSum#24L]
>Expand [List(null, null, 0),List(a#2, null, 1),List(a#2, b#3, 3)], 
> [a#2,b#3,grouping__id#23]
> LocalTableScan [a#2,b#3], [[1,2],[2,4]]
> Below are my observations:
> 1. Generation of GROUPING__ID looks OK. 
> 2. The problem still exists no matter whether CODEGEN is turned on or off.
> 3. Rollup still works in a simple query when the group-by has only one 
> column. For example, "select b, sum(a), GROUPING__ID from mytable group by b 
> with rollup"
> 4. The test cases in "HiveDataFrameAnalyticsSuite" are misleading. 
> Unfortunately, they hide the bugs. Although those tests passed, they only 
> compare the results of SQL and DataFrame, which cannot capture a 
> regression when both return the same wrong results.
> 5. The same problem also exists in cube. I have not started the investigation 
> in cube, but I believe the root causes should be the same. 
> 6. It looks like all the logical plans are correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8170) Ctrl-C in pyspark shell doesn't kill running job

2015-11-02 Thread Jacob Wellington (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985949#comment-14985949
 ] 

Jacob Wellington commented on SPARK-8170:
-

I'm running into an issue when trying to connect to Spark from a Django server 
that creates the SparkContext in a new thread. When I comment out line 225 of the 
context.py file from the PR, it works fine. It also works fine in 1.5.1.

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/asoni/.conda/envs/py35_spark_django/lib/python3.5/threading.py", 
line 923, in _bootstrap_inner
self.run()
  File "/home/asoni/.conda/envs/py35_spark_django/lib/python3.5/threading.py", 
line 871, in run
self._target(*self._args, **self._kwargs)
  File "/home/asoni/spark-server/spark_server/spark_jobs/views.py", line 67, in 
process_job
from spark_jobs.job_runner import JobRunner
  File "/home/asoni/spark-server/spark_server/spark_jobs/job_runner.py", line 
2, in <module>
from spark_jobs.extraction_plan_group_runner import *
  File 
"/home/asoni/spark-server/spark_server/spark_jobs/extraction_plan_group_runner.py",
 line 1, in <module>
from spark_jobs.spark_connection import SparkConnection
  File "/home/asoni/spark-server/spark_server/spark_jobs/spark_connection.py", 
line 5, in <module>
class SparkConnection:
  File "/home/asoni/spark-server/spark_server/spark_jobs/spark_connection.py", 
line 7, in SparkConnection
sc = SparkContext(conf=conf)
  File 
"/home/asoni/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/context.py",
 line 114, in __init__
conf, jsc, profiler_cls)
  File 
"/home/asoni/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/context.py",
 line 225, in _do_init
signal.signal(signal.SIGINT, signal_handler)
  File "/home/asoni/.conda/envs/py35_spark_django/lib/python3.5/signal.py", 
line 47, in signal
handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread

> Ctrl-C in pyspark shell doesn't kill running job
> 
>
> Key: SPARK-8170
> URL: https://issues.apache.org/jira/browse/SPARK-8170
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 1.3.1
>Reporter: Ashwin Shankar
>Assignee: Ashwin Shankar
> Fix For: 1.6.0
>
>
> Hitting Ctrl-C in spark-sql (and other tools like Presto) cancels any running 
> job and starts a new input line on the prompt. It would be nice if the pyspark 
> shell could also do that. Otherwise, if a user submits a job, realizes they made 
> a mistake, and wants to cancel it, they need to exit the shell and log in again to 
> continue their work. Re-login can be a pain, especially in Spark on YARN, since 
> it takes a while to allocate the AM container and the initial executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11444) Allow batch seqOp combination in treeAggregate

2015-11-02 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-11444:

Summary: Allow batch seqOp combination in treeAggregate  (was: Allow batch 
seqOp combination in treeReduce)

> Allow batch seqOp combination in treeAggregate
> --
>
> Key: SPARK-11444
> URL: https://issues.apache.org/jira/browse/SPARK-11444
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Spark Core
>Reporter: holdenk
>Priority: Minor
>
> Allow batch seqOp in treeReduce so as to allow better integration with GPU 
> type workloads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11444) Allow batch seqOp combination in treeAggregate

2015-11-02 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985882#comment-14985882
 ] 

holdenk commented on SPARK-11444:
-

Draft design: 
https://docs.google.com/document/d/1i7OQnFj8WY9Ifwt6HWe2M2dUDHwHbqz455fpxAMD2k8/edit?usp=sharing

> Allow batch seqOp combination in treeAggregate
> --
>
> Key: SPARK-11444
> URL: https://issues.apache.org/jira/browse/SPARK-11444
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Spark Core
>Reporter: holdenk
>Priority: Minor
>
> Allow batch seqOp in treeAggregate so as to allow better integration with GPU 
> type workloads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11444) Allow batch seqOp combination in treeAggregate

2015-11-02 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-11444:

Description: Allow batch seqOp in treeAggregate so as to allow better 
integration with GPU type workloads.  (was: Allow batch seqOp in treeReduce so 
as to allow better integration with GPU type workloads.)

> Allow batch seqOp combination in treeAggregate
> --
>
> Key: SPARK-11444
> URL: https://issues.apache.org/jira/browse/SPARK-11444
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Spark Core
>Reporter: holdenk
>Priority: Minor
>
> Allow batch seqOp in treeAggregate so as to allow better integration with GPU 
> type workloads.
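
For context, this is roughly how treeAggregate is driven today with an element-at-a-time seqOp (a plain usage example, not the proposed API); the proposal is about letting seqOp consume a batch of elements so that GPU kernels can be fed larger chunks.

{code}
// Current-style treeAggregate usage in spark-shell: seqOp sees one element at a time.
val data  = sc.parallelize(1L to 1000L, 8)
val total = data.treeAggregate(0L)(
  seqOp  = (acc, x) => acc + x,   // element-at-a-time today; batching this is the proposal
  combOp = (a, b) => a + b,
  depth  = 2)
{code}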



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11439) Optimization of creating sparse feature without dense one

2015-11-02 Thread Nakul Jindal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985890#comment-14985890
 ] 

Nakul Jindal commented on SPARK-11439:
--

I will work on this.

> Optimization of creating sparse feature without dense one
> --
>
> Key: SPARK-11439
> URL: https://issues.apache.org/jira/browse/SPARK-11439
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Kai Sasaki
>Priority: Minor
>
> Currently, the sparse features generated in {{LinearDataGenerator}} are built by 
> creating dense vectors first. It would be more cost efficient to avoid generating 
> the dense vectors when creating sparse features.
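
A small sketch of the suggested direction (illustrative only, not the LinearDataGenerator change itself): build the sparse vector from sampled indices directly, without materialising a dense array first.

{code}
// Illustrative generation of a sparse feature vector without an intermediate dense one.
import scala.util.Random
import org.apache.spark.mllib.linalg.{Vector, Vectors}

def randomSparseFeature(dim: Int, numNonzeros: Int, rng: Random): Vector = {
  val indices = rng.shuffle((0 until dim).toIndexedSeq).take(numNonzeros).sorted.toArray
  val values  = Array.fill(numNonzeros)(rng.nextGaussian())
  Vectors.sparse(dim, indices, values)   // never allocates a dense array of length dim
}
{code}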



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11115) Host verification is not correct for IPv6

2015-11-02 Thread watson xi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985985#comment-14985985
 ] 

watson xi commented on SPARK-5:
---

I started getting some intermittent socket connection errors after commenting out 
that line. A friend's machine had the following IPv6 localhost line, which I 
copied, and all seems to be well now: `fe80::1%lo0 localhost`

> Host verification is not correct for IPv6
> -
>
> Key: SPARK-5
> URL: https://issues.apache.org/jira/browse/SPARK-5
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: CentOS 6.7, Java 1.8.0_25, dual stack IPv4 + IPv6
>Reporter: Thomas Dudziak
>Priority: Critical
>  Labels: starter
>
> When running Spark with -Djava.net.preferIPv6Addresses=true, I get this error:
> 15/10/14 14:36:01 ERROR SparkContext: Error initializing SparkContext.
> java.lang.AssertionError: assertion failed: Expected hostname
>   at scala.Predef$.assert(Predef.scala:179)
>   at org.apache.spark.util.Utils$.checkHost(Utils.scala:805)
>   at 
> org.apache.spark.storage.BlockManagerId.<init>(BlockManagerId.scala:48)
>   at 
> org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:107)
>   at 
> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:190)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:528)
>   at 
> org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
> Looking at the code in question, it seems that the code will only work for 
> IPv4 as it assumes ':' can't be part of the hostname (which it clearly can 
> for IPv6 addresses).
> Instead, the code should probably use Guava's HostAndPort class, i.e.:
>   def checkHost(host: String, message: String = "") {
> assert(!HostAndPort.fromString(host).hasPort, message)
>   }
>   def checkHostPort(hostPort: String, message: String = "") {
> assert(HostAndPort.fromString(hostPort).hasPort, message)
>   }
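
To make the suggestion concrete, this is how Guava's HostAndPort behaves for IPv6 literals (assuming Guava is on the classpath; the checkHost/checkHostPort variants above then reduce to the hasPort test):

{code}
// Guava HostAndPort handling of bracketed IPv6 literals, shown as a small Scala snippet.
import com.google.common.net.HostAndPort

val hostOnly = HostAndPort.fromString("[fe80::1]")       // bracketed IPv6, no port
val hostPort = HostAndPort.fromString("[fe80::1]:7077")  // bracketed IPv6 with port

println(hostOnly.hasPort)   // false -> would satisfy checkHost
println(hostPort.hasPort)   // true  -> would satisfy checkHostPort
{code}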



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7425) spark.ml Predictor should support other numeric types for label

2015-11-02 Thread Stefano Baghino (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985884#comment-14985884
 ] 

Stefano Baghino commented on SPARK-7425:


I'm checking the code and I'm not sure this solution would work: changing that 
line would break the compatibility with many calls to that function that user 
VectorUDT for the comparison. Do you agree?

> spark.ml Predictor should support other numeric types for label
> ---
>
> Key: SPARK-7425
> URL: https://issues.apache.org/jira/browse/SPARK-7425
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>  Labels: starter
>
> Currently, the Predictor abstraction expects the input labelCol type to be 
> DoubleType, but we should support other numeric types.  This will involve 
> updating the PredictorParams.validateAndTransformSchema method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11115) Host verification is not correct for IPv6

2015-11-02 Thread watson xi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985985#comment-14985985
 ] 

watson xi edited comment on SPARK-5 at 11/2/15 8:39 PM:


I started getting some intermittent socket connection errors after commenting out 
that line. A friend's machine had the following IPv6 localhost line, which I 
copied, and all seems to be well now: {{fe80::1%lo0 localhost}}


was (Author: watsonix):
I started getting some intermittent socket connection errors after commenting out 
that line. A friend's machine had the following IPv6 localhost line, which I 
copied, and all seems to be well now: `fe80::1%lo0 localhost`

> Host verification is not correct for IPv6
> -
>
> Key: SPARK-5
> URL: https://issues.apache.org/jira/browse/SPARK-5
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: CentOS 6.7, Java 1.8.0_25, dual stack IPv4 + IPv6
>Reporter: Thomas Dudziak
>Priority: Critical
>  Labels: starter
>
> When running Spark with -Djava.net.preferIPv6Addresses=true, I get this error:
> 15/10/14 14:36:01 ERROR SparkContext: Error initializing SparkContext.
> java.lang.AssertionError: assertion failed: Expected hostname
>   at scala.Predef$.assert(Predef.scala:179)
>   at org.apache.spark.util.Utils$.checkHost(Utils.scala:805)
>   at 
> org.apache.spark.storage.BlockManagerId.<init>(BlockManagerId.scala:48)
>   at 
> org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:107)
>   at 
> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:190)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:528)
>   at 
> org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
> Looking at the code in question, it seems that the code will only work for 
> IPv4 as it assumes ':' can't be part of the hostname (which it clearly can 
> for IPv6 addresses).
> Instead, the code should probably use Guava's HostAndPort class, i.e.:
>   def checkHost(host: String, message: String = "") {
> assert(!HostAndPort.fromString(host).hasPort, message)
>   }
>   def checkHostPort(hostPort: String, message: String = "") {
> assert(HostAndPort.fromString(hostPort).hasPort, message)
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9034) Reflect field names defined in GenericUDTF

2015-11-02 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-9034.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8456
[https://github.com/apache/spark/pull/8456]

> Reflect field names defined in GenericUDTF
> --
>
> Key: SPARK-9034
> URL: https://issues.apache.org/jira/browse/SPARK-9034
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Takeshi Yamamuro
> Fix For: 1.6.0
>
>
> Hive GenericUDTF#initialize() defines field names in the returned schema; 
> however, the current HiveGenericUDTF drops these names.
> We might need to reflect these in a logical plan tree.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11469) Initial implementation

2015-11-02 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-11469.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9393
[https://github.com/apache/spark/pull/9393]

> Initial implementation
> --
>
> Key: SPARK-11469
> URL: https://issues.apache.org/jira/browse/SPARK-11469
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11448) We should skip caching part-files in ParquetRelation when configured to merge schema and respect summaries

2015-11-02 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-11448:
---

 Summary: We should skip caching part-files in ParquetRelation when 
configured to merge schema and respect summaries
 Key: SPARK-11448
 URL: https://issues.apache.org/jira/browse/SPARK-11448
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11448) We should skip caching part-files in ParquetRelation when configured to merge schema and respect summaries

2015-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11448:


Assignee: (was: Apache Spark)

> We should skip caching part-files in ParquetRelation when configured to merge 
> schema and respect summaries
> --
>
> Key: SPARK-11448
> URL: https://issues.apache.org/jira/browse/SPARK-11448
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> We now cache part-files, metadata, and common metadata in ParquetRelation as 
> currentLeafStatuses. However, when configured to merge schema and respect 
> summaries, dataStatuses (the `FileStatus` objects of all part-files) are no 
> longer necessary. We should skip them when caching on the driver side.
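As a rough illustration of the proposal (not the actual ParquetRelation code; the flag names below are placeholders for the schema-merging configuration):

{code}
import org.apache.hadoop.fs.FileStatus

// Hypothetical helper: when schema merging is enabled and summaries are
// respected, only the _metadata / _common_metadata files matter, so the
// plain part-file statuses can be dropped from the driver-side cache.
def statusesToCache(
    leafStatuses: Seq[FileStatus],
    shouldMergeSchemas: Boolean,
    respectSummaries: Boolean): Seq[FileStatus] = {
  def isSummaryFile(f: FileStatus): Boolean = {
    val name = f.getPath.getName
    name == "_metadata" || name == "_common_metadata"
  }
  if (shouldMergeSchemas && respectSummaries) leafStatuses.filter(isSummaryFile)
  else leafStatuses
}
{code}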



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11448) We should skip caching part-files in ParquetRelation when configured to merge schema and respect summaries

2015-11-02 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-11448:

Description: We now cache part-files, metadata, and common metadata in 
ParquetRelation as currentLeafStatuses. However, when configured to merge 
schema and respect summaries, dataStatuses (the `FileStatus` objects of all 
part-files) are no longer necessary. We should skip them when caching on the 
driver side.  (was: We now cache part-files, metadata, common metadata in )

> We should skip caching part-files in ParquetRelation when configured to merge 
> schema and respect summaries
> --
>
> Key: SPARK-11448
> URL: https://issues.apache.org/jira/browse/SPARK-11448
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> We now cache part-files, metadata, and common metadata in ParquetRelation as 
> currentLeafStatuses. However, when configured to merge schema and respect 
> summaries, dataStatuses (the `FileStatus` objects of all part-files) are no 
> longer necessary. We should skip them when caching on the driver side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11448) We should skip caching part-files in ParquetRelation when configured to merge schema and respect summaries

2015-11-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984878#comment-14984878
 ] 

Apache Spark commented on SPARK-11448:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/9405

> We should skip caching part-files in ParquetRelation when configured to merge 
> schema and respect summaries
> --
>
> Key: SPARK-11448
> URL: https://issues.apache.org/jira/browse/SPARK-11448
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> We now cache part-files, metadata, and common metadata in ParquetRelation as 
> currentLeafStatuses. However, when configured to merge schema and respect 
> summaries, dataStatuses (the `FileStatus` objects of all part-files) are no 
> longer necessary. We should skip them when caching on the driver side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service

2015-11-02 Thread Study Hsueh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984879#comment-14984879
 ] 

Study Hsueh commented on SPARK-11191:
-

OK, it looks like `FunctionRegistry.getFunctionInfo` does not contain information 
about permanent and temporary functions.

The SELECT clause looks up functions in this order:
1. underlying.lookupFunction (built-in FunctionRegistry)
2. FunctionRegistry.getFunctionInfo
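A generic sketch of that fallback (illustrative only, not the actual HiveFunctionRegistry code; the parameter names are placeholders):

{code}
// Try the built-in registry first, then fall back to the secondary source
// (e.g. Hive's FunctionRegistry.getFunctionInfo).
def resolveFunction[T](
    name: String,
    builtIn: String => Option[T],
    hiveFallback: String => Option[T]): Option[T] =
  builtIn(name).orElse(hiveFallback(name))
{code}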

> [1.5] Can't create UDF's using hive thrift service
> --
>
> Key: SPARK-11191
> URL: https://issues.apache.org/jira/browse/SPARK-11191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: David Ross
>Priority: Blocker
>
> Since upgrading to Spark 1.5 we've been unable to create and use UDFs when 
> we run in thrift server mode.
> Our setup:
> We start the thrift server running against YARN in client mode (we've also 
> built our own Spark from the GitHub branch-1.5 with the following args: 
> {{-Pyarn -Phive -Phive-thriftserver}}).
> If I run the following after connecting via JDBC (in this case via beeline):
> {{add jar "hdfs://path/to/jar"}}
> (this command succeeds with no errors)
> {{CREATE TEMPORARY FUNCTION testUDF AS 'com.foo.class.UDF';}}
> (this command succeeds with no errors)
> {{select testUDF(col1) from table1;}}
> I get the following error in the logs:
> {code}
> org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 
> pos 8
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:57)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:53)
> at scala.util.Try.getOrElse(Try.scala:77)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:53)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
> at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
> {code}
> (cutting the bulk for ease of report, more than happy to send the full output)
> {code}
> 15/10/12 14:34:37 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 
> pos 100
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:259)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> 

[jira] [Commented] (SPARK-11275) [SQL] Regression in rollup/cube

2015-11-02 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984881#comment-14984881
 ] 

Xiao Li commented on SPARK-11275:
-

Agreed. It becomes more complex when you need to resolve both cases at the same 
time. Once people understand the problem, the fix looks very straightforward.

> [SQL] Regression in rollup/cube 
> 
>
> Key: SPARK-11275
> URL: https://issues.apache.org/jira/browse/SPARK-11275
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Xiao Li
>
> Spark SQL is unable to generate a correct result for the following query 
> using rollup.
> "select a, b, sum(a + b) as sumAB, GROUPING__ID from mytable group by a, 
> b with rollup"
> Spark SQL generates a wrong result:
> [2,4,6,3]
> [2,null,null,1]
> [1,null,null,1]
> [null,null,null,0]
> [1,2,3,3]
> The table mytable is super simple, containing two rows and two columns:
> testData = Seq((1, 2), (2, 4)).toDF("a", "b")
> After turning off codegen, the query plan looks like this:
> == Parsed Logical Plan ==
> 'Rollup ['a,'b], 
> [unresolvedalias('a),unresolvedalias('b),unresolvedalias('sum(('a + 'b)) AS 
> sumAB#20),unresolvedalias('GROUPING__ID)]
>  'UnresolvedRelation `mytable`, None
> == Analyzed Logical Plan ==
> a: int, b: int, sumAB: bigint, GROUPING__ID: int
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   Subquery mytable
>Project [_1#0 AS a#2,_2#1 AS b#3]
> LocalRelation [_1#0,_2#1], [[1,2],[2,4]]
> == Optimized Logical Plan ==
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   LocalRelation [a#2,b#3], [[1,2],[2,4]]
> == Physical Plan ==
> Aggregate false, [a#2,b#3,grouping__id#23], [a#2,b#3,sum(PartialSum#24L) AS 
> sumAB#20L,grouping__id#23]
>  Exchange hashpartitioning(a#2,b#3,grouping__id#23,5)
>   Aggregate true, [a#2,b#3,grouping__id#23], 
> [a#2,b#3,grouping__id#23,sum(cast((a#2 + b#3) as bigint)) AS PartialSum#24L]
>Expand [List(null, null, 0),List(a#2, null, 1),List(a#2, b#3, 3)], 
> [a#2,b#3,grouping__id#23]
> LocalTableScan [a#2,b#3], [[1,2],[2,4]]
> Below are my observations:
> 1. Generation of GROUPING__ID looks OK.
> 2. The problem still exists regardless of whether codegen is turned on or off.
> 3. Rollup still works in a simple query when the group-by clause has only one 
> column, for example "select b, sum(a), GROUPING__ID from mytable group by b 
> with rollup".
> 4. The buckets in "HiveDataFrameAnalyticsSuite" are misleading; unfortunately, 
> they hide the bug. Although the buckets pass, they only compare the results of 
> SQL and DataFrame, which cannot catch the regression when both return the same 
> wrong results.
> 5. The same problem also exists in cube. I have not started investigating 
> cube, but I believe the root cause is the same.
> 6. It looks like all the logical plans are correct.
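The report can be reproduced in a few lines in a 1.5.x shell (a sketch based on the table and query quoted above; {{registerTempTable}} is the 1.5-era API):

{code}
import sqlContext.implicits._

val testData = Seq((1, 2), (2, 4)).toDF("a", "b")
testData.registerTempTable("mytable")

// Expected: one row per (a, b) pair, a subtotal per a, and a grand total.
sqlContext.sql(
  """select a, b, sum(a + b) as sumAB, GROUPING__ID
    |from mytable group by a, b with rollup""".stripMargin).show()
{code}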



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3789) [GRAPHX] Python bindings for GraphX

2015-11-02 Thread Ricardo Almeida (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984880#comment-14984880
 ] 

Ricardo Almeida commented on SPARK-3789:


Any update on having GraphX on Python?

> [GRAPHX] Python bindings for GraphX
> ---
>
> Key: SPARK-3789
> URL: https://issues.apache.org/jira/browse/SPARK-3789
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, PySpark
>Reporter: Ameet Talwalkar
>Assignee: Kushal Datta
> Attachments: PyGraphX_design_doc.pdf
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11275) [SQL] Regression in rollup/cube

2015-11-02 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984863#comment-14984863
 ] 

Herman van Hovell commented on SPARK-11275:
---

This is caused by the fact that the logical Expand operator does not 
distinguish between expressions used for grouping and expressions used for aggregation.

Looking forward to the PR.

> [SQL] Regression in rollup/cube 
> 
>
> Key: SPARK-11275
> URL: https://issues.apache.org/jira/browse/SPARK-11275
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Xiao Li
>
> Spark SQL is unable to generate a correct result for the following query 
> using rollup.
> "select a, b, sum(a + b) as sumAB, GROUPING__ID from mytable group by a, 
> b with rollup"
> Spark SQL generates a wrong result:
> [2,4,6,3]
> [2,null,null,1]
> [1,null,null,1]
> [null,null,null,0]
> [1,2,3,3]
> The table mytable is super simple, containing two rows and two columns:
> testData = Seq((1, 2), (2, 4)).toDF("a", "b")
> After turning off codegen, the query plan looks like this:
> == Parsed Logical Plan ==
> 'Rollup ['a,'b], 
> [unresolvedalias('a),unresolvedalias('b),unresolvedalias('sum(('a + 'b)) AS 
> sumAB#20),unresolvedalias('GROUPING__ID)]
>  'UnresolvedRelation `mytable`, None
> == Analyzed Logical Plan ==
> a: int, b: int, sumAB: bigint, GROUPING__ID: int
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   Subquery mytable
>Project [_1#0 AS a#2,_2#1 AS b#3]
> LocalRelation [_1#0,_2#1], [[1,2],[2,4]]
> == Optimized Logical Plan ==
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   LocalRelation [a#2,b#3], [[1,2],[2,4]]
> == Physical Plan ==
> Aggregate false, [a#2,b#3,grouping__id#23], [a#2,b#3,sum(PartialSum#24L) AS 
> sumAB#20L,grouping__id#23]
>  Exchange hashpartitioning(a#2,b#3,grouping__id#23,5)
>   Aggregate true, [a#2,b#3,grouping__id#23], 
> [a#2,b#3,grouping__id#23,sum(cast((a#2 + b#3) as bigint)) AS PartialSum#24L]
>Expand [List(null, null, 0),List(a#2, null, 1),List(a#2, b#3, 3)], 
> [a#2,b#3,grouping__id#23]
> LocalTableScan [a#2,b#3], [[1,2],[2,4]]
> Below are my observations:
> 1. Generation of GROUPING__ID looks OK.
> 2. The problem still exists regardless of whether codegen is turned on or off.
> 3. Rollup still works in a simple query when the group-by clause has only one 
> column, for example "select b, sum(a), GROUPING__ID from mytable group by b 
> with rollup".
> 4. The buckets in "HiveDataFrameAnalyticsSuite" are misleading; unfortunately, 
> they hide the bug. Although the buckets pass, they only compare the results of 
> SQL and DataFrame, which cannot catch the regression when both return the same 
> wrong results.
> 5. The same problem also exists in cube. I have not started investigating 
> cube, but I believe the root cause is the same.
> 6. It looks like all the logical plans are correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11448) We should skip caching part-files in ParquetRelation when configured to merge schema and respect summaries

2015-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11448:


Assignee: Apache Spark

> We should skip caching part-files in ParquetRelation when configured to merge 
> schema and respect summaries
> --
>
> Key: SPARK-11448
> URL: https://issues.apache.org/jira/browse/SPARK-11448
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> We now cache part-files, metadata, and common metadata in ParquetRelation as 
> currentLeafStatuses. However, when configured to merge schema and respect 
> summaries, dataStatuses (the `FileStatus` objects of all part-files) are no 
> longer necessary. We should skip them when caching on the driver side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11449) PortableDataStream should be a factory

2015-11-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985419#comment-14985419
 ] 

Apache Spark commented on SPARK-11449:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/9417

> PortableDataStream should be a factory
> --
>
> Key: SPARK-11449
> URL: https://issues.apache.org/jira/browse/SPARK-11449
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Herman van Hovell
>Priority: Minor
>
> {{PortableDataStream}}'s close behavior caught me by surprise the other day. 
> I assumed incorrectly that closing the inputstream it provides would also 
> close the {{PortableDataStream}}. This leads to quite a confusing situation 
> when you try to reuse the {{PortableDataStream}}: the state of the 
> {{PortableDataStream}} indicates that it is open, whereas the underlying 
> inputstream is actually closed.
> I'd like either to improve the documentation, or add an {{InputStream}} 
> wrapper that closes the {{PortableDataStream}} when you close the 
> {{InputStream}}. Any thoughts?
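A minimal sketch of the wrapper option (an assumption about the design, not an agreed-upon API):

{code}
import java.io.{FilterInputStream, InputStream}
import org.apache.spark.input.PortableDataStream

// Hypothetical wrapper: closing the returned InputStream also closes the
// PortableDataStream that produced it.
class ClosingPortableStream(pds: PortableDataStream, in: InputStream)
  extends FilterInputStream(in) {
  override def close(): Unit = {
    try super.close()
    finally pds.close()
  }
}

// Usage sketch: new ClosingPortableStream(pds, pds.open()) -- closing the
// wrapper closes both the underlying stream and the PortableDataStream.
{code}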



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11371) Make "mean" an alias for "avg" operator

2015-11-02 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985438#comment-14985438
 ] 

Ted Yu commented on SPARK-11371:


[~rxin] [~yhuai]:
Your comments are welcome.

> Make "mean" an alias for "avg" operator
> ---
>
> Key: SPARK-11371
> URL: https://issues.apache.org/jira/browse/SPARK-11371
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Ted Yu
>Priority: Minor
> Attachments: spark-11371-v1.patch
>
>
> From Reynold in the thread 'Exception when using some aggregate operators'  
> (http://search-hadoop.com/m/q3RTt0xFr22nXB4/):
> I don't think these are bugs. The SQL standard for average is "avg", not 
> "mean". Similarly, a distinct count is supposed to be written as 
> "count(distinct col)", not "countDistinct(col)".
> We can, however, make "mean" an alias for "avg" to improve compatibility 
> between DataFrame and SQL.
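For context, a small example of the gap being closed (a sketch for a 1.5/1.6-era shell; {{registerTempTable}} and the shell's {{sqlContext}} are assumed):

{code}
import sqlContext.implicits._
import org.apache.spark.sql.functions.{avg, mean}

val df = Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)).toDF("k", "v")

// The DataFrame API already accepts both spellings:
df.groupBy("k").agg(avg("v"), mean("v")).show()

// On the SQL side, "avg" is the standard name; this issue makes "mean"
// resolve to the same aggregate.
df.registerTempTable("t")
sqlContext.sql("select k, avg(v) from t group by k").show()
{code}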



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11460) Locality waits should be based on task set creation time, not last launch time

2015-11-02 Thread Shengyue Ji (JIRA)
Shengyue Ji created SPARK-11460:
---

 Summary: Locality waits should be based on task set creation time, 
not last launch time
 Key: SPARK-11460
 URL: https://issues.apache.org/jira/browse/SPARK-11460
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.5.1, 1.5.0, 1.4.1, 1.4.0, 1.3.1, 1.3.0, 1.2.2, 1.2.1, 
1.2.0, 1.1.1, 1.1.0, 1.0.2, 1.0.1, 1.0.0
 Environment: YARN
Reporter: Shengyue Ji


Spark waits for the spark.locality.wait period before going from RACK_LOCAL to 
ANY when selecting an executor for assignment. The timeout is essentially reset 
each time a new assignment is made.

We were running Spark Streaming on Kafka with a 10-second batch window on 32 
Kafka partitions with 16 executors. All executors were in the ANY group. At one 
point one RACK_LOCAL executor was added and all tasks were assigned to it. Each 
task took about 0.6 seconds to process, resetting the spark.locality.wait 
timeout (3000ms) repeatedly. This caused the whole process to underutilize 
resources and created an increasing backlog.

spark.locality.wait should be based on the task set creation time, not the last 
launch time, so that 3000ms after creation all executors can get tasks assigned 
to them.

We are specifying a zero timeout for now as a workaround to disable locality 
optimization. 

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L556
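For reference, the workaround mentioned above amounts to the following configuration (a sketch; the application name is a placeholder):

{code}
import org.apache.spark.SparkConf

// Disable delay scheduling so tasks fall through to ANY immediately.
// This trades locality for utilization until the wait is keyed off the
// task set creation time.
val conf = new SparkConf()
  .setAppName("kafka-streaming-job")   // placeholder name
  .set("spark.locality.wait", "0")
{code}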



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11462) Add JavaStreamingListener

2015-11-02 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-11462:


 Summary: Add JavaStreamingListener
 Key: SPARK-11462
 URL: https://issues.apache.org/jira/browse/SPARK-11462
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Shixiong Zhu


Add a Java-friendly API for StreamingListener



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4557) Spark Streaming' foreachRDD method should accept a VoidFunction<...>, not a Function<..., Void>

2015-11-02 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986101#comment-14986101
 ] 

Bryan Cutler commented on SPARK-4557:
-

Hi [~somi...@us.ibm.com],  the right way to make a pull request is described 
[here|https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-PullRequest].
  I'd be happy to do this unless you plan on completing it?

> Spark Streaming' foreachRDD method should accept a VoidFunction<...>, not a 
> Function<..., Void>
> ---
>
> Key: SPARK-4557
> URL: https://issues.apache.org/jira/browse/SPARK-4557
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: Alexis Seigneurin
>Priority: Minor
>  Labels: starter
>
> In *Java*, using Spark Streaming's foreachRDD function is quite verbose. You 
> have to write:
> {code:java}
> .foreachRDD(items -> {
> ...;
> return null;
> });
> {code}
> Instead of:
> {code:java}
> .foreachRDD(items -> ...);
> {code}
> This is because the foreachRDD method accepts a Function<..., Void> 
> instead of a VoidFunction<...>. It would make sense to change it 
> to a VoidFunction since, in Spark's API, the foreach method already accepts a 
> VoidFunction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-11337) Make example code in user guide testable

2015-11-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11337:
--
Comment: was deleted

(was: I totally forgot I already made one ...)

> Make example code in user guide testable
> 
>
> Key: SPARK-11337
> URL: https://issues.apache.org/jira/browse/SPARK-11337
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>Priority: Critical
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to automatically test them. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> One option I propose is to move actual example code to spark/examples and 
> test compilation in Jenkins builds. Then in the markdown, we can reference 
> part of the code to show in the user guide. This requires adding a Jekyll tag 
> that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}
> {% include_example scala ml.KMeansExample guide %}
> {code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick code blocks marked "example" and put them under `{% highlight %}` in 
> the markdown. We can discuss the syntax for marker comments.
> Sub-tasks are created to move example code from user guide to `examples/`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-11337) Make example code in user guide testable

2015-11-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11337:
--
Comment: was deleted

(was: I totally forgot I already made one ...)

> Make example code in user guide testable
> 
>
> Key: SPARK-11337
> URL: https://issues.apache.org/jira/browse/SPARK-11337
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>Priority: Critical
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to automatically test them. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> One option I propose is to move actual example code to spark/examples and 
> test compilation in Jenkins builds. Then in the markdown, we can reference 
> part of the code to show in the user guide. This requires adding a Jekyll tag 
> that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}
> {% include_example scala ml.KMeansExample guide %}
> {code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick code blocks marked "example" and put them under `{% highlight %}` in 
> the markdown. We can discuss the syntax for marker comments.
> Sub-tasks are created to move example code from user guide to `examples/`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-11337) Make example code in user guide testable

2015-11-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11337:
--
Comment: was deleted

(was: I totally forgot I already made one ...)

> Make example code in user guide testable
> 
>
> Key: SPARK-11337
> URL: https://issues.apache.org/jira/browse/SPARK-11337
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>Priority: Critical
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to automatically test them. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> One option I propose is to move actual example code to spark/examples and 
> test compilation in Jenkins builds. Then in the markdown, we can reference 
> part of the code to show in the user guide. This requires adding a Jekyll tag 
> that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}
> {% include_example scala ml.KMeansExample guide %}
> {code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick code blocks marked "example" and put them under `{% highlight %}` in 
> the markdown. We can discuss the syntax for marker comments.
> Sub-tasks are created to move example code from user guide to `examples/`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11336) Include a link to the source file in generated example code

2015-11-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11336:
--
Shepherd: Xiangrui Meng

> Include a link to the source file in generated example code
> ---
>
> Key: SPARK-11336
> URL: https://issues.apache.org/jira/browse/SPARK-11336
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>
> It would be nice to include a link to the example source file at the bottom 
> of each code example, so that users who want to try it know where to find it. 
> The font size should be small and unobtrusive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11359) Kinesis receiver does not checkpoint to DynamoDB if there is no new data.

2015-11-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986137#comment-14986137
 ] 

Apache Spark commented on SPARK-11359:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/9421

> Kinesis receiver does not checkpoint to DynamoDB if there is no new data. 
> --
>
> Key: SPARK-11359
> URL: https://issues.apache.org/jira/browse/SPARK-11359
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Tathagata Das
>Assignee: Burak Yavuz
>
> The current implementation of KinesisRecordProcessor checkpoints to DynamoDB 
> only when a new record is received. So if there is no new data for a while, 
> the latest received sequence numbers are not checkpointed. This is not 
> intuitive behavior and should be fixed using a timer task.
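A rough sketch of the timer-based approach (generic scheduling code, not the actual KinesisRecordProcessor change; {{doCheckpoint}} stands in for the KCL checkpointer call):

{code}
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}

// Checkpoint on a fixed schedule so the latest sequence number is persisted
// even when no new records arrive.
def startCheckpointTimer(intervalMillis: Long)(doCheckpoint: () => Unit): ScheduledExecutorService = {
  val scheduler = Executors.newSingleThreadScheduledExecutor()
  scheduler.scheduleAtFixedRate(
    new Runnable { override def run(): Unit = doCheckpoint() },
    intervalMillis, intervalMillis, TimeUnit.MILLISECONDS)
  scheduler
}
{code}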



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11383) Replace example code in mllib-naive-bayes.md/mllib-isotonic-regression.md using include_example

2015-11-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11383.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9353
[https://github.com/apache/spark/pull/9353]

> Replace example code in mllib-naive-bayes.md/mllib-isotonic-regression.md 
> using include_example
> ---
>
> Key: SPARK-11383
> URL: https://issues.apache.org/jira/browse/SPARK-11383
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>  Labels: starter
> Fix For: 1.6.0
>
>
> This is similar to SPARK-11289 but for the example code in 
> mllib-frequent-pattern-mining.md.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11262) Unit test for gradient, loss layers, memory management for multilayer perceptron

2015-11-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11262:
--
Target Version/s: 1.6.0

> Unit test for gradient, loss layers, memory management for multilayer 
> perceptron
> 
>
> Key: SPARK-11262
> URL: https://issues.apache.org/jira/browse/SPARK-11262
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.1
>Reporter: Alexander Ulanov
>Assignee: Alexander Ulanov
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Multi-layer perceptron requires more rigorous tests and refactoring of layer 
> interfaces to accommodate development of new features.
> 1) Implement unit tests for gradient and loss
> 2) Refactor the internal layer interface to extract the "loss function" 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11459) Allow configuring checkpoint dir, filenames

2015-11-02 Thread Ryan Williams (JIRA)
Ryan Williams created SPARK-11459:
-

 Summary: Allow configuring checkpoint dir, filenames
 Key: SPARK-11459
 URL: https://issues.apache.org/jira/browse/SPARK-11459
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.5.1
Reporter: Ryan Williams


I frequently want to persist some RDDs to disk and choose the names of the 
files that they are saved as.

Currently, the {{RDD.checkpoint}} flow [writes to a directory with a UUID in 
its 
name|https://github.com/apache/spark/blob/v1.5.1/core/src/main/scala/org/apache/spark/SparkContext.scala#L2050],
 and the file is [always named after the RDD's 
ID|https://github.com/apache/spark/blob/v1.5.1/core/src/main/scala/org/apache/spark/rdd/ReliableRDDCheckpointData.scala#L96].

Is there any reason not to allow the user to e.g. pass a string to 
{{RDD.checkpoint}} that will set the location that the RDD is checkpointed to?
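For reference, the current API only lets the caller choose the checkpoint root; the final path is derived from a UUID and the RDD id (the string-argument variant suggested above does not exist today and is only hypothetical):

{code}
// What is possible today: pick the parent directory, not the file names.
sc.setCheckpointDir("hdfs:///user/me/checkpoints")   // path is a placeholder
val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.checkpoint()   // written under <checkpoint root>/<uuid>/rdd-<id>
rdd.count()        // an action materializes the checkpoint

// The proposal would look something like rdd.checkpoint("my-name") -- hypothetical.
{code}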



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9836) Provide R-like summary statistics for ordinary least squares via normal equation solver

2015-11-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-9836:
-
Shepherd: Xiangrui Meng
Priority: Critical  (was: Major)

> Provide R-like summary statistics for ordinary least squares via normal 
> equation solver
> ---
>
> Key: SPARK-9836
> URL: https://issues.apache.org/jira/browse/SPARK-9836
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>Priority: Critical
>
> In R, model fitting comes with summary statistics. We can provide most of 
> those via the normal equation solver (SPARK-9834). If some statistics require 
> additional passes over the dataset, we can expose an option to let users 
> select the desired statistics before model fitting. 
> {code}
> > summary(model)
> Call:
> glm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
> Deviance Residuals: 
>      Min        1Q    Median        3Q       Max  
> -1.30711  -0.25713  -0.05325   0.19542   1.41253  
> Coefficients:
>   Estimate Std. Error t value Pr(>|t|)
> (Intercept) 2.2514 0.3698   6.089 9.57e-09 ***
> Sepal.Width 0.8036 0.1063   7.557 4.19e-12 ***
> Speciesversicolor   1.4587 0.1121  13.012  < 2e-16 ***
> Speciesvirginica    1.9468     0.1000  19.465  < 2e-16 ***
> ---
> Signif. codes:  
> 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> (Dispersion parameter for gaussian family taken to be 0.1918059)
> Null deviance: 102.168  on 149  degrees of freedom
> Residual deviance:  28.004  on 146  degrees of freedom
> AIC: 183.94
> Number of Fisher Scoring iterations: 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11463) SparkContext fail to create in non-main thread in Python

2015-11-02 Thread Davies Liu (JIRA)
Davies Liu created SPARK-11463:
--

 Summary: SparkContext fail to create in non-main thread in Python
 Key: SPARK-11463
 URL: https://issues.apache.org/jira/browse/SPARK-11463
 Project: Spark
  Issue Type: Bug
Reporter: Davies Liu
Priority: Blocker


https://github.com/apache/spark/commit/2e572c4135c3f5ad3061c1f58cdb8a70bed0a9d3#commitcomment-14137386

{code}
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/asoni/.conda/envs/py35_spark_django/lib/python3.5/threading.py", 
line 923, in _bootstrap_inner
self.run()
  File "/home/asoni/.conda/envs/py35_spark_django/lib/python3.5/threading.py", 
line 871, in run
self._target(*self._args, **self._kwargs)
  File "/home/asoni/spark-server/spark_server/spark_jobs/views.py", line 67, in 
process_job
from spark_jobs.job_runner import JobRunner
  File "/home/asoni/spark-server/spark_server/spark_jobs/job_runner.py", line 
2, in <module>
from spark_jobs.extraction_plan_group_runner import *
  File 
"/home/asoni/spark-server/spark_server/spark_jobs/extraction_plan_group_runner.py",
 line 1, in <module>
from spark_jobs.spark_connection import SparkConnection
  File "/home/asoni/spark-server/spark_server/spark_jobs/spark_connection.py", 
line 5, in <module>
class SparkConnection:
  File "/home/asoni/spark-server/spark_server/spark_jobs/spark_connection.py", 
line 7, in SparkConnection
sc = SparkContext(conf=conf)
  File 
"/home/asoni/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/context.py",
 line 114, in __init__
conf, jsc, profiler_cls)
  File 
"/home/asoni/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/context.py",
 line 225, in _do_init
signal.signal(signal.SIGINT, signal_handler)
  File "/home/asoni/.conda/envs/py35_spark_django/lib/python3.5/signal.py", 
line 47, in signal
handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11358) Deprecate `runs` in k-means

2015-11-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11358.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9322
[https://github.com/apache/spark/pull/9322]

> Deprecate `runs` in k-means
> ---
>
> Key: SPARK-11358
> URL: https://issues.apache.org/jira/browse/SPARK-11358
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.6.0
>
>
> `runs` introduces extra complexity and overhead in MLlib's k-means 
> implementation. I haven't seen much usage with `runs` not equal to `1`. We 
> can deprecate this method in 1.6, and remove or void it in 1.7. It helps us 
> simplify the implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11445) Replace example code in mllib-ensembles.md using include_example

2015-11-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11445:
--
Component/s: MLlib

> Replace example code in mllib-ensembles.md using include_example
> 
>
> Key: SPARK-11445
> URL: https://issues.apache.org/jira/browse/SPARK-11445
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Reporter: Gabor Liptak
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-11337) Make example code in user guide testable

2015-11-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11337:
--
Comment: was deleted

(was: I totally forgot I already made one ...)

> Make example code in user guide testable
> 
>
> Key: SPARK-11337
> URL: https://issues.apache.org/jira/browse/SPARK-11337
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>Priority: Critical
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to automatically test them. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> One option I propose is to move actual example code to spark/examples and 
> test compilation in Jenkins builds. Then in the markdown, we can reference 
> part of the code to show in the user guide. This requires adding a Jekyll tag 
> that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}
> {% include_example scala ml.KMeansExample guide %}
> {code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick code blocks marked "example" and put them under `{% highlight %}` in 
> the markdown. We can discuss the syntax for marker comments.
> Sub-tasks are created to move example code from user guide to `examples/`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11445) Replace example code in mllib-ensembles.md using include_example

2015-11-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11445:
--
Shepherd: Xusen Yin

> Replace example code in mllib-ensembles.md using include_example
> 
>
> Key: SPARK-11445
> URL: https://issues.apache.org/jira/browse/SPARK-11445
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Reporter: Gabor Liptak
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11371) Make "mean" an alias for "avg" operator

2015-11-02 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-11371:
-
Assignee: Ted Yu

> Make "mean" an alias for "avg" operator
> ---
>
> Key: SPARK-11371
> URL: https://issues.apache.org/jira/browse/SPARK-11371
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Minor
> Fix For: 1.6.0
>
> Attachments: spark-11371-v1.patch
>
>
> From Reynold in the thread 'Exception when using some aggregate operators'  
> (http://search-hadoop.com/m/q3RTt0xFr22nXB4/):
> I don't think these are bugs. The SQL standard for average is "avg", not 
> "mean". Similarly, a distinct count is supposed to be written as 
> "count(distinct col)", not "countDistinct(col)".
> We can, however, make "mean" an alias for "avg" to improve compatibility 
> between DataFrame and SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11371) Make "mean" an alias for "avg" operator

2015-11-02 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-11371.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9332
[https://github.com/apache/spark/pull/9332]

> Make "mean" an alias for "avg" operator
> ---
>
> Key: SPARK-11371
> URL: https://issues.apache.org/jira/browse/SPARK-11371
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Ted Yu
>Priority: Minor
> Fix For: 1.6.0
>
> Attachments: spark-11371-v1.patch
>
>
> From Reynold in the thread 'Exception when using some aggregate operators'  
> (http://search-hadoop.com/m/q3RTt0xFr22nXB4/):
> I don't think these are bugs. The SQL standard for average is "avg", not 
> "mean". Similarly, a distinct count is supposed to be written as 
> "count(distinct col)", not "countDistinct(col)".
> We can, however, make "mean" an alias for "avg" to improve compatibility 
> between DataFrame and SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10429) MutableProjection should evaluate all expressions first and then update the mutable row

2015-11-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-10429:
--

Assignee: Davies Liu  (was: Wenchen Fan)

> MutableProjection should evaluate all expressions first and then update the 
> mutable row
> ---
>
> Key: SPARK-10429
> URL: https://issues.apache.org/jira/browse/SPARK-10429
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Davies Liu
>Priority: Blocker
>
> Right now, SQL's mutable projection updates every value of the mutable 
> row right after it evaluates the corresponding expression. This makes the 
> behavior of MutableProjection confusing and complicates the implementation of 
> common aggregate functions like stddev, because developers need to be aware 
> that when evaluating the {{i+1}}th expression of a mutable projection, the 
> {{i}}th slot of the mutable row has already been updated.
> A better behavior for MutableProjection would be to evaluate all 
> expressions first and then update all values of the mutable row.
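A simplified sketch of the proposed behavior (plain Scala standing in for Catalyst's InternalRow and Expression; not the actual implementation): evaluate every expression against the unmodified input first, then copy the results into the target row.

{code}
// Each "expression" is modeled as a function over the current row values.
def project(expressions: Seq[Array[Any] => Any], row: Array[Any]): Unit = {
  // 1. Evaluate everything against the *unmodified* row.
  val buffer = expressions.map(e => e(row)).toArray
  // 2. Only then update the mutable row, so evaluating the (i+1)-th
  //    expression never observes the already-updated i-th slot.
  var i = 0
  while (i < buffer.length) {
    row(i) = buffer(i)
    i += 1
  }
}
{code}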



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11255) R Test build should run on R 3.1.1

2015-11-02 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986089#comment-14986089
 ] 

shane knapp commented on SPARK-11255:
-

okie dokie!

An RPM of 3.1.1-7 was found, and I have a staging VM up, so we should get the 
testing done this week.

If things go well before Thursday, I might roll the R changes in during the 
scheduled downtime then. If not, most likely the week after.

> R Test build should run on R 3.1.1
> --
>
> Key: SPARK-11255
> URL: https://issues.apache.org/jira/browse/SPARK-11255
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Felix Cheung
>Priority: Minor
>
> Test should run on R 3.1.1 which is the version listed as supported.
> Apparently there are few R changes that can go undetected since Jenkins Test 
> build is running something newer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11359) Kinesis receiver does not checkpoint to DynamoDB if there is no new data.

2015-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11359:


Assignee: Apache Spark  (was: Burak Yavuz)

> Kinesis receiver does not checkpoint to DynamoDB if there is no new data. 
> --
>
> Key: SPARK-11359
> URL: https://issues.apache.org/jira/browse/SPARK-11359
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> The current implementation of KinesisRecordProcessor checkpoints to DynamoDB 
> only when a new record is received. So if there is no new data for a while, 
> the latest received sequence numbers are not checkpointed. This is not 
> intuitive behavior and should be fixed using a timer task.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11336) Include a link to the source file in generated example code

2015-11-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11336:
--
Parent Issue: SPARK-11337  (was: SPARK-11137)

> Include a link to the source file in generated example code
> ---
>
> Key: SPARK-11336
> URL: https://issues.apache.org/jira/browse/SPARK-11336
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>
> It would be nice to include a link to the example source file at the bottom 
> of each code example, so that users who want to try it know where to find it. 
> The font size should be small and unobtrusive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


