[jira] [Assigned] (SPARK-8240) string function: concat

2015-07-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-8240:
--

Assignee: Reynold Xin  (was: Cheng Hao)

> string function: concat
> ---
>
> Key: SPARK-8240
> URL: https://issues.apache.org/jira/browse/SPARK-8240
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> concat(string|binary A, string|binary B...): string / binary
> Returns the string or bytes resulting from concatenating the strings or bytes 
> passed in as parameters in order. For example, concat('foo', 'bar') results 
> in 'foobar'. Note that this function can take any number of input strings.






[jira] [Commented] (SPARK-8240) string function: concat

2015-07-17 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632294#comment-14632294
 ] 

Reynold Xin commented on SPARK-8240:


[~adrian-wang] I had some time tonight and wrote a version of this that has 
codegen and avoids conversion back and forth between String and UTF8String.
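For illustration, a minimal sketch of the byte-level idea, not the actual patch: each input stays in its encoded UTF8String form and only raw bytes are copied, so no intermediate java.lang.String is ever materialized. The helper function is hypothetical; fromBytes/getBytes/numBytes are existing UTF8String methods.

{code}
// Illustrative sketch, not the code in the PR: concatenate UTF8Strings by
// copying their raw UTF-8 bytes, never converting to java.lang.String.
import org.apache.spark.unsafe.types.UTF8String

def concatUtf8(inputs: Seq[UTF8String]): UTF8String = {
  if (inputs.contains(null)) return null   // SQL semantics: any null input yields null
  val out = new Array[Byte](inputs.map(_.numBytes()).sum)
  var offset = 0
  for (s <- inputs) {
    val bytes = s.getBytes                 // copy of the underlying UTF-8 bytes
    System.arraycopy(bytes, 0, out, offset, bytes.length)
    offset += bytes.length
  }
  UTF8String.fromBytes(out)
}
{code}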


> string function: concat
> ---
>
> Key: SPARK-8240
> URL: https://issues.apache.org/jira/browse/SPARK-8240
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> concat(string|binary A, string|binary B...): string / binary
> Returns the string or bytes resulting from concatenating the strings or bytes 
> passed in as parameters in order. For example, concat('foo', 'bar') results 
> in 'foobar'. Note that this function can take any number of input strings.






[jira] [Updated] (SPARK-7218) Create a real iterator with open/close for Spark SQL

2015-07-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7218:
---
Target Version/s: 1.6.0

> Create a real iterator with open/close for Spark SQL
> 
>
> Key: SPARK-7218
> URL: https://issues.apache.org/jira/browse/SPARK-7218
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>







[jira] [Commented] (SPARK-8240) string function: concat

2015-07-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632292#comment-14632292
 ] 

Apache Spark commented on SPARK-8240:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/7486

> string function: concat
> ---
>
> Key: SPARK-8240
> URL: https://issues.apache.org/jira/browse/SPARK-8240
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Hao
>
> concat(string|binary A, string|binary B...): string / binary
> Returns the string or bytes resulting from concatenating the strings or bytes 
> passed in as parameters in order. For example, concat('foo', 'bar') results 
> in 'foobar'. Note that this function can take any number of input strings.






[jira] [Commented] (SPARK-9149) Add an example of spark.ml KMeans

2015-07-17 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632286#comment-14632286
 ] 

Yu Ishikawa commented on SPARK-9149:


Could you assign this issue to me?

> Add an example of spark.ml KMeans
> -
>
> Key: SPARK-9149
> URL: https://issues.apache.org/jira/browse/SPARK-9149
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples, ML
>Reporter: Yu Ishikawa
> Fix For: 1.5.0
>
>
> Create an example of KMeans API for spark.ml.






[jira] [Updated] (SPARK-9149) Add an example of spark.ml KMeans

2015-07-17 Thread Yu Ishikawa (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Ishikawa updated SPARK-9149:
---
Component/s: Examples

> Add an example of spark.ml KMeans
> -
>
> Key: SPARK-9149
> URL: https://issues.apache.org/jira/browse/SPARK-9149
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples, ML
>Reporter: Yu Ishikawa
> Fix For: 1.5.0
>
>
> Create an example of KMeans API for spark.ml.






[jira] [Created] (SPARK-9149) Add an example of spark.ml KMeans

2015-07-17 Thread Yu Ishikawa (JIRA)
Yu Ishikawa created SPARK-9149:
--

 Summary: Add an example of spark.ml KMeans
 Key: SPARK-9149
 URL: https://issues.apache.org/jira/browse/SPARK-9149
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Yu Ishikawa


Create an example of KMeans API for spark.ml.
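For reference, a rough sketch of what such an example might contain, assuming the spark.ml clustering API added by SPARK-7879 (KMeans in org.apache.spark.ml.clustering) and a spark-shell sqlContext:

{code}
// Rough sketch of a spark.ml KMeans example; API per SPARK-7879 (Spark 1.5).
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val dataset = sqlContext.createDataFrame(Seq(
  (0, Vectors.dense(0.0, 0.0)),
  (1, Vectors.dense(0.5, 0.5)),
  (2, Vectors.dense(9.0, 9.0)),
  (3, Vectors.dense(9.5, 8.5))
)).toDF("id", "features")

val kmeans = new KMeans()
  .setK(2)
  .setFeaturesCol("features")
  .setPredictionCol("prediction")
val model = kmeans.fit(dataset)
model.clusterCenters.foreach(println)   // prints the two learned centers
{code}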






[jira] [Commented] (SPARK-8916) Add @since tags to mllib.regression

2015-07-17 Thread Prayag Chandran Nirmala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632262#comment-14632262
 ] 

Prayag Chandran Nirmala commented on SPARK-8916:


I would like to take this up, if that's okay.

> Add @since tags to mllib.regression
> ---
>
> Key: SPARK-8916
> URL: https://issues.apache.org/jira/browse/SPARK-8916
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Reporter: Xiangrui Meng
>Priority: Minor
>  Labels: starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>







[jira] [Resolved] (SPARK-9118) Implement integer array parameters for ml.param as IntArrayParam

2015-07-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-9118.
--
   Resolution: Fixed
Fix Version/s: (was: 1.4.0)
   1.5.0

Issue resolved by pull request 7481
[https://github.com/apache/spark/pull/7481]

> Implement integer array parameters for ml.param as IntArrayParam
> 
>
> Key: SPARK-9118
> URL: https://issues.apache.org/jira/browse/SPARK-9118
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
>Priority: Minor
> Fix For: 1.5.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> ml/param/params.scala lacks an integer array parameter, which some models, 
> such as the multilayer perceptron, need in order to specify layer sizes. I 
> suggest implementing it as IntArrayParam, analogous to DoubleArrayParam.
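A minimal sketch of what mirroring DoubleArrayParam could look like; the constructor shape follows the existing ml.param classes, and details may differ from the change merged in pull request 7481:

{code}
// Sketch only; the merged change may differ in details.
import scala.collection.JavaConverters._
import org.apache.spark.ml.param.{Param, ParamPair, Params, ParamValidators}

class IntArrayParam(parent: Params, name: String, doc: String,
    isValid: Array[Int] => Boolean)
  extends Param[Array[Int]](parent, name, doc, isValid) {

  def this(parent: Params, name: String, doc: String) =
    this(parent, name, doc, ParamValidators.alwaysTrue)

  // Java-friendly param pair, mirroring DoubleArrayParam's helper.
  def w(value: java.util.List[java.lang.Integer]): ParamPair[Array[Int]] =
    w(value.asScala.map(_.intValue()).toArray)
}
{code}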






[jira] [Updated] (SPARK-9118) Implement integer array parameters for ml.param as IntArrayParam

2015-07-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-9118:
-
Assignee: Rekha Joshi

> Implement integer array parameters for ml.param as IntArrayParam
> 
>
> Key: SPARK-9118
> URL: https://issues.apache.org/jira/browse/SPARK-9118
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
>Assignee: Rekha Joshi
>Priority: Minor
> Fix For: 1.5.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> ml/param/params.scala lacks an integer array parameter, which some models, 
> such as the multilayer perceptron, need in order to specify layer sizes. I 
> suggest implementing it as IntArrayParam, analogous to DoubleArrayParam.






[jira] [Updated] (SPARK-8246) string function: get_json_object

2015-07-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8246:
---
Assignee: Nathan Howell  (was: Cheng Hao)

> string function: get_json_object
> 
>
> Key: SPARK-8246
> URL: https://issues.apache.org/jira/browse/SPARK-8246
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Nathan Howell
>
> get_json_object(string json_string, string path): string
> This is actually fairly complicated. Take a look at 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
> Only add this to SQL, not DataFrame.
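A hedged sketch of the intended SQL-only usage, with the Hive-compatible path syntax from the manual linked above (result shown per Hive's semantics); sqlContext is a spark-shell SQLContext:

{code}
// Usage sketch: get_json_object exposed to SQL only, not to DataFrame.
sqlContext.sql(
  """SELECT get_json_object('{"store": {"fruit": "apple"}}', '$.store.fruit')"""
).show()
// expected output: apple
{code}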






[jira] [Assigned] (SPARK-8246) string function: get_json_object

2015-07-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8246:
---

Assignee: Apache Spark  (was: Cheng Hao)

> string function: get_json_object
> 
>
> Key: SPARK-8246
> URL: https://issues.apache.org/jira/browse/SPARK-8246
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> get_json_object(string json_string, string path): string
> This is actually fairly complicated. Take a look at 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
> Only add this to SQL, not DataFrame.






[jira] [Assigned] (SPARK-8246) string function: get_json_object

2015-07-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8246:
---

Assignee: Cheng Hao  (was: Apache Spark)

> string function: get_json_object
> 
>
> Key: SPARK-8246
> URL: https://issues.apache.org/jira/browse/SPARK-8246
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Hao
>
> get_json_object(string json_string, string path): string
> This is actually fairly complicated. Take a look at 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
> Only add this to SQL, not DataFrame.






[jira] [Commented] (SPARK-8246) string function: get_json_object

2015-07-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632218#comment-14632218
 ] 

Apache Spark commented on SPARK-8246:
-

User 'NathanHowell' has created a pull request for this issue:
https://github.com/apache/spark/pull/7485

> string function: get_json_object
> 
>
> Key: SPARK-8246
> URL: https://issues.apache.org/jira/browse/SPARK-8246
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Hao
>
> get_json_object(string json_string, string path): string
> This is actually fairly complicated. Take a look at 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
> Only add this to SQL, not DataFrame.






[jira] [Created] (SPARK-9148) User-facing documentation for NaN handling semantics

2015-07-17 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-9148:
-

 Summary: User-facing documentation for NaN handling semantics
 Key: SPARK-9148
 URL: https://issues.apache.org/jira/browse/SPARK-9148
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SQL
Reporter: Josh Rosen


Once we've finalized our NaN changes for Spark 1.5, we need to create 
user-facing documentation to explain our chosen semantics.






[jira] [Assigned] (SPARK-8159) Improve SQL/DataFrame expression coverage

2015-07-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-8159:
--

Assignee: Reynold Xin

> Improve SQL/DataFrame expression coverage
> -
>
> Key: SPARK-8159
> URL: https://issues.apache.org/jira/browse/SPARK-8159
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This is an umbrella ticket to track new expressions we are adding to 
> SQL/DataFrame.
> For each new expression, we should:
> 1. Add a new Expression implementation in 
> org.apache.spark.sql.catalyst.expressions
> 2. If applicable, implement the code generated version (by implementing 
> genCode).
> 3. Add comprehensive unit tests (for all the data types the expressions 
> support).
> 4. If applicable, add a new function for DataFrame in 
> org.apache.spark.sql.functions, and python/pyspark/sql/functions.py for 
> Python.
> For date/time functions, put them in expressions/datetime.scala, and create a 
> DateTimeFunctionSuite.scala for testing.
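As a loose illustration of steps 1 and 2, a schematic expression follows. The exact Catalyst trait members changed during the 1.5 cycle, so every signature here is an assumption, not the real API:

{code}
// Schematic only: signatures are illustrative, not the exact Catalyst API.
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
import org.apache.spark.sql.types.{DataType, StringType}
import org.apache.spark.unsafe.types.UTF8String

case class ReverseString(child: Expression) extends UnaryExpression {
  override def dataType: DataType = StringType

  // Step 1: an interpreted eval honoring SQL null semantics.
  override def eval(input: InternalRow): Any = {
    val v = child.eval(input)
    if (v == null) null
    else UTF8String.fromString(v.asInstanceOf[UTF8String].toString.reverse)
  }

  // Step 2 would add the code-generated version by implementing genCode.
}
{code}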






[jira] [Assigned] (SPARK-8947) Improve expression type coercion, casting & checking

2015-07-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-8947:
--

Assignee: Reynold Xin

> Improve expression type coercion, casting & checking
> 
>
> Key: SPARK-8947
> URL: https://issues.apache.org/jira/browse/SPARK-8947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This is an umbrella ticket to improve type casting & checking.






[jira] [Commented] (SPARK-8846) Maintain binary compatibility for in function

2015-07-17 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632208#comment-14632208
 ] 

Reynold Xin commented on SPARK-8846:


[~yuu.ishik...@gmail.com] just a reminder.

> Maintain binary compatibility for in function
> -
>
> Key: SPARK-8846
> URL: https://issues.apache.org/jira/browse/SPARK-8846
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> In order to maintain binary compatibility, we should add a new "in" function 
> that takes Any, rather than changing the existing one.
> cc [~yuu.ishik...@gmail.com] can you work on this?
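A toy model of the constraint, not Spark's actual Column class: Scala varargs erase to Seq, so in(Any*) could not even overload in(Column*); a new method name keeps the old bytecode signature intact for already-compiled callers. The name isin below is purely illustrative.

{code}
// Toy illustration of the binary-compatibility issue; names are made up.
class Column(val expr: String) {
  // Existing signature: must stay unchanged for compiled callers.
  def in(list: Column*): Column =
    new Column(s"$expr IN (${list.map(_.expr).mkString(", ")})")

  // New function taking Any. After erasure both varargs methods would be
  // (Seq)Column, an illegal overload pair, hence the new name.
  def isin(list: Any*): Column =
    new Column(s"$expr IN (${list.mkString(", ")})")
}
{code}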






[jira] [Assigned] (SPARK-9146) NaN should be greater than all other values

2015-07-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9146:
---

Assignee: Josh Rosen  (was: Apache Spark)

> NaN should be greater than all other values
> ---
>
> Key: SPARK-9146
> URL: https://issues.apache.org/jira/browse/SPARK-9146
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Josh Rosen
>Priority: Critical
>
> Based on the design in SPARK-9079, NaN should be greater than all other 
> non-NaN numeric values.
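A small sketch of the ordering this implies: java.lang.Double.compare already places NaN above every other double, giving a total order.

{code}
// Double.compare treats NaN as the largest double value, yielding a total
// order (unlike <, which leaves NaN unordered).
val xs = Seq(Double.NaN, 1.0, Double.PositiveInfinity, -3.5)
val ascending = xs.sortWith((a, b) => java.lang.Double.compare(a, b) < 0)
// ascending == List(-3.5, 1.0, Infinity, NaN): NaN sorts last
{code}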






[jira] [Assigned] (SPARK-9076) Improve NaN value handling

2015-07-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-9076:
--

Assignee: Reynold Xin

> Improve NaN value handling
> --
>
> Key: SPARK-9076
> URL: https://issues.apache.org/jira/browse/SPARK-9076
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This is an umbrella ticket for handling NaN values.
> For general design, please see 
> https://issues.apache.org/jira/browse/SPARK-9079






[jira] [Updated] (SPARK-9076) Improve NaN value handling

2015-07-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9076:
---
Description: 
This is an umbrella ticket for handling NaN values.

For general design, please see https://issues.apache.org/jira/browse/SPARK-9079

  was:
This is an umbrella ticket for handling NaN values.



> Improve NaN value handling
> --
>
> Key: SPARK-9076
> URL: https://issues.apache.org/jira/browse/SPARK-9076
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> This is an umbrella ticket for handling NaN values.
> For general design, please see 
> https://issues.apache.org/jira/browse/SPARK-9079






[jira] [Assigned] (SPARK-9146) NaN should be greater than all other values

2015-07-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9146:
---

Assignee: Apache Spark  (was: Josh Rosen)

> NaN should be greater than all other values
> ---
>
> Key: SPARK-9146
> URL: https://issues.apache.org/jira/browse/SPARK-9146
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>Priority: Critical
>
> Based on the design in SPARK-9079, NaN should be greater than all other 
> non-NaN numeric values.






[jira] [Commented] (SPARK-9146) NaN should be greater than all other values

2015-07-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632207#comment-14632207
 ] 

Apache Spark commented on SPARK-9146:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/7194

> NaN should be greater than all other values
> ---
>
> Key: SPARK-9146
> URL: https://issues.apache.org/jira/browse/SPARK-9146
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Josh Rosen
>Priority: Critical
>
> Based on the design in SPARK-9079, NaN should be greater than all other 
> non-NaN numeric values.






[jira] [Updated] (SPARK-8797) Sorting float/double column containing NaNs can lead to "Comparison method violates its general contract!" errors

2015-07-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8797:
---
Assignee: Josh Rosen

> Sorting float/double column containing NaNs can lead to "Comparison method 
> violates its general contract!" errors
> -
>
> Key: SPARK-8797
> URL: https://issues.apache.org/jira/browse/SPARK-8797
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0, 1.5.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
>
> When sorting a float or double column that contains NaN (not a number) 
> values, TimSort may throw a "Comparison method violates its general 
> contract!" error.
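A minimal reproduction of the inconsistency (not Spark code): a comparator built from < and > reports NaN as equal to everything, which breaks transitivity, and TimSort detects that as a contract violation.

{code}
// Why a naive double comparator is inconsistent in the presence of NaN.
def naiveCompare(a: Double, b: Double): Int =
  if (a < b) -1 else if (a > b) 1 else 0   // NaN: both tests false => "equal"

naiveCompare(1.0, Double.NaN)   //  0: 1.0 "equals" NaN
naiveCompare(Double.NaN, 2.0)   //  0: NaN "equals" 2.0
naiveCompare(1.0, 2.0)          // -1: contradicts the two results above
// java.lang.Double.compare avoids this by ordering NaN above all doubles.
{code}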






[jira] [Updated] (SPARK-6573) Convert inbound NaN values as null

2015-07-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6573:
---
Assignee: Davies Liu

> Convert inbound NaN values as null
> --
>
> Key: SPARK-6573
> URL: https://issues.apache.org/jira/browse/SPARK-6573
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Fabian Boehnlein
>Assignee: Davies Liu
>
> In pandas it is common to use numpy.nan as the null value for missing data.
> http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions
> http://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none
> http://pandas.pydata.org/pandas-docs/dev/missing_data.html#filling-missing-values-fillna
> createDataFrame, however, only works with None as the null value, parsing it 
> as None in the RDD.
> I suggest adding support for np.nan values in pandas DataFrames.
> Current stack trace when creating a DataFrame from object-type columns 
> containing np.nan values (which are floats):
> {code}
> TypeError Traceback (most recent call last)
> <ipython-input-...> in <module>()
> > 1 sqldf = sqlCtx.createDataFrame(df_, schema=schema)
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in 
> createDataFrame(self, data, schema, samplingRatio)
> 339 schema = self._inferSchema(data.map(lambda r: 
> row_cls(*r)), samplingRatio)
> 340 
> --> 341 return self.applySchema(data, schema)
> 342 
> 343 def registerDataFrameAsTable(self, rdd, tableName):
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in 
> applySchema(self, rdd, schema)
> 246 
> 247 for row in rows:
> --> 248 _verify_type(row, schema)
> 249 
> 250 # convert python objects to sql data
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in 
> _verify_type(obj, dataType)
>1064  "length of fields (%d)" % (len(obj), 
> len(dataType.fields)))
>1065 for v, f in zip(obj, dataType.fields):
> -> 1066 _verify_type(v, f.dataType)
>1067 
>1068 _cached_cls = weakref.WeakValueDictionary()
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in 
> _verify_type(obj, dataType)
>1048 if type(obj) not in _acceptable_types[_type]:
>1049 raise TypeError("%s can not accept object in type %s"
> -> 1050 % (dataType, type(obj)))
>1051 
>1052 if isinstance(dataType, ArrayType):
> TypeError: StringType can not accept object in type <type 'float'>
> {code}






[jira] [Updated] (SPARK-9146) NaN should be greater than all other values

2015-07-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9146:
---
Assignee: Josh Rosen

> NaN should be greater than all other values
> ---
>
> Key: SPARK-9146
> URL: https://issues.apache.org/jira/browse/SPARK-9146
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Josh Rosen
>Priority: Critical
>
> Based on the design in SPARK-9079, NaN should be greater than all other 
> non-NaN numeric values.






[jira] [Resolved] (SPARK-7879) KMeans API for spark.ml Pipelines

2015-07-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-7879.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6756
[https://github.com/apache/spark/pull/6756]

> KMeans API for spark.ml Pipelines
> -
>
> Key: SPARK-7879
> URL: https://issues.apache.org/jira/browse/SPARK-7879
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yu Ishikawa
>Priority: Critical
> Fix For: 1.5.0
>
>
> Create a K-Means API for the spark.ml Pipelines API.  This should wrap the 
> existing KMeans implementation in spark.mllib.
> This should be the first clustering method added to Pipelines, and it will be 
> important to consider [SPARK-7610] and think about designing the clustering 
> API.  We do not have to have abstractions from the beginning (and probably 
> should not) but should think far enough ahead so we can add abstractions 
> later on.






[jira] [Resolved] (SPARK-8281) udf_asin and udf_acos test failure

2015-07-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8281.

   Resolution: Fixed
 Assignee: Yijie Shen
Fix Version/s: 1.5.0

> udf_asin and udf_acos test failure
> --
>
> Key: SPARK-8281
> URL: https://issues.apache.org/jira/browse/SPARK-8281
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Yijie Shen
>Priority: Blocker
> Fix For: 1.5.0
>
>
> acos/asin in Hive returns NaN for not a number, whereas we always return null.






[jira] [Resolved] (SPARK-8280) udf7 failed due to null vs nan semantics

2015-07-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8280.

   Resolution: Fixed
 Assignee: Yijie Shen
Fix Version/s: 1.5.0

> udf7 failed due to null vs nan semantics
> 
>
> Key: SPARK-8280
> URL: https://issues.apache.org/jira/browse/SPARK-8280
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Yijie Shen
>Priority: Blocker
> Fix For: 1.5.0
>
>
> To reproduce, run:
> {code}
> sbt/sbt -Phive -Dspark.hive.whitelist="udf7.*" "hive/test-only 
> org.apache.spark.sql.hive.execution.HiveCompatibilitySuite"
> {code}
> If we want to be consistent with Hive, we need to special case our log 
> function.






[jira] [Created] (SPARK-9147) UnsafeRow should canonicalize NaN values

2015-07-17 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-9147:
--

 Summary: UnsafeRow should canonicalize NaN values
 Key: SPARK-9147
 URL: https://issues.apache.org/jira/browse/SPARK-9147
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


NaN has many different representations in raw bytes.

When we set a double/float value, we should check whether it is NaN and, if 
so, write a canonicalized binary representation, so that we can compare the 
bytes directly.
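An illustrative sketch of the canonicalization, not UnsafeRow's actual writer: java.lang.Double.doubleToLongBits (unlike doubleToRawLongBits) already collapses every NaN bit pattern into the single canonical one.

{code}
// Illustrative sketch, not UnsafeRow code. All NaN payloads collapse to the
// canonical pattern, so byte-wise equality then works for NaN values.
def canonicalDoubleBits(v: Double): Long =
  java.lang.Double.doubleToLongBits(v)   // any NaN -> 0x7ff8000000000000L

canonicalDoubleBits(0.0 / 0.0) == canonicalDoubleBits(Double.NaN)   // true
{code}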







[jira] [Created] (SPARK-9146) NaN should be greater than all other values

2015-07-17 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-9146:
--

 Summary: NaN should be greater than all other values
 Key: SPARK-9146
 URL: https://issues.apache.org/jira/browse/SPARK-9146
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Critical


Based on the design in SPARK-9079, NaN should be greater than all other non-NaN 
numeric values.







[jira] [Created] (SPARK-9145) Equality test on NaN = NaN should return true

2015-07-17 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-9145:
--

 Summary: Equality test on NaN = NaN should return true
 Key: SPARK-9145
 URL: https://issues.apache.org/jira/browse/SPARK-9145
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Critical


Based on the design in SPARK-9079, we want NaN = NaN to return true in 
SQL/DataFrame.
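For contrast, a small sketch of how the primitive operators behave today, which the proposal deliberately departs from:

{code}
// Primitive semantics today vs. the proposed SQL/DataFrame semantics.
Double.NaN == Double.NaN                           // false (IEEE 754 equality)
java.lang.Double.compare(Double.NaN, Double.NaN)   // 0, i.e. "equal"
// Proposal: NaN = NaN in SQL/DataFrame should follow the second behavior.
{code}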







[jira] [Resolved] (SPARK-9079) Design NaN semantics

2015-07-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9079.

   Resolution: Fixed
 Assignee: Michael Armbrust
Fix Version/s: 1.5.0

> Design NaN semantics
> 
>
> Key: SPARK-9079
> URL: https://issues.apache.org/jira/browse/SPARK-9079
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
> Fix For: 1.5.0
>
>
> 1. What should NaN = NaN return?
> NaN = NaN should return true.
> 2. If we see NaN in the group by key column, should we group NaN values into 
> one group, or into different groups?
> All NaN values should be grouped together.
> 3. What about NaN in join keys?
> NaN should be treated as a normal value in join keys.
> 4. When aggregating over columns containing NaN, should the result be NaN, or 
> should the result exclude NaN values (treating them like nulls)?
> This is TO BE DECIDED. By default, the behavior is to return NaN.
> 5. Where should NaN go in sorting?
> NaN should go last when in ascending order, larger than any other numeric 
> value.
> Note that 5 is much more important than the other 4 since right now the 
> sorter throws exceptions on NaN values. See SPARK-8797.






[jira] [Updated] (SPARK-9079) Design NaN semantics

2015-07-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9079:
---
Description: 
1. What should NaN = NaN return?

NaN = NaN should return true.

2. If we see NaN in the group by key column, should we group NaN values into 
one group, or into different groups?

All NaN values should be grouped together.

3. What about NaN in join keys?

NaN should be treated as a normal value in join keys.

4. When aggregating over columns containing NaN, should the result be NaN, or 
should the result exclude NaN values (treating them like nulls)?

5. Where should NaN go in sorting?

NaN should go last when in ascending order, larger than any other numeric value.


Note that 5 is much more important than the other 4 since right now the sorter 
throws exceptions on NaN values. See SPARK-8797.


  was:
1. What should NaN = NaN return?

2. If we see NaN in the group by key column, should we group NaN values into 
one group, or into different groups?

3. What about NaN in join keys?

4. When aggregating over columns containing NaN, should the result be NaN, or 
should the result exclude NaN values (treating them like nulls)?

5. Where should NaN go in sorting?

Note that 5 is much more important than the other 4 since right now the sorter 
throws exceptions on NaN values. See SPARK-8797.



> Design NaN semantics
> 
>
> Key: SPARK-9079
> URL: https://issues.apache.org/jira/browse/SPARK-9079
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
> Fix For: 1.5.0
>
>
> 1. What should NaN = NaN return?
> NaN = NaN should return true.
> 2. If we see NaN in the group by key column, should we group NaN values into 
> one group, or into different groups?
> All NaN values should be grouped together.
> 3. What about NaN in join keys?
> NaN should be treated as a normal value in join keys.
> 4. When aggregating over columns containing NaN, should the result be NaN, or 
> should the result exclude NaN values (treating them like nulls)?
> 5. Where should NaN go in sorting?
> NaN should go last when in ascending order, larger than any other numeric 
> value.
> Note that 5 is much more important than the other 4 since right now the 
> sorter throws exceptions on NaN values. See SPARK-8797.






[jira] [Updated] (SPARK-9079) Design NaN semantics

2015-07-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9079:
---
Description: 
1. What should NaN = NaN return?

NaN = NaN should return true.

2. If we see NaN in the group by key column, should we group NaN values into 
one group, or into different groups?

All NaN values should be grouped together.

3. What about NaN in join keys?

NaN should be treated as a normal value in join keys.

4. When aggregating over columns containing NaN, should the result be NaN, or 
should the result exclude NaN values (treating them like nulls)?

This is TO BE DECIDED. By default, the behavior is to return NaN.


5. Where should NaN go in sorting?

NaN should go last when in ascending order, larger than any other numeric value.


Note that 5 is much more important than the other 4 since right now the sorter 
throws exceptions on NaN values. See SPARK-8797.


  was:
1. What should NaN = NaN return?

NaN = NaN should return true.

2. If we see NaN in the group by key column, should we group NaN values into 
one group, or into different groups?

All NaN values should be grouped together.

3. What about NaN in join keys?

NaN should be treated as a normal value in join keys.

4. When aggregating over columns containing NaN, should the result be NaN, or 
should the result exclude NaN values (treating them like nulls)?

5. Where should NaN go in sorting?

NaN should go last when in ascending order, larger than any other numeric value.


Note that 5 is much more important than the other 4 since right now the sorter 
throws exceptions on NaN values. See SPARK-8797.



> Design NaN semantics
> 
>
> Key: SPARK-9079
> URL: https://issues.apache.org/jira/browse/SPARK-9079
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
> Fix For: 1.5.0
>
>
> 1. What should NaN = NaN return?
> NaN = NaN should return true.
> 2. If we see NaN in the group by key column, should we group NaN values into 
> one group, or into different groups?
> All NaN values should be grouped together.
> 3. What about NaN in join keys?
> NaN should be treated as a normal value in join keys.
> 4. When aggregating over columns containing NaN, should the result be NaN, or 
> should the result exclude NaN values (treating them like nulls)?
> This is TO BE DECIDED. By default, the behavior is to return NaN.
> 5. Where should NaN go in sorting?
> NaN should go last when in ascending order, larger than any other numeric 
> value.
> Note that 5 is much more important than the other 4 since right now the 
> sorter throws exceptions on NaN values. See SPARK-8797.






[jira] [Resolved] (SPARK-7026) LeftSemiJoin can not work when it has both equal condition and not equal condition.

2015-07-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-7026.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 5643
[https://github.com/apache/spark/pull/5643]

> LeftSemiJoin can not work when it  has both equal condition and not equal 
> condition. 
> -
>
> Key: SPARK-7026
> URL: https://issues.apache.org/jira/browse/SPARK-7026
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Zhongshuai Pei
>Assignee: Adrian Wang
> Fix For: 1.5.0
>
>
> Run sql like that 
> {panel}
> select *
> from
> web_sales ws1
> left semi join
> web_sales ws2
> on ws1.ws_order_number = ws2.ws_order_number
> and ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk 
> {panel}
>  then get an exception
> {panel}
> Couldn't find ws_warehouse_sk#287 in 
> {ws_sold_date_sk#237,ws_sold_time_sk#238,ws_ship_date_sk#239,ws_item_sk#240,ws_bill_customer_sk#241,ws_bill_cdemo_sk#242,ws_bill_hdemo_sk#243,ws_bill_addr_sk#244,ws_ship_customer_sk#245,ws_ship_cdemo_sk#246,ws_ship_hdemo_sk#247,ws_ship_addr_sk#248,ws_web_page_sk#249,ws_web_site_sk#250,ws_ship_mode_sk#251,ws_warehouse_sk#252,ws_promo_sk#253,ws_order_number#254,ws_quantity#255,ws_wholesale_cost#256,ws_list_price#257,ws_sales_price#258,ws_ext_discount_amt#259,ws_ext_sales_price#260,ws_ext_wholesale_cost#261,ws_ext_list_price#262,ws_ext_tax#263,ws_coupon_amt#264,ws_ext_ship_cost#265,ws_net_paid#266,ws_net_paid_inc_tax#267,ws_net_paid_inc_ship#268,ws_net_paid_inc_ship_tax#269,ws_net_profit#270,ws_sold_date#236}
> at scala.sys.package$.error(package.scala:27)
> {panel}






[jira] [Commented] (SPARK-9081) fillna/dropna should also fill/drop NaN values in addition to null values

2015-07-17 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632107#comment-14632107
 ] 

Reynold Xin commented on SPARK-9081:


[~yijieshen] can you take this one?


> fillna/dropna should also fill/drop NaN values in addition to null values
> -
>
> Key: SPARK-9081
> URL: https://issues.apache.org/jira/browse/SPARK-9081
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>







[jira] [Resolved] (SPARK-9117) fix BooleanSimplification in case-insensitive

2015-07-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-9117.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7452
[https://github.com/apache/spark/pull/7452]

> fix BooleanSimplification in case-insensitive
> -
>
> Key: SPARK-9117
> URL: https://issues.apache.org/jira/browse/SPARK-9117
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
> Fix For: 1.5.0
>
>







[jira] [Resolved] (SPARK-9113) enable analysis check code for self join

2015-07-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-9113.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7449
[https://github.com/apache/spark/pull/7449]

> enable analysis check code for self join
> 
>
> Key: SPARK-9113
> URL: https://issues.apache.org/jira/browse/SPARK-9113
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Trivial
> Fix For: 1.5.0
>
>







[jira] [Updated] (SPARK-9144) Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled

2015-07-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-9144:
--
Component/s: Scheduler

> Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled
> ---
>
> Key: SPARK-9144
> URL: https://issues.apache.org/jira/browse/SPARK-9144
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Spark has an option called {{spark.localExecution.enabled}}; according to the 
> docs:
> {quote}
> Enables Spark to run certain jobs, such as first() or take() on the driver, 
> without sending tasks to the cluster. This can make certain jobs execute very 
> quickly, but may require shipping a whole partition of data to the driver.
> {quote}
> This feature ends up adding quite a bit of complexity to DAGScheduler, 
> especially in the {{runLocallyWithinThread}} method, but as far as I know 
> nobody uses this feature (I searched the mailing list and haven't seen any 
> recent mentions of the configuration nor stacktraces including the runLocally 
> method).  As a step towards scheduler complexity reduction, I propose that we 
> remove this feature and all code related to it for Spark 1.5. 






[jira] [Updated] (SPARK-9144) Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled

2015-07-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-9144:
--
Issue Type: Improvement  (was: New Feature)

> Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled
> ---
>
> Key: SPARK-9144
> URL: https://issues.apache.org/jira/browse/SPARK-9144
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Spark has an option called {{spark.localExecution.enabled}}; according to the 
> docs:
> {quote}
> Enables Spark to run certain jobs, such as first() or take() on the driver, 
> without sending tasks to the cluster. This can make certain jobs execute very 
> quickly, but may require shipping a whole partition of data to the driver.
> {quote}
> This feature ends up adding quite a bit of complexity to DAGScheduler, 
> especially in the {{runLocallyWithinThread}} method, but as far as I know 
> nobody uses this feature (I searched the mailing list and haven't seen any 
> recent mentions of the configuration nor stacktraces including the runLocally 
> method).  As a step towards scheduler complexity reduction, I propose that we 
> remove this feature and all code related to it for Spark 1.5. 






[jira] [Commented] (SPARK-9144) Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled

2015-07-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632075#comment-14632075
 ] 

Apache Spark commented on SPARK-9144:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/7484

> Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled
> ---
>
> Key: SPARK-9144
> URL: https://issues.apache.org/jira/browse/SPARK-9144
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Spark has an option called {{spark.localExecution.enabled}}; according to the 
> docs:
> {quote}
> Enables Spark to run certain jobs, such as first() or take() on the driver, 
> without sending tasks to the cluster. This can make certain jobs execute very 
> quickly, but may require shipping a whole partition of data to the driver.
> {quote}
> This feature ends up adding quite a bit of complexity to DAGScheduler, 
> especially in the {{runLocallyWithinThread}} method, but as far as I know 
> nobody uses this feature (I searched the mailing list and haven't seen any 
> recent mentions of the configuration nor stacktraces including the runLocally 
> method).  As a step towards scheduler complexity reduction, I propose that we 
> remove this feature and all code related to it for Spark 1.5. 






[jira] [Assigned] (SPARK-9144) Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled

2015-07-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9144:
---

Assignee: Josh Rosen  (was: Apache Spark)

> Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled
> ---
>
> Key: SPARK-9144
> URL: https://issues.apache.org/jira/browse/SPARK-9144
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Spark has an option called {{spark.localExecution.enabled}}; according to the 
> docs:
> {quote}
> Enables Spark to run certain jobs, such as first() or take() on the driver, 
> without sending tasks to the cluster. This can make certain jobs execute very 
> quickly, but may require shipping a whole partition of data to the driver.
> {quote}
> This feature ends up adding quite a bit of complexity to DAGScheduler, 
> especially in the {{runLocallyWithinThread}} method, but as far as I know 
> nobody uses this feature (I searched the mailing list and haven't seen any 
> recent mentions of the configuration nor stacktraces including the runLocally 
> method).  As a step towards scheduler complexity reduction, I propose that we 
> remove this feature and all code related to it for Spark 1.5. 






[jira] [Assigned] (SPARK-9144) Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled

2015-07-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9144:
---

Assignee: Apache Spark  (was: Josh Rosen)

> Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled
> ---
>
> Key: SPARK-9144
> URL: https://issues.apache.org/jira/browse/SPARK-9144
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> Spark has an option called {{spark.localExecution.enabled}}; according to the 
> docs:
> {quote}
> Enables Spark to run certain jobs, such as first() or take() on the driver, 
> without sending tasks to the cluster. This can make certain jobs execute very 
> quickly, but may require shipping a whole partition of data to the driver.
> {quote}
> This feature ends up adding quite a bit of complexity to DAGScheduler, 
> especially in the {{runLocallyWithinThread}} method, but as far as I know 
> nobody uses this feature (I searched the mailing list and haven't seen any 
> recent mentions of the configuration nor stacktraces including the runLocally 
> method).  As a step towards scheduler complexity reduction, I propose that we 
> remove this feature and all code related to it for Spark 1.5. 






[jira] [Commented] (SPARK-8794) Column pruning isn't applied beneath sample

2015-07-17 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632074#comment-14632074
 ] 

Michael Armbrust commented on SPARK-8794:
-

Unfortunately, we typically avoid backporting anything that is not a bug fix to 
release branches.  We really want to avoid unintended regressions so that it is 
very safe for people to upgrade.

> Column pruning isn't applied beneath sample
> ---
>
> Key: SPARK-8794
> URL: https://issues.apache.org/jira/browse/SPARK-8794
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Eron Wright 
>Assignee: Liang-Chi Hsieh
> Fix For: 1.5.0
>
>
> I observe that certain transformations (e.g. sample) on DataFrame cause the 
> underlying relation's support for column pruning to be disregarded in 
> subsequent queries.
> I encountered this issue while using an ML pipeline with a typical dataset of 
> (label, features).   For my particular data source (which implements 
> PrunedScan), the 'features' column is expensive to compute while the 'label' 
> column is cheap.  The first stage of the pipeline - StringIndexer - operates 
> only on the label and so should be quick.   Yet I found that the 'features' 
> column would be materialized.  Upon investigation, I found that the issue 
> occurs when the dataset is split into train/test with sampling.  The 
> sampling transformation causes the pruning optimization to be lost.
> See this gist for a sample program demonstrating the issue:
> [https://gist.github.com/EronWright/cb5fb9af46fd810194f8]






[jira] [Resolved] (SPARK-9080) IsNaN expression

2015-07-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9080.

   Resolution: Fixed
 Assignee: Yijie Shen
Fix Version/s: 1.5.0

> IsNaN expression
> 
>
> Key: SPARK-9080
> URL: https://issues.apache.org/jira/browse/SPARK-9080
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Yijie Shen
>Priority: Critical
> Fix For: 1.5.0
>
>
> Add IsNaN expression to return true if the input double/float value is NaN.
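A hedged sketch of the intended usage once the expression is in, assuming the SQL function is registered as isnan; sqlContext is a spark-shell SQLContext:

{code}
// Assumed usage sketch for the IsNaN expression.
sqlContext.sql(
  "SELECT isnan(CAST('NaN' AS DOUBLE)) AS a, isnan(CAST(1.0 AS DOUBLE)) AS b"
).show()
// expected: a = true, b = false
{code}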






[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2015-07-17 Thread Nick Buroojy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632061#comment-14632061
 ] 

Nick Buroojy commented on SPARK-8418:
-

I like this idea a lot, and think it would solve one of our main performance 
issues with the ml api.

Our data set has hundreds of string features that we need to convert into 
binary vectors. We have found the latency overhead of processing the features 
one at a time with a StringVectorizer (SPARK-7290) to be unbearable. We wrote a 
custom Estimator to vectorize all string columns in only a couple of passes 
over the data set and found significant performance gains.

I suspect that we aren't the only users with many columns, so we would love to 
fix this issue upstream with some sort of multi-column interface to 
transformers and estimators.

I suppose we could make do with the Vector or Array interface using the 
VectorAssembler as described in this ticket; however, I think the cleanest 
interface for us would be a Map from source column to dest column.

As far as sharing code goes, there are at least two strategies:
1) Use the single-value implementation as it is today, and add a multi-value 
view on top of it (a schematic follows at the end of this comment). For 
example, StringVectorizer.setInputCols(Array[A, B]) would return a pipeline of 
[StringVectorizer.setInputCol(A), StringVectorizer.setInputCol(B)].
2) Reimplement each transformer to support a multi-value implementation and 
make the single-value interface a trivial invocation of the multi-value code. 
For example, StringVectorizer.setInputCol(A) would invoke 
StringVectorizer.setInputCols(Array[A]).

The obvious downside of 1 is that it wouldn't address the performance issues we 
ran into with hundreds of columns. The upsides are minimal implementation 
effort and simpler code to maintain.

The main downside of 2 is more upfront effort to implement multi-value 
transformations, but the upside is reasonable performance with "wide" data sets.

I don't think 1 and 2 are mutually exclusive. Maybe the multi-value interface 
could be solidified first with the 1 implementation, then over time the key 
transformers, like StringVectorizer, could be rewritten to 2?

You mentioned that this would require a short design doc. Can I help with that?
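To make strategy 1 concrete, a schematic under stated assumptions: StringIndexer, an existing single-column transformer, stands in for the proposed StringVectorizer (which does not exist yet), and the output column naming is arbitrary.

{code}
// Schematic of strategy 1: a multi-column view assembled from single-column
// stages. StringIndexer stands in for the proposed StringVectorizer.
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.StringIndexer

def multiColumnIndexer(inputCols: Array[String]): Pipeline = {
  val stages: Array[PipelineStage] = inputCols.map { c =>
    new StringIndexer().setInputCol(c).setOutputCol(c + "_idx")
  }
  new Pipeline().setStages(stages)
}
{code}

Note that this still fits one stage per column, i.e. one pass over the data per column, which is exactly the latency problem described above; only strategy 2 removes it.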

> Add single- and multi-value support to ML Transformers
> --
>
> Key: SPARK-8418
> URL: https://issues.apache.org/jira/browse/SPARK-8418
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming 
> columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be 
> necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9142) Removing unnecessary self types in Catalyst

2015-07-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-9142.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7479
[https://github.com/apache/spark/pull/7479]

> Removing unnecessary self types in Catalyst
> ---
>
> Key: SPARK-9142
> URL: https://issues.apache.org/jira/browse/SPARK-9142
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.5.0
>
>
> A small change, based on code review and offline discussion with [~dragos].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5571) LDA should handle text as well

2015-07-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632052#comment-14632052
 ] 

Joseph K. Bradley commented on SPARK-5571:
--

Stemmer: We'll need to be careful about adding dependencies on other libraries. 
 We strongly prefer avoiding that if possible.  If code can be copied and 
modified (assuming the license is friendly to copying), that might be 
preferable if the code is relatively simple.

Stopwords: Sounds good.

LDA.runText: I'd prefer this handle everything automatically: A user gives an 
unfiltered corpus and LDA handles it.  This actually probably requires a quick 
design doc since I have not thought through the complexities.

Pipeline: I agree this might work well under the Pipelines API.  Here's what I 
propose:
* For now, we focus on adding the necessary transformers individually: stemmer, 
stopwords filter.
* For the next release, we design a good way to provide this functionality 
under Pipelines.

If that sounds good, we can create & link JIRAs for those transformers, and 
I'll move the target version for this JIRA to 1.6.  What do you think?

> LDA should handle text as well
> --
>
> Key: SPARK-5571
> URL: https://issues.apache.org/jira/browse/SPARK-5571
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> Latent Dirichlet Allocation (LDA) currently operates only on vectors of word 
> counts.  It should also support training and prediction using text 
> (Strings).
> This plan is sketched in the [original LDA design 
> doc|https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing].
> There should be:
> * runWithText() method which takes an RDD with a collection of Strings (bags 
> of words).  This will also index terms and compute a dictionary.
> * dictionary parameter for when LDA is run with word count vectors
> * prediction/feedback methods returning Strings (such as 
> describeTopicsAsStrings, which is commented out in LDA currently)
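
For reference, a minimal sketch of the preprocessing that a runWithText() 
method would fold into LDA itself (spark-shell style, with sc in scope; the 
names are illustrative):

{code:scala}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Tokenize, build the dictionary, and turn each document into a word-count
// vector: the steps runWithText() would handle automatically.
val corpus = sc.parallelize(Seq("spark mllib lda", "lda topic model"))
val tokenized = corpus.map(_.split("\\s+").toSeq)

// Dictionary: term -> index.
val vocab = tokenized.flatMap(identity).distinct().collect().zipWithIndex.toMap

val docs = tokenized.zipWithIndex.map { case (tokens, id) =>
  val counts = tokens.groupBy(t => vocab(t)).mapValues(_.size.toDouble).toSeq.sortBy(_._1)
  (id, Vectors.sparse(vocab.size, counts))
}
val model = new LDA().setK(2).run(docs)
{code}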



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7690) MulticlassClassificationEvaluator for tuning Multiclass Classifiers

2015-07-17 Thread Ram Sriharsha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Sriharsha updated SPARK-7690:
-
Shepherd: Ram Sriharsha

> MulticlassClassificationEvaluator for tuning Multiclass Classifiers
> ---
>
> Key: SPARK-7690
> URL: https://issues.apache.org/jira/browse/SPARK-7690
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Ram Sriharsha
>Assignee: Eron Wright 
>
> Provide a MulticlassClassificationEvaluator with a weighted F1-score to tune 
> multiclass classifiers using the Pipeline API.
> MLlib already provides MulticlassMetrics functionality, which a 
> MulticlassClassificationEvaluator can wrap to expose the weighted F1-score as 
> its metric.
> The functionality could be similar to 
> scikit-learn (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)
> in that we can support micro, macro and weighted versions of the F1-score 
> (with weighted being the default).
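
In miniature, the wrapping described above (spark-shell style, with sc in 
scope; the data values are illustrative):

{code:scala}
import org.apache.spark.mllib.evaluation.MulticlassMetrics

// (prediction, label) pairs; a real evaluator would extract these from a DataFrame.
val predictionAndLabels = sc.parallelize(
  Seq((0.0, 0.0), (1.0, 1.0), (1.0, 2.0), (2.0, 2.0)))

val metrics = new MulticlassMetrics(predictionAndLabels)
val weightedF1 = metrics.weightedFMeasure  // the value the evaluator would report
{code}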



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7690) MulticlassClassificationEvaluator for tuning Multiclass Classifiers

2015-07-17 Thread Ram Sriharsha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Sriharsha updated SPARK-7690:
-
Assignee: Eron Wright   (was: Ram Sriharsha)

> MulticlassClassificationEvaluator for tuning Multiclass Classifiers
> ---
>
> Key: SPARK-7690
> URL: https://issues.apache.org/jira/browse/SPARK-7690
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Ram Sriharsha
>Assignee: Eron Wright 
>
> Provide a MulticlassClassificationEvaluator with a weighted F1-score to tune 
> multiclass classifiers using the Pipeline API.
> MLlib already provides MulticlassMetrics functionality, which a 
> MulticlassClassificationEvaluator can wrap to expose the weighted F1-score as 
> its metric.
> The functionality could be similar to 
> scikit-learn (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)
> in that we can support micro, macro and weighted versions of the F1-score 
> (with weighted being the default).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9144) Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled

2015-07-17 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-9144:
-

 Summary: Remove DAGScheduler.runLocallyWithinThread and 
spark.localExecution.enabled
 Key: SPARK-9144
 URL: https://issues.apache.org/jira/browse/SPARK-9144
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Josh Rosen
Assignee: Josh Rosen


Spark has an option called {{spark.localExecution.enabled}}; according to the 
docs:

{quote}
Enables Spark to run certain jobs, such as first() or take() on the driver, 
without sending tasks to the cluster. This can make certain jobs execute very 
quickly, but may require shipping a whole partition of data to the driver.
{quote}

This feature ends up adding quite a bit of complexity to DAGScheduler, 
especially in the {{runLocallyWithinThread}} method, but as far as I know 
nobody uses it (I searched the mailing list and haven't seen any recent 
mentions of the configuration or stacktraces including the runLocally method). 
As a step towards reducing scheduler complexity, I propose that we remove this 
feature and all code related to it for Spark 1.5.
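
For anyone checking whether they depend on it, the flag is set like any other 
Spark property; a minimal local example:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("local-execution-check")
  .setMaster("local[2]")
  .set("spark.localExecution.enabled", "true") // the flag proposed for removal

val sc = new SparkContext(conf)
// With the flag on, first()/take() may run on the driver without scheduling tasks.
val head = sc.parallelize(1 to 100, numSlices = 4).first()
sc.stop()
{code}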



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8668) expr function to convert SQL expression into a Column

2015-07-17 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631975#comment-14631975
 ] 

Reynold Xin commented on SPARK-8668:


Yes exactly!


> expr function to convert SQL expression into a Column
> -
>
> Key: SPARK-8668
> URL: https://issues.apache.org/jira/browse/SPARK-8668
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> selectExpr uses the expression parser to parse string expressions. It would 
> be great to create an "expr" function in functions.scala/functions.py that 
> converts a string into an expression (or a list of expressions separated by 
> commas).
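
One possible shape for the proposed function, reusing the same parser path 
that selectExpr goes through (parseExpression and the Column companion are 
internal APIs, so treat this as a sketch of the idea rather than the final 
implementation):

{code:scala}
package org.apache.spark.sql // sketch only: placed here so internal helpers resolve

import org.apache.spark.sql.catalyst.SqlParser

object ExprFunction {
  // Mirrors what selectExpr does for each input string.
  def expr(sqlExpr: String): Column = Column(SqlParser.parseExpression(sqlExpr))
}
{code}

Call sites would then read like df.select(expr("value * 2"), expr("abs(value)")).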



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8593) History Server doesn't show complete application when one attempt inprogress

2015-07-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8593:
-
Assignee: Rekha Joshi

> History Server doesn't show complete application when one attempt inprogress
> 
>
> Key: SPARK-8593
> URL: https://issues.apache.org/jira/browse/SPARK-8593
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.4.0
>Reporter: Thomas Graves
>Assignee: Rekha Joshi
> Fix For: 1.4.2, 1.5.0
>
>
> The Spark history server doesn't show an application if the first attempt of 
> the application is still in progress.
> Here are the files in HDFS:
> -rwxrwx---   3 tgraves hdfs      234 2015-06-24 15:49 sparkhistory/application_1433751980223_18926_1.inprogress
> -rwxrwx---   3 tgraves hdfs  9609450 2015-06-24 15:51 sparkhistory/application_1433751980223_18926_2
> The UI shows them if I set showIncomplete=true.
> Removing the .inprogress file allows it to show up when showIncomplete is 
> false.
> It should be smart enough to at least show the second successful attempt.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8593) History Server doesn't show complete application when one attempt inprogress

2015-07-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8593.
--
   Resolution: Fixed
Fix Version/s: 1.5.0
   1.4.2

Issue resolved by pull request 7253
[https://github.com/apache/spark/pull/7253]

> History Server doesn't show complete application when one attempt inprogress
> 
>
> Key: SPARK-8593
> URL: https://issues.apache.org/jira/browse/SPARK-8593
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.4.0
>Reporter: Thomas Graves
> Fix For: 1.4.2, 1.5.0
>
>
> The Spark history server doesn't show an application if the first attempt of 
> the application is still in progress.
> Here are the files in HDFS:
> -rwxrwx---   3 tgraves hdfs      234 2015-06-24 15:49 sparkhistory/application_1433751980223_18926_1.inprogress
> -rwxrwx---   3 tgraves hdfs  9609450 2015-06-24 15:51 sparkhistory/application_1433751980223_18926_2
> The UI shows them if I set showIncomplete=true.
> Removing the .inprogress file allows it to show up when showIncomplete is 
> false.
> It should be smart enough to at least show the second successful attempt.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6805) ML Pipeline API in SparkR

2015-07-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6805:
---

Assignee: Apache Spark

> ML Pipeline API in SparkR
> -
>
> Key: SPARK-6805
> URL: https://issues.apache.org/jira/browse/SPARK-6805
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Critical
>
> SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API 
> in SparkR. The implementation should be similar to the pipeline API 
> implementation in Python.
> For Spark 1.5, we want to support linear/logistic regression in SparkR, with 
> basic support for R formula and elastic-net regularization. The design doc 
> can be viewed at 
> https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6805) ML Pipeline API in SparkR

2015-07-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6805:
---

Assignee: (was: Apache Spark)

> ML Pipeline API in SparkR
> -
>
> Key: SPARK-6805
> URL: https://issues.apache.org/jira/browse/SPARK-6805
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Priority: Critical
>
> SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API 
> in SparkR. The implementation should be similar to the pipeline API 
> implementation in Python.
> For Spark 1.5, we want to support linear/logistic regression in SparkR, with 
> basic support for R formula and elastic-net regularization. The design doc 
> can be viewed at 
> https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6805) ML Pipeline API in SparkR

2015-07-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631962#comment-14631962
 ] 

Apache Spark commented on SPARK-6805:
-

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/7483

> ML Pipeline API in SparkR
> -
>
> Key: SPARK-6805
> URL: https://issues.apache.org/jira/browse/SPARK-6805
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Priority: Critical
>
> SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API 
> in SparkR. The implementation should be similar to the pipeline API 
> implementation in Python.
> For Spark 1.5, we want to support linear/logistic regression in SparkR, with 
> basic support for R formula and elastic-net regularization. The design doc 
> can be viewed at 
> https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9143) Add planner rule for automatically inserting Unsafe <-> Safe row format converters

2015-07-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9143:
---

Assignee: Josh Rosen  (was: Apache Spark)

> Add planner rule for automatically inserting Unsafe <-> Safe row format 
> converters
> --
>
> Key: SPARK-9143
> URL: https://issues.apache.org/jira/browse/SPARK-9143
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Now that we have two different internal row formats, UnsafeRow and the old 
> Java-object-based row format, we end up having to perform conversions between 
> these two formats. These conversions should not be performed by the operators 
> themselves; instead, the planner should be responsible for inserting 
> appropriate format conversions when they are needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9143) Add planner rule for automatically inserting Unsafe <-> Safe row format converters

2015-07-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631956#comment-14631956
 ] 

Apache Spark commented on SPARK-9143:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/7482

> Add planner rule for automatically inserting Unsafe <-> Safe row format 
> converters
> --
>
> Key: SPARK-9143
> URL: https://issues.apache.org/jira/browse/SPARK-9143
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Now that we have two different internal row formats, UnsafeRow and the old 
> Java-object-based row format, we end up having to perform conversions between 
> these two formats. These conversions should not be performed by the operators 
> themselves; instead, the planner should be responsible for inserting 
> appropriate format conversions when they are needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9143) Add planner rule for automatically inserting Unsafe <-> Safe row format converters

2015-07-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9143:
---

Assignee: Apache Spark  (was: Josh Rosen)

> Add planner rule for automatically inserting Unsafe <-> Safe row format 
> converters
> --
>
> Key: SPARK-9143
> URL: https://issues.apache.org/jira/browse/SPARK-9143
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> Now that we have two different internal row formats, UnsafeRow and the old 
> Java-object-based row format, we end up having to perform conversions between 
> these two formats. These conversions should not be performed by the operators 
> themselves; instead, the planner should be responsible for inserting 
> appropriate format conversions when they are needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9143) Add planner rule for automatically inserting Unsafe <-> Safe row format converters

2015-07-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-9143:
--
Shepherd: Michael Armbrust

> Add planner rule for automatically inserting Unsafe <-> Safe row format 
> converters
> --
>
> Key: SPARK-9143
> URL: https://issues.apache.org/jira/browse/SPARK-9143
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Now that we have two different internal row formats, UnsafeRow and the old 
> Java-object-based row format, we end up having to perform conversions between 
> these two formats. These conversions should not be performed by the operators 
> themselves; instead, the planner should be responsible for inserting 
> appropriate format conversions when they are needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9143) Add planner rule for automatically inserting Unsafe <-> Safe row format converters

2015-07-17 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-9143:
-

 Summary: Add planner rule for automatically inserting Unsafe <-> 
Safe row format converters
 Key: SPARK-9143
 URL: https://issues.apache.org/jira/browse/SPARK-9143
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen


Now that we have two different internal row formats, UnsafeRow and the old 
Java-object-based row format, we end up having to perform conversions between 
these two formats. These conversions should not be performed by the operators 
themselves; instead, the planner should be responsible for inserting 
appropriate format conversions when they are needed.
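
A toy version of such a rule, with simplified stand-in types rather than 
Spark's actual SparkPlan and Rule classes: rewrite children bottom-up, then 
splice in a converter wherever a child's output format disagrees with what its 
parent consumes.

{code:scala}
sealed trait Plan { def outputsUnsafe: Boolean }
case class Op(name: String,
              children: Seq[Plan],
              outputsUnsafe: Boolean,
              wantsUnsafeInput: Boolean) extends Plan
case class ConvertToUnsafe(child: Plan) extends Plan { def outputsUnsafe = true }
case class ConvertToSafe(child: Plan) extends Plan { def outputsUnsafe = false }

// Insert converters between mismatched parents and children.
def ensureRowFormats(plan: Plan): Plan = plan match {
  case op @ Op(_, children, _, wantsUnsafe) =>
    op.copy(children = children.map(ensureRowFormats).map {
      case c if c.outputsUnsafe && !wantsUnsafe => ConvertToSafe(c)
      case c if !c.outputsUnsafe && wantsUnsafe => ConvertToUnsafe(c)
      case c => c
    })
  case leafOrConverter => leafOrConverter
}
{code}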



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9075) DecisionTreeMetadata - setting maxPossibleBins to numExamples is incorrect.

2015-07-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631931#comment-14631931
 ] 

Joseph K. Bradley commented on SPARK-9075:
--

I agree there are ways to deal with very high-arity categories, but I think 
dealing with them is lower priority than some other improvements (such as 
providing predicted class probabilities) that we're working on.  In general, 
one should throw out such a high-arity categorical feature when there are so 
few examples.

It's true the check does not ensure all values are covered; that would be good 
to refine in the future.

It sounds like we're discussing 3 possibilities, 2 short-term and 1 long-term:
* Run without exceptions no matter what is given.
** Short-term: Run as is.  This could mean giving meaningless results.
** Long-term: We should implement a better way to handle many categories.
* Short-term: Throw exception and notify user of the problem.  I prefer this 
for now, until we can do the long-term solution.

> DecisionTreeMetadata - setting maxPossibleBins to numExamples is incorrect. 
> 
>
> Key: SPARK-9075
> URL: https://issues.apache.org/jira/browse/SPARK-9075
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Les Selecky
>Priority: Minor
>
> In 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala
> there's a statement that sets maxPossibleBins to numExamples when 
> numExamples is less than strategy.maxBins.
> This can cause an error when training small partitions; the error is 
> triggered further down in the logic, where it's required that 
> maxCategoriesPerFeature be less than or equal to maxPossibleBins.
> Here's an example of how it manifested: the partition contained 49 rows 
> (i.e., numExamples=49) but strategy.maxBins was 57.
> The maxPossibleBins = math.min(strategy.maxBins, numExamples) logic therefore 
> reduced maxPossibleBins to 49, causing the require(maxCategoriesPerFeature <= 
> maxPossibleBins) check to throw an error.
> In short, this will be a problem when training small datasets with a feature 
> that contains more categories than numExamples.
> In our local testing we commented out the math.min(strategy.maxBins, 
> numExamples) line and the decision tree succeeded where it had failed 
> previously.
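
Condensed into a few lines, with the values from the report:

{code:scala}
// Illustrative values taken from the description above.
val maxBins = 57                  // strategy.maxBins
val numExamples = 49L             // rows in the small training partition
val maxCategoriesPerFeature = 57  // arity of the largest categorical feature

val maxPossibleBins = math.min(maxBins, numExamples.toInt)  // clamped down to 49
// This is the check that throws for small partitions:
require(maxCategoriesPerFeature <= maxPossibleBins,
  s"maxCategoriesPerFeature ($maxCategoriesPerFeature) must be <= maxPossibleBins ($maxPossibleBins)")
{code}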



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8668) expr function to convert SQL expression into a Column

2015-07-17 Thread Dan McClary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631909#comment-14631909
 ] 

Dan McClary commented on SPARK-8668:


So, if I understand, this would parse a string -- the same way selectExpr does 
-- and return a list of expressions?

> expr function to convert SQL expression into a Column
> -
>
> Key: SPARK-8668
> URL: https://issues.apache.org/jira/browse/SPARK-8668
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> selectExpr uses the expression parser to parse string expressions. It would 
> be great to create an "expr" function in functions.scala/functions.py that 
> converts a string into an expression (or a list of expressions separated by 
> commas).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9118) Implement integer array parameters for ml.param as IntArrayParam

2015-07-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631905#comment-14631905
 ] 

Apache Spark commented on SPARK-9118:
-

User 'rekhajoshm' has created a pull request for this issue:
https://github.com/apache/spark/pull/7481

> Implement integer array parameters for ml.param as IntArrayParam
> 
>
> Key: SPARK-9118
> URL: https://issues.apache.org/jira/browse/SPARK-9118
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
>Priority: Minor
> Fix For: 1.4.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> ml/param/params.scala lacks an integer array parameter, which is needed for 
> models such as the multilayer perceptron to specify the layer sizes. I suggest 
> implementing it as IntArrayParam, similarly to DoubleArrayParam.
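
A sketch of that suggestion, modeled directly on DoubleArrayParam (the merged 
version may differ in detail):

{code:scala}
import org.apache.spark.ml.param.{Param, ParamPair, Params, ParamValidators}

// Integer-array analogue of DoubleArrayParam.
class IntArrayParam(parent: Params, name: String, doc: String,
                    isValid: Array[Int] => Boolean)
  extends Param[Array[Int]](parent, name, doc, isValid) {

  def this(parent: Params, name: String, doc: String) =
    this(parent, name, doc, ParamValidators.alwaysTrue)

  override def w(value: Array[Int]): ParamPair[Array[Int]] = super.w(value)
}
{code}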



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9118) Implement integer array parameters for ml.param as IntArrayParam

2015-07-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9118:
---

Assignee: Apache Spark

> Implement integer array parameters for ml.param as IntArrayParam
> 
>
> Key: SPARK-9118
> URL: https://issues.apache.org/jira/browse/SPARK-9118
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 1.4.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> ml/param/params.scala lacks an integer array parameter, which is needed for 
> models such as the multilayer perceptron to specify the layer sizes. I suggest 
> implementing it as IntArrayParam, similarly to DoubleArrayParam.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7127) Broadcast spark.ml tree ensemble models for predict

2015-07-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-7127.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6300
[https://github.com/apache/spark/pull/6300]

> Broadcast spark.ml tree ensemble models for predict
> ---
>
> Key: SPARK-7127
> URL: https://issues.apache.org/jira/browse/SPARK-7127
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Assignee: Bryan Cutler
>Priority: Minor
> Fix For: 1.5.0
>
>
> GBTRegressor/Classifier and RandomForestRegressor/Classifier should broadcast 
> models and then predict.  This will mean overriding transform().
> Note: Try to reduce duplicated code via the TreeEnsembleModel abstraction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9118) Implement integer array parameters for ml.param as IntArrayParam

2015-07-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9118:
---

Assignee: (was: Apache Spark)

> Implement integer array parameters for ml.param as IntArrayParam
> 
>
> Key: SPARK-9118
> URL: https://issues.apache.org/jira/browse/SPARK-9118
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
>Priority: Minor
> Fix For: 1.4.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> ml/param/params.scala lacks an integer array parameter, which is needed for 
> models such as the multilayer perceptron to specify the layer sizes. I suggest 
> implementing it as IntArrayParam, similarly to DoubleArrayParam.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8792) Add Python API for PCA transformer

2015-07-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-8792.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7190
[https://github.com/apache/spark/pull/7190]

> Add Python API for PCA transformer
> --
>
> Key: SPARK-8792
> URL: https://issues.apache.org/jira/browse/SPARK-8792
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 1.5.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 1.5.0
>
>
> Add Python API for PCA transformer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9090) Fix definition of residual in LinearRegressionSummary

2015-07-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-9090.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7435
[https://github.com/apache/spark/pull/7435]

> Fix definition of residual in LinearRegressionSummary
> -
>
> Key: SPARK-9090
> URL: https://issues.apache.org/jira/browse/SPARK-9090
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Feynman Liang
>Assignee: Feynman Liang
>Priority: Trivial
> Fix For: 1.5.0
>
>
> Residual is defined as label - prediction 
> (https://en.wikipedia.org/wiki/Least_squares); we need to update 
> {{LinearRegressionSummary}} to be consistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5681) Calling graceful stop() immediately after start() on StreamingContext should not get stuck indefinitely

2015-07-17 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-5681:
-
Assignee: Shixiong Zhu

> Calling graceful stop() immediately after start() on StreamingContext should 
> not get stuck indefinitely
> ---
>
> Key: SPARK-5681
> URL: https://issues.apache.org/jira/browse/SPARK-5681
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Liang-Chi Hsieh
>Assignee: Shixiong Zhu
> Fix For: 1.5.0
>
>
> Sometimes the receiver will be registered with the tracker after ssc.stop is 
> called, especially when stop() is called immediately after start(), so the 
> receiver doesn't get the StopReceiver message from the tracker. In this case, 
> calling stop() in graceful mode gets stuck indefinitely.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5681) Calling graceful stop() immediately after start() on StreamingContext should not get stuck indefinitely

2015-07-17 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-5681.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

> Calling graceful stop() immediately after start() on StreamingContext should 
> not get stuck indefinitely
> ---
>
> Key: SPARK-5681
> URL: https://issues.apache.org/jira/browse/SPARK-5681
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Liang-Chi Hsieh
>Assignee: Shixiong Zhu
> Fix For: 1.5.0
>
>
> Sometimes the receiver will be registered with the tracker after ssc.stop is 
> called, especially when stop() is called immediately after start(), so the 
> receiver doesn't get the StopReceiver message from the tracker. In this case, 
> calling stop() in graceful mode gets stuck indefinitely.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9137) Unified label verification for Predictor

2015-07-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631890#comment-14631890
 ] 

Joseph K. Bradley commented on SPARK-9137:
--

More notes: Changing the title to be for Classifier only, since we don't really 
need to check labels for regression.
Also, just noting that this check should be as lightweight as possible, 
happening in a UDF or map so that it can be pipelined without causing an extra 
RDD action; a sketch follows the issue description below.

> Unified label verification for Predictor
> 
>
> Key: SPARK-9137
> URL: https://issues.apache.org/jira/browse/SPARK-9137
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> We should check that labels are valid before training models for ml predictors 
> such as LogisticRegression, NaiveBayes, etc. We can make this check in 
> extractLabeledPoints. Some models currently do this check during the training 
> step, and we need to unify them.
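
A sketch of the map-based check mentioned above, assuming the labels arrive as 
an RDD[LabeledPoint]:

{code:scala}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Lazy, per-record validation: it rides along with training and
// triggers no extra RDD action of its own.
def validateClassLabels(data: RDD[LabeledPoint], numClasses: Int): RDD[LabeledPoint] =
  data.map { lp =>
    require(lp.label >= 0 && lp.label < numClasses && lp.label == lp.label.toInt,
      s"Classifier was given an invalid label ${lp.label}; " +
        s"labels must be integers in [0, $numClasses).")
    lp
  }
{code}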



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9137) Unified label verification for Predictor

2015-07-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9137:
-
Assignee: Yanbo Liang

> Unified label verification for Predictor
> 
>
> Key: SPARK-9137
> URL: https://issues.apache.org/jira/browse/SPARK-9137
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> We should check that labels are valid before training models for ml predictors 
> such as LogisticRegression, NaiveBayes, etc. We can make this check in 
> extractLabeledPoints. Some models currently do this check during the training 
> step, and we need to unify them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9136) fix several bugs in DateTimeUtils.stringToTimestamp

2015-07-17 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-9136.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7473
[https://github.com/apache/spark/pull/7473]

> fix several bugs in DateTimeUtils.stringToTimestamp
> ---
>
> Key: SPARK-9136
> URL: https://issues.apache.org/jira/browse/SPARK-9136
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8600) Naive Bayes API for spark.ml Pipelines

2015-07-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-8600.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7284
[https://github.com/apache/spark/pull/7284]

> Naive Bayes API for spark.ml Pipelines
> --
>
> Key: SPARK-8600
> URL: https://issues.apache.org/jira/browse/SPARK-8600
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
> Fix For: 1.5.0
>
>
> Create a NaiveBayes API for the spark.ml Pipelines API. This should wrap the 
> existing NaiveBayes implementation under the spark.mllib package and keep the 
> parameter names consistent. The output columns could include both the 
> prediction and confidence scores.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9062) Change output type of Tokenizer to Array(String, true)

2015-07-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9062:
-
Shepherd: Joseph K. Bradley

> Change output type of Tokenizer to Array(String, true)
> --
>
> Key: SPARK-9062
> URL: https://issues.apache.org/jira/browse/SPARK-9062
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 1.5.0
>
>
> Currently the output type of Tokenizer is Array(String, false), which is not 
> compatible with Word2Vec and other transformers, since their input type is 
> Array(String, true). A Seq[String] in a udf will be treated as Array(String, 
> true) by default.
> I'm also thinking that for nullable columns, maybe the tokenizer should return 
> Array(null) for a null value in the input.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9062) Change output type of Tokenizer to Array(String, true)

2015-07-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9062:
-
Assignee: yuhao yang

> Change output type of Tokenizer to Array(String, true)
> --
>
> Key: SPARK-9062
> URL: https://issues.apache.org/jira/browse/SPARK-9062
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 1.5.0
>
>
> Currently the output type of Tokenizer is Array(String, false), which is not 
> compatible with Word2Vec and other transformers, since their input type is 
> Array(String, true). A Seq[String] in a udf will be treated as Array(String, 
> true) by default.
> I'm also thinking that for nullable columns, maybe the tokenizer should return 
> Array(null) for a null value in the input.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9062) Change output type of Tokenizer to Array(String, true)

2015-07-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-9062.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7414
[https://github.com/apache/spark/pull/7414]

> Change output type of Tokenizer to Array(String, true)
> --
>
> Key: SPARK-9062
> URL: https://issues.apache.org/jira/browse/SPARK-9062
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
> Fix For: 1.5.0
>
>
> Currently the output type of Tokenizer is Array(String, false), which is not 
> compatible with Word2Vec and other transformers, since their input type is 
> Array(String, true). A Seq[String] in a udf will be treated as Array(String, 
> true) by default.
> I'm also thinking that for nullable columns, maybe the tokenizer should return 
> Array(null) for a null value in the input.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9062) Change output type of Tokenizer to Array(String, true)

2015-07-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631865#comment-14631865
 ] 

Joseph K. Bradley commented on SPARK-9062:
--

I guess we will be forced to support nullable types.  Looking at Catalyst 
schema inference, it looks like the assumption of nullability is buried pretty 
deep.  I agree with you that Tokenizer (and any other transformers which use 
Array/Seq) will need to be changed to use nullable = true.

Thanks for looking into this!
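
The change under discussion, in isolation:

{code:scala}
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

// Today's declared output type vs. the proposed one. Declaring the element
// as nullable matches how Catalyst types a Seq[String] returned from a udf.
val current: DataType  = ArrayType(StringType, containsNull = false)
val proposed: DataType = ArrayType(StringType, containsNull = true)
{code}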

> Change output type of Tokenizer to Array(String, true)
> --
>
> Key: SPARK-9062
> URL: https://issues.apache.org/jira/browse/SPARK-9062
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> Currently the output type of Tokenizer is Array(String, false), which is not 
> compatible with Word2Vec and other transformers, since their input type is 
> Array(String, true). A Seq[String] in a udf will be treated as Array(String, 
> true) by default.
> I'm also thinking that for nullable columns, maybe the tokenizer should return 
> Array(null) for a null value in the input.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9024) Unsafe HashJoin

2015-07-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9024:
---

Assignee: Apache Spark

> Unsafe HashJoin
> ---
>
> Key: SPARK-9024
> URL: https://issues.apache.org/jira/browse/SPARK-9024
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> Create a version of BroadcastJoin that accepts UnsafeRow as inputs, and 
> outputs UnsafeRow as outputs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8999) Support non-temporal sequence in PrefixSpan

2015-07-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631857#comment-14631857
 ] 

Joseph K. Bradley commented on SPARK-8999:
--

I also wonder if that generalization to non-temporal sequences could be 
supported more easily within the Pipelines API, where we could start accepting 
generalized inputs without breaking public APIs.  In that case, this decision 
could be deferred.

> Support non-temporal sequence in PrefixSpan
> ---
>
> Key: SPARK-8999
> URL: https://issues.apache.org/jira/browse/SPARK-8999
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> In SPARK-6487, we assume that all items are ordered. However, we should 
> support non-temporal sequences in PrefixSpan. This should be done before 1.5 
> because it changes PrefixSpan APIs.
> We can use `Array[Array[Int]]` or follow SPMF to use `Array[Int]` with -1 
> marking itemset boundaries. The latter is more efficient for storage. If we 
> support a generic item type, we can use null.
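
To make the flattened encoding concrete:

{code:scala}
// The sequence <(1,2), (3)> in the two candidate encodings.
val nested: Array[Array[Int]] = Array(Array(1, 2), Array(3))

// SPMF-style flattening, with -1 as the itemset delimiter.
val flattened: Array[Int] = nested.flatMap(itemset => itemset :+ -1).dropRight(1)
// flattened == Array(1, 2, -1, 3)
{code}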



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9024) Unsafe HashJoin

2015-07-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9024:
---

Assignee: (was: Apache Spark)

> Unsafe HashJoin
> ---
>
> Key: SPARK-9024
> URL: https://issues.apache.org/jira/browse/SPARK-9024
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> Create a version of BroadcastJoin that accepts UnsafeRow as inputs, and 
> outputs UnsafeRow as outputs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9024) Unsafe HashJoin

2015-07-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631855#comment-14631855
 ] 

Apache Spark commented on SPARK-9024:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/7480

> Unsafe HashJoin
> ---
>
> Key: SPARK-9024
> URL: https://issues.apache.org/jira/browse/SPARK-9024
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> Create a version of BroadcastJoin that accepts UnsafeRow as inputs, and 
> outputs UnsafeRow as outputs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9024) Unsafe HashJoin

2015-07-17 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-9024:
--
Summary: Unsafe HashJoin  (was: UnsafeBroadcastJoin)

> Unsafe HashJoin
> ---
>
> Key: SPARK-9024
> URL: https://issues.apache.org/jira/browse/SPARK-9024
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> Create a version of BroadcastJoin that accepts UnsafeRow as inputs, and 
> outputs UnsafeRow as outputs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9142) Removing unnecessary self types in Catalyst

2015-07-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9142:
---

Assignee: Reynold Xin  (was: Apache Spark)

> Removing unnecessary self types in Catalyst
> ---
>
> Key: SPARK-9142
> URL: https://issues.apache.org/jira/browse/SPARK-9142
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> A small change, based on code review and offline discussion with [~dragos].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9142) Removing unnecessary self types in Catalyst

2015-07-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631841#comment-14631841
 ] 

Apache Spark commented on SPARK-9142:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/7479

> Removing unnecessary self types in Catalyst
> ---
>
> Key: SPARK-9142
> URL: https://issues.apache.org/jira/browse/SPARK-9142
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> A small change, based on code review and offline discussion with [~dragos].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9142) Removing unnecessary self types in Catalyst

2015-07-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9142:
---

Assignee: Apache Spark  (was: Reynold Xin)

> Removing unnecessary self types in Catalyst
> ---
>
> Key: SPARK-9142
> URL: https://issues.apache.org/jira/browse/SPARK-9142
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> A small change, based on code review and offline discussion with [~dragos].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9142) Removing unnecessary self types in Catalyst

2015-07-17 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-9142:
--

 Summary: Removing unnecessary self types in Catalyst
 Key: SPARK-9142
 URL: https://issues.apache.org/jira/browse/SPARK-9142
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


A small change, based on code review and offline discussion with [~dragos].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9141) DataFrame recomputed instead of using cached parent.

2015-07-17 Thread Nick Pritchard (JIRA)
Nick Pritchard created SPARK-9141:
-

 Summary: DataFrame recomputed instead of using cached parent.
 Key: SPARK-9141
 URL: https://issues.apache.org/jira/browse/SPARK-9141
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0, 1.4.1
Reporter: Nick Pritchard


As I understand, DataFrame.cache() is supposed to work the same as RDD.cache(), 
so that repeated operations on it will use the cached results and not recompute 
the entire lineage. However, it seems that some DataFrame operations (e.g. 
withColumn) change the underlying RDD lineage so that cache doesn't work as 
expected.

Below is a Scala example that demonstrates this. First, I define two UDFs that 
use println so that it is easy to see when they are being called. Next, I 
create a simple data frame with one row and two columns. Then I add a column, 
cache it, and call count() to force the computation. Lastly, I add another 
column, cache it, and call count().

I would have expected the last statement to only compute the last column, since 
everything else was cached. However, because withColumn() changes the lineage, 
the whole data frame is recomputed.

{code:scala}
// Example UDFs that println when called (run in spark-shell; sc and sqlContext in scope)
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._

val twice = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 }
val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3 }

// Initial dataset
val df1 = sc.parallelize(Seq(("a", 1))).toDF("name", "value")

// Add a column by applying the twice udf
val df2 = df1.withColumn("twice", twice($"value"))
df2.cache()
df2.count() // prints Computed: twice(1)

// Add a column by applying the triple udf
val df3 = df2.withColumn("triple", triple($"value"))
df3.cache()
df3.count() // prints Computed: twice(1)\nComputed: triple(1)
{code}

I found a workaround, which helped me understand what was going on behind the 
scenes, but it doesn't seem like an ideal solution. Basically, I convert to an 
RDD and then back to a DataFrame, which seems to freeze the lineage. The code 
below shows the workaround for creating the second data frame so that cache 
will work as expected.

{code:scala}
val df2 = {
  val tmp = df1.withColumn("twice", twice($"value"))
  sqlContext.createDataFrame(tmp.rdd, tmp.schema)
}
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8007) Support resolving virtual columns in DataFrames

2015-07-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8007:
---

Assignee: (was: Apache Spark)

> Support resolving virtual columns in DataFrames
> ---
>
> Key: SPARK-8007
> URL: https://issues.apache.org/jira/browse/SPARK-8007
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to 
> SparkPartitionID expression.
> A cool use case is to understand physical data skew:
> {code}
> df.groupBy("SPARK__PARTITION__ID").count()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames

2015-07-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631835#comment-14631835
 ] 

Apache Spark commented on SPARK-8007:
-

User 'JDrit' has created a pull request for this issue:
https://github.com/apache/spark/pull/7478

> Support resolving virtual columns in DataFrames
> ---
>
> Key: SPARK-8007
> URL: https://issues.apache.org/jira/browse/SPARK-8007
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to 
> SparkPartitionID expression.
> A cool use case is to understand physical data skew:
> {code}
> df.groupBy("SPARK__PARTITION__ID").count()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8007) Support resolving virtual columns in DataFrames

2015-07-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8007:
---

Assignee: Apache Spark

> Support resolving virtual columns in DataFrames
> ---
>
> Key: SPARK-8007
> URL: https://issues.apache.org/jira/browse/SPARK-8007
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to 
> SparkPartitionID expression.
> A cool use case is to understand physical data skew:
> {code}
> df.groupBy("SPARK__PARTITION__ID").count()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5269) BlockManager.dataDeserialize always creates a new serializer instance

2015-07-17 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631820#comment-14631820
 ] 

Matt Cheah commented on SPARK-5269:
---

Sweet - I'm actually working with someone else on this, but assigning it to me 
is fine. I expect that using the Kryo resource pool will provide a fairly 
elegant solution.

> BlockManager.dataDeserialize always creates a new serializer instance
> -
>
> Key: SPARK-5269
> URL: https://issues.apache.org/jira/browse/SPARK-5269
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Ivan Vergiliev
>Assignee: Matt Cheah
>  Labels: performance, serializers
>
> BlockManager.dataDeserialize always creates a new instance of the serializer, 
> which is pretty slow in some cases. I'm using Kryo serialization and have a 
> custom registrator, and its register method is showing up as taking about 15% 
> of the execution time in my profiles. This started happening after I 
> increased the number of keys in a job with a shuffle phase by a factor of 40.
> One solution I can think of is to create a ThreadLocal SerializerInstance for 
> the defaultSerializer, and only create a new one if a custom serializer is 
> passed in. AFAICT a custom serializer is passed only from 
> DiskStore.getValues, and that, on the other hand, depends on the serializer 
> passed to ExternalSorter. I don't know how often this is used, but I think 
> this can still be a good solution for the standard use case.
> Oh, and also - ExternalSorter already has a SerializerInstance, so if the 
> getValues method is called from a single thread, maybe we can pass that 
> directly?
> I'd be happy to try a patch but would probably need a confirmation from 
> someone that this approach would indeed work (or an idea for another).
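
A minimal sketch of the ThreadLocal idea described above, written as if the 
members lived inside BlockManager; the names and signatures here are 
illustrative, not the actual Spark code:

{code}
import java.nio.ByteBuffer
import org.apache.spark.serializer.{Serializer, SerializerInstance}
import org.apache.spark.util.ByteBufferInputStream

// One cached instance per thread for the default serializer; a custom
// serializer still gets a fresh instance, as the description suggests.
private val cachedDefaultInstance = new ThreadLocal[SerializerInstance] {
  override def initialValue(): SerializerInstance = defaultSerializer.newInstance()
}

def dataDeserialize(
    bytes: ByteBuffer,
    serializer: Serializer = defaultSerializer): Iterator[Any] = {
  val instance =
    if (serializer eq defaultSerializer) cachedDefaultInstance.get()
    else serializer.newInstance()  // fall back to a fresh instance
  instance.deserializeStream(new ByteBufferInputStream(bytes)).asIterator
}
{code}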



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9140) Replace TimeTracker by Stopwatch

2015-07-17 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-9140:


 Summary: Replace TimeTracker by Stopwatch
 Key: SPARK-9140
 URL: https://issues.apache.org/jira/browse/SPARK-9140
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Priority: Minor


We can replace TimeTracker in the tree implementations with Stopwatch. The 
initial PR could use local stopwatches only.
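
For reference, a local stopwatch in the spirit of this proposal might look like 
the following; this is illustrative only, not the actual 
org.apache.spark.ml.util.Stopwatch API:

{code}
// Illustrative local stopwatch: accumulates wall-clock time across calls.
class LocalStopwatch(val name: String) {
  private var elapsedNanos = 0L
  def time[T](body: => T): T = {
    val start = System.nanoTime()
    try { body } finally { elapsedNanos += System.nanoTime() - start }
  }
  def elapsedMillis: Long = elapsedNanos / 1000000L
}

// Usage: wrap a phase of tree training to measure it.
// val sw = new LocalStopwatch("findSplits")
// val splits = sw.time { /* ... expensive phase ... */ }
{code}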



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9138) Vectors.dense() in Python should accept numbers directly

2015-07-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-9138.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7476
[https://github.com/apache/spark/pull/7476]

> Vectors.dense() in Python should accept numbers directly
> 
>
> Key: SPARK-9138
> URL: https://issues.apache.org/jira/browse/SPARK-9138
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.5.0
>
>
> We already use this feature in doctests
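
An illustrative doctest of the behavior this fix enables, assuming 
pyspark.mllib.linalg:

{code}
>>> from pyspark.mllib.linalg import Vectors
>>> Vectors.dense(1.0, 2.0, 3.0)    # numbers passed directly
DenseVector([1.0, 2.0, 3.0])
>>> Vectors.dense([1.0, 2.0, 3.0])  # list form still works
DenseVector([1.0, 2.0, 3.0])
{code}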



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9139) Add backwards-compatibility tests for DataType.fromJson()

2015-07-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-9139:
--
Description: 
SQL's DataType.fromJson is a public API and thus must be backwards-compatible; 
there are also backwards-compatibility concerns related to persistence of 
DataType JSON in metastores.

Unfortunately, we do not have any backwards-compatibility tests which attempt 
to read old JSON values that were written by earlier versions of Spark.  
DataTypeSuite has "roundtrip" tests that test fromJson(toJson(foo)), but this 
doesn't ensure compatibility.

I think that we should address this by capturing the JSON strings produced in 
Spark 1.3's DataFrameSuite and adding test cases that try to create DataTypes 
from those strings.

This might be a good starter task for someone who wants to contribute to SQL 
tests.

  was:
SQL's DataType.fromJson is a public API and thus must be backwards-compatible; 
there are also backwards-compatibility concerns related to persistence of 
DataType JSON in metastores.

Unfortunately, we do not have any backwards-compatibility tests which attempt 
to read old JSON values that were written by earlier versions of Spark.  
DataTypeSuite has "roundtrip" tests that test fromJson(toJson(x)), but this 
doesn't ensure compatibility.

I think that we should address this by capturing the JSON strings produced in 
Spark 1.3's DataFrameSuite and adding test cases that try to create DataTypes 
from those strings.

This might be a good starter task for someone who wants to contribute to SQL 
tests.


> Add backwards-compatibility tests for DataType.fromJson()
> -
>
> Key: SPARK-9139
> URL: https://issues.apache.org/jira/browse/SPARK-9139
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Josh Rosen
>Priority: Critical
>
> SQL's DataType.fromJson is a public API and thus must be 
> backwards-compatible; there are also backwards-compatibility concerns related 
> to persistence of DataType JSON in metastores.
> Unfortunately, we do not have any backwards-compatibility tests which attempt 
> to read old JSON values that were written by earlier versions of Spark.  
> DataTypeSuite has "roundtrip" tests that test fromJson(toJson(foo)), but this 
> doesn't ensure compatibility.
> I think that we should address this by capturing the JSON strings produced in 
> Spark 1.3's DataFrameSuite and adding test cases that try to create DataTypes 
> from those strings.
> This might be a good starter task for someone who wants to contribute to SQL 
> tests.
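
A hedged sketch of what such a test could look like; the captured JSON literal 
below is illustrative, not an actual string recorded from Spark 1.3:

{code}
import org.apache.spark.sql.types._

// Parse a JSON string as an older Spark release would have written it and
// check that the current fromJson still reconstructs the expected DataType.
val capturedJson =
  """{"type":"struct","fields":[{"name":"id","type":"integer","nullable":false,"metadata":{}}]}"""
assert(DataType.fromJson(capturedJson) ==
  StructType(Seq(StructField("id", IntegerType, nullable = false))))
{code}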



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9139) Add backwards-compatibility tests for DataType.fromJson()

2015-07-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-9139:
--
Component/s: SQL

> Add backwards-compatibility tests for DataType.fromJson()
> -
>
> Key: SPARK-9139
> URL: https://issues.apache.org/jira/browse/SPARK-9139
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Josh Rosen
>Priority: Critical
>
> SQL's DataType.fromJson is a public API and thus must be 
> backwards-compatible; there are also backwards-compatibility concerns related 
> to persistence of DataType JSON in metastores.
> Unfortunately, we do not have any backwards-compatibility tests which attempt 
> to read old JSON values that were written by earlier versions of Spark.  
> DataTypeSuite has "roundtrip" tests that test fromJson(toJson(foo)), but this 
> doesn't ensure compatibility.
> I think that we should address this by capturing the JSON strings produced in 
> Spark 1.3's DataFrameSuite and adding test cases that try to create DataTypes 
> from those strings.
> This might be a good starter task for someone who wants to contribute to SQL 
> tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9139) Add backwards-compatibility tests for DataType.fromJson()

2015-07-17 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-9139:
-

 Summary: Add backwards-compatibility tests for DataType.fromJson()
 Key: SPARK-9139
 URL: https://issues.apache.org/jira/browse/SPARK-9139
 Project: Spark
  Issue Type: Test
Reporter: Josh Rosen
Priority: Critical


SQL's DataType.fromJson is a public API and thus must be backwards-compatible; 
there are also backwards-compatibility concerns related to persistence of 
DataType JSON in metastores.

Unfortunately, we do not have any backwards-compatibility tests which attempt 
to read old JSON values that were written by earlier versions of Spark.  
DataTypeSuite has "roundtrip" tests that test fromJson(toJson(x)), but this 
doesn't ensure compatibility.

I think that we should address this by capturing the JSON strings produced in 
Spark 1.3's DataFrameSuite and adding test cases that try to create DataTypes 
from those strings.

This might be a good starter task for someone who wants to contribute to SQL 
tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


