[jira] [Updated] (SPARK-6166) Limit number of in flight outbound requests for shuffle fetch

2016-02-11 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-6166:

Description: 
spark.reducer.maxMbInFlight puts a bound on the in-flight data in terms of size.
But this is not always sufficient: when the number of hosts in the cluster 
increases, this can lead to a very large number of inbound connections to one 
or more nodes, causing workers to fail under the load.

I propose we also add a spark.reducer.maxReqsInFlight, which puts a bound on 
the number of outstanding outbound requests.
This might still cause hotspots in the cluster, but in our tests this has 
significantly reduced the occurrence of worker failures.
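
For illustration only, a minimal sketch of how the proposed cap could be set next 
to the existing size-based bound; the values below are arbitrary examples, not 
recommended defaults:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: bound both the amount of in-flight shuffle data and the number of
// outstanding fetch requests per reducer, using the names from this ticket.
val conf = new SparkConf()
  .setAppName("ShuffleFetchThrottlingExample")
  .set("spark.reducer.maxMbInFlight", "48")    // existing size-based bound (MB)
  .set("spark.reducer.maxReqsInFlight", "64")  // proposed bound on outstanding requests
val sc = new SparkContext(conf)
{code}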

  was:

spark.reducer.maxMbInFlight puts a bound on the in-flight data in terms of size.
But this is not always sufficient: when the number of hosts in the cluster 
increases, this can lead to a very large number of inbound connections to one 
or more nodes, causing workers to fail under the load.

I propose we also add a spark.reducer.maxReqsInFlight, which puts a bound on 
the number of outstanding outbound connections.
This might still cause hotspots in the cluster, but in our tests this has 
significantly reduced the occurrence of worker failures.


> Limit number of in flight outbound requests for shuffle fetch
> -
>
> Key: SPARK-6166
> URL: https://issues.apache.org/jira/browse/SPARK-6166
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Mridul Muralidharan
>Priority: Minor
> Fix For: 2.0.0
>
>
> spark.reducer.maxMbInFlight puts a bound on the in-flight data in terms of 
> size.
> But this is not always sufficient: when the number of hosts in the cluster 
> increases, this can lead to a very large number of inbound connections to one 
> or more nodes, causing workers to fail under the load.
> I propose we also add a spark.reducer.maxReqsInFlight, which puts a bound on 
> the number of outstanding outbound requests.
> This might still cause hotspots in the cluster, but in our tests this has 
> significantly reduced the occurrence of worker failures.






[jira] [Commented] (SPARK-6166) Limit number of in flight outbound requests for shuffle fetch

2016-02-11 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144148#comment-15144148
 ] 

Shixiong Zhu commented on SPARK-6166:
-

Sure. Done

> Limit number of in flight outbound requests for shuffle fetch
> -
>
> Key: SPARK-6166
> URL: https://issues.apache.org/jira/browse/SPARK-6166
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Mridul Muralidharan
>Priority: Minor
> Fix For: 2.0.0
>
>
> spark.reducer.maxMbInFlight puts a bound on the in-flight data in terms of 
> size.
> But this is not always sufficient: when the number of hosts in the cluster 
> increases, this can lead to a very large number of inbound connections to one 
> or more nodes, causing workers to fail under the load.
> I propose we also add a spark.reducer.maxReqsInFlight, which puts a bound on 
> the number of outstanding outbound requests.
> This might still cause hotspots in the cluster, but in our tests this has 
> significantly reduced the occurrence of worker failures.






[jira] [Commented] (SPARK-6166) Limit number of in flight outbound requests for shuffle fetch

2016-02-11 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144147#comment-15144147
 ] 

Reynold Xin commented on SPARK-6166:


Thanks!


> Limit number of in flight outbound requests for shuffle fetch
> -
>
> Key: SPARK-6166
> URL: https://issues.apache.org/jira/browse/SPARK-6166
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Mridul Muralidharan
>Priority: Minor
> Fix For: 2.0.0
>
>
> spark.reducer.maxMbInFlight puts a bound on the in-flight data in terms of 
> size.
> But this is not always sufficient: when the number of hosts in the cluster 
> increases, this can lead to a very large number of inbound connections to one 
> or more nodes, causing workers to fail under the load.
> I propose we also add a spark.reducer.maxReqsInFlight, which puts a bound on 
> the number of outstanding outbound requests.
> This might still cause hotspots in the cluster, but in our tests this has 
> significantly reduced the occurrence of worker failures.






[jira] [Updated] (SPARK-6166) Limit number of in flight outbound requests for shuffle fetch

2016-02-11 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-6166:

Summary: Limit number of in flight outbound requests for shuffle fetch  
(was: Add config to limit number of concurrent outbound connections for shuffle 
fetch)

> Limit number of in flight outbound requests for shuffle fetch
> -
>
> Key: SPARK-6166
> URL: https://issues.apache.org/jira/browse/SPARK-6166
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Mridul Muralidharan
>Priority: Minor
> Fix For: 2.0.0
>
>
> spark.reducer.maxMbInFlight puts a bound on the in-flight data in terms of 
> size.
> But this is not always sufficient: when the number of hosts in the cluster 
> increases, this can lead to a very large number of inbound connections to one 
> or more nodes, causing workers to fail under the load.
> I propose we also add a spark.reducer.maxReqsInFlight, which puts a bound on 
> the number of outstanding outbound connections.
> This might still cause hotspots in the cluster, but in our tests this has 
> significantly reduced the occurrence of worker failures.






[jira] [Commented] (SPARK-6166) Add config to limit number of concurrent outbound connections for shuffle fetch

2016-02-11 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144146#comment-15144146
 ] 

Reynold Xin commented on SPARK-6166:


[~zsxwing] can you update the title of this ticket to something that reflects 
what's been resolved? Thanks.


> Add config to limit number of concurrent outbound connections for shuffle 
> fetch
> ---
>
> Key: SPARK-6166
> URL: https://issues.apache.org/jira/browse/SPARK-6166
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Mridul Muralidharan
>Priority: Minor
> Fix For: 2.0.0
>
>
> spark.reducer.maxMbInFlight puts a bound on the in-flight data in terms of 
> size.
> But this is not always sufficient: when the number of hosts in the cluster 
> increases, this can lead to a very large number of inbound connections to one 
> or more nodes, causing workers to fail under the load.
> I propose we also add a spark.reducer.maxReqsInFlight, which puts a bound on 
> the number of outstanding outbound connections.
> This might still cause hotspots in the cluster, but in our tests this has 
> significantly reduced the occurrence of worker failures.






[jira] [Resolved] (SPARK-6166) Add config to limit number of concurrent outbound connections for shuffle fetch

2016-02-11 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-6166.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Add config to limit number of concurrent outbound connections for shuffle 
> fetch
> ---
>
> Key: SPARK-6166
> URL: https://issues.apache.org/jira/browse/SPARK-6166
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Mridul Muralidharan
>Priority: Minor
> Fix For: 2.0.0
>
>
> spark.reducer.maxMbInFlight puts a bound on the in-flight data in terms of 
> size.
> But this is not always sufficient: when the number of hosts in the cluster 
> increases, this can lead to a very large number of inbound connections to one 
> or more nodes, causing workers to fail under the load.
> I propose we also add a spark.reducer.maxReqsInFlight, which puts a bound on 
> the number of outstanding outbound connections.
> This might still cause hotspots in the cluster, but in our tests this has 
> significantly reduced the occurrence of worker failures.






[jira] [Updated] (SPARK-13012) Replace example code in ml-guide.md using include_example

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13012:
--
Assignee: Devaraj K

> Replace example code in ml-guide.md using include_example
> -
>
> Key: SPARK-13012
> URL: https://issues.apache.org/jira/browse/SPARK-13012
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Devaraj K
>Priority: Minor
>  Labels: starter
>
> See examples in other finished sub-JIRAs.






[jira] [Updated] (SPARK-13012) Replace example code in ml-guide.md using include_example

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13012:
--
Shepherd: Xusen Yin

> Replace example code in ml-guide.md using include_example
> -
>
> Key: SPARK-13012
> URL: https://issues.apache.org/jira/browse/SPARK-13012
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Devaraj K
>Priority: Minor
>  Labels: starter
>
> See examples in other finished sub-JIRAs.






[jira] [Updated] (SPARK-13019) Replace example code in mllib-statistics.md using include_example

2016-02-11 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren updated SPARK-13019:

Description: 
The example code in the user guide is embedded in the markdown and hence it is 
not easy to test. It would be nice to test it automatically. This JIRA is to 
discuss options to automate example code testing and see what we can do in 
Spark 1.6.

The goal is to move the actual example code to spark/examples and test 
compilation in Jenkins builds. Then, in the markdown, we can reference part of 
the code to show in the user guide. This requires adding a Jekyll tag similar to 
https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., 
called include_example.
{code}{% include_example 
scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}{code}
Jekyll will find 
`examples/src/main/scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala`
and pick the code blocks marked "example" to replace the code block in 
{code}{% highlight %}{code}
in the markdown. 

See more sub-tasks in parent ticket: 
https://issues.apache.org/jira/browse/SPARK-11337
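
As an illustration of the mechanism described above, a sketch of how an example 
file under spark/examples might be annotated so that only the marked block is 
pulled into the guide; the "$example on$/$example off$" markers are assumed here 
as the tagging convention:
{code}
package org.apache.spark.examples.mllib

import org.apache.spark.{SparkConf, SparkContext}
// $example on$
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
// $example off$

object SummaryStatisticsExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SummaryStatisticsExample"))
    // $example on$
    // Compute column-wise summary statistics for a small RDD of vectors.
    val observations = sc.parallelize(Seq(
      Vectors.dense(1.0, 10.0, 100.0),
      Vectors.dense(2.0, 20.0, 200.0),
      Vectors.dense(3.0, 30.0, 300.0)))
    val summary = Statistics.colStats(observations)
    println(summary.mean)      // per-column mean
    println(summary.variance)  // per-column variance
    // $example off$
    sc.stop()
  }
}
{code}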

  was:See examples in other finished sub-JIRAs.


> Replace example code in mllib-statistics.md using include_example
> -
>
> Key: SPARK-13019
> URL: https://issues.apache.org/jira/browse/SPARK-13019
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to test it automatically. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> The goal is to move the actual example code to spark/examples and test 
> compilation in Jenkins builds. Then, in the markdown, we can reference part of 
> the code to show in the user guide. This requires adding a Jekyll tag similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}{% include_example 
> scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}{code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala`
> and pick the code blocks marked "example" to replace the code block in 
> {code}{% highlight %}{code}
> in the markdown. 
> See more sub-tasks in parent ticket: 
> https://issues.apache.org/jira/browse/SPARK-11337






[jira] [Updated] (SPARK-13018) Replace example code in mllib-pmml-model-export.md using include_example

2016-02-11 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren updated SPARK-13018:

Description: 
The example code in the user guide is embedded in the markdown and hence it is 
not easy to test. It would be nice to test it automatically. This JIRA is to 
discuss options to automate example code testing and see what we can do in 
Spark 1.6.

The goal is to move the actual example code to spark/examples and test 
compilation in Jenkins builds. Then, in the markdown, we can reference part of 
the code to show in the user guide. This requires adding a Jekyll tag similar to 
https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., 
called include_example.
{code}{% include_example 
scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala %}{code}
Jekyll will find 
`examples/src/main/scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala`
and pick the code blocks marked "example" to replace the code block in 
{code}{% highlight %}{code}
in the markdown. 

See more sub-tasks in parent ticket: 
https://issues.apache.org/jira/browse/SPARK-11337

  was:See examples in other finished sub-JIRAs.


> Replace example code in mllib-pmml-model-export.md using include_example
> 
>
> Key: SPARK-13018
> URL: https://issues.apache.org/jira/browse/SPARK-13018
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to test it automatically. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> The goal is to move the actual example code to spark/examples and test 
> compilation in Jenkins builds. Then, in the markdown, we can reference part of 
> the code to show in the user guide. This requires adding a Jekyll tag similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}{% include_example 
> scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala %}{code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala`
> and pick the code blocks marked "example" to replace the code block in 
> {code}{% highlight %}{code}
> in the markdown. 
> See more sub-tasks in parent ticket: 
> https://issues.apache.org/jira/browse/SPARK-11337






[jira] [Updated] (SPARK-13017) Replace example code in mllib-feature-extraction.md using include_example

2016-02-11 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren updated SPARK-13017:

Description: 
The example code in the user guide is embedded in the markdown and hence it is 
not easy to test. It would be nice to test it automatically. This JIRA is to 
discuss options to automate example code testing and see what we can do in 
Spark 1.6.

The goal is to move the actual example code to spark/examples and test 
compilation in Jenkins builds. Then, in the markdown, we can reference part of 
the code to show in the user guide. This requires adding a Jekyll tag similar to 
https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., 
called include_example.
{code}{% include_example 
scala/org/apache/spark/examples/mllib/TFIDFExample.scala %}{code}
Jekyll will find 
`examples/src/main/scala/org/apache/spark/examples/mllib/TFIDFExample.scala` 
and pick the code blocks marked "example" to replace the code block in 
{code}{% highlight %}{code}
in the markdown. 

See more sub-tasks in parent ticket: 
https://issues.apache.org/jira/browse/SPARK-11337

  was:See examples in other finished sub-JIRAs.


> Replace example code in mllib-feature-extraction.md using include_example
> -
>
> Key: SPARK-13017
> URL: https://issues.apache.org/jira/browse/SPARK-13017
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to test it automatically. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> The goal is to move the actual example code to spark/examples and test 
> compilation in Jenkins builds. Then, in the markdown, we can reference part of 
> the code to show in the user guide. This requires adding a Jekyll tag similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}{% include_example 
> scala/org/apache/spark/examples/mllib/TFIDFExample.scala %}{code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/mllib/TFIDFExample.scala` 
> and pick the code blocks marked "example" to replace the code block in 
> {code}{% highlight %}{code}
> in the markdown. 
> See more sub-tasks in parent ticket: 
> https://issues.apache.org/jira/browse/SPARK-11337






[jira] [Updated] (SPARK-13015) Replace example code in mllib-data-types.md using include_example

2016-02-11 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren updated SPARK-13015:

Description: 
The example code in the user guide is embedded in the markdown and hence it is 
not easy to test. It would be nice to test it automatically. This JIRA is to 
discuss options to automate example code testing and see what we can do in 
Spark 1.6.

The goal is to move the actual example code to spark/examples and test 
compilation in Jenkins builds. Then, in the markdown, we can reference part of 
the code to show in the user guide. This requires adding a Jekyll tag similar to 
https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., 
called include_example.
{code}{% include_example 
scala/org/apache/spark/examples/mllib/LocalVectorExample.scala %}{code}
Jekyll will find 
`examples/src/main/scala/org/apache/spark/examples/mllib/LocalVectorExample.scala`
and pick the code blocks marked "example" to replace the code block in 
{code}{% highlight %}{code}
in the markdown. 

See more sub-tasks in parent ticket: 
https://issues.apache.org/jira/browse/SPARK-11337

  was:See examples in other finished sub-JIRAs.


> Replace example code in mllib-data-types.md using include_example
> -
>
> Key: SPARK-13015
> URL: https://issues.apache.org/jira/browse/SPARK-13015
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to test it automatically. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> The goal is to move the actual example code to spark/examples and test 
> compilation in Jenkins builds. Then, in the markdown, we can reference part of 
> the code to show in the user guide. This requires adding a Jekyll tag similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}{% include_example 
> scala/org/apache/spark/examples/mllib/LocalVectorExample.scala %}{code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/mllib/LocalVectorExample.scala`
> and pick the code blocks marked "example" to replace the code block in 
> {code}{% highlight %}{code}
> in the markdown. 
> See more sub-tasks in parent ticket: 
> https://issues.apache.org/jira/browse/SPARK-11337






[jira] [Updated] (SPARK-13014) Replace example code in mllib-collaborative-filtering.md using include_example

2016-02-11 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren updated SPARK-13014:

Description: 
The example code in the user guide is embedded in the markdown and hence it is 
not easy to test. It would be nice to test it automatically. This JIRA is to 
discuss options to automate example code testing and see what we can do in 
Spark 1.6.

The goal is to move the actual example code to spark/examples and test 
compilation in Jenkins builds. Then, in the markdown, we can reference part of 
the code to show in the user guide. This requires adding a Jekyll tag similar to 
https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., 
called include_example.
{code}{% include_example 
scala/org/apache/spark/examples/mllib/RecommendationExample.scala %}{code}
Jekyll will find 
`examples/src/main/scala/org/apache/spark/examples/mllib/RecommendationExample.scala`
and pick the code blocks marked "example" to replace the code block in 
{code}{% highlight %}{code}
in the markdown. 

See more sub-tasks in parent ticket: 
https://issues.apache.org/jira/browse/SPARK-11337

  was:See examples in other finished sub-JIRAs.


> Replace example code in mllib-collaborative-filtering.md using include_example
> --
>
> Key: SPARK-13014
> URL: https://issues.apache.org/jira/browse/SPARK-13014
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to test it automatically. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> The goal is to move the actual example code to spark/examples and test 
> compilation in Jenkins builds. Then, in the markdown, we can reference part of 
> the code to show in the user guide. This requires adding a Jekyll tag similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}{% include_example 
> scala/org/apache/spark/examples/mllib/RecommendationExample.scala %}{code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/mllib/RecommendationExample.scala`
> and pick the code blocks marked "example" to replace the code block in 
> {code}{% highlight %}{code}
> in the markdown. 
> See more sub-tasks in parent ticket: 
> https://issues.apache.org/jira/browse/SPARK-11337






[jira] [Updated] (SPARK-13013) Replace example code in mllib-clustering.md using include_example

2016-02-11 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren updated SPARK-13013:

Description: 
The example code in the user guide is embedded in the markdown and hence it is 
not easy to test. It would be nice to test it automatically. This JIRA is to 
discuss options to automate example code testing and see what we can do in 
Spark 1.6.

The goal is to move the actual example code to spark/examples and test 
compilation in Jenkins builds. Then, in the markdown, we can reference part of 
the code to show in the user guide. This requires adding a Jekyll tag similar to 
https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., 
called include_example.
{code}{% include_example 
scala/org/apache/spark/examples/mllib/KMeansExample.scala %}{code}
Jekyll will find 
`examples/src/main/scala/org/apache/spark/examples/mllib/KMeansExample.scala` 
and pick the code blocks marked "example" to replace the code block in 
{code}{% highlight %}{code}
in the markdown. 

See more sub-tasks in parent ticket: 
https://issues.apache.org/jira/browse/SPARK-11337

  was:
The example code in the user guide is embedded in the markdown and hence it is 
not easy to test. It would be nice to test it automatically. This JIRA is to 
discuss options to automate example code testing and see what we can do in 
Spark 1.6.

The goal is to move the actual example code to spark/examples and test 
compilation in Jenkins builds. Then, in the markdown, we can reference part of 
the code to show in the user guide. This requires adding a Jekyll tag similar to 
https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., 
called include_example.
{code}{% include_example scala ml.KMeansExample guide %}{code}
Jekyll will find 
`examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` and 
pick the code blocks marked "example" and put them under 
{code}{% highlight %}{code}
in the markdown. 

See more sub-tasks in parent ticket: 
https://issues.apache.org/jira/browse/SPARK-11337


> Replace example code in mllib-clustering.md using include_example
> -
>
> Key: SPARK-13013
> URL: https://issues.apache.org/jira/browse/SPARK-13013
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to test it automatically. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> The goal is to move the actual example code to spark/examples and test 
> compilation in Jenkins builds. Then, in the markdown, we can reference part of 
> the code to show in the user guide. This requires adding a Jekyll tag similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}{% include_example 
> scala/org/apache/spark/examples/mllib/KMeansExample.scala %}{code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/mllib/KMeansExample.scala` 
> and pick the code blocks marked "example" to replace the code block in 
> {code}{% highlight %}{code}
> in the markdown. 
> See more sub-tasks in parent ticket: 
> https://issues.apache.org/jira/browse/SPARK-11337






[jira] [Updated] (SPARK-13013) Replace example code in mllib-clustering.md using include_example

2016-02-11 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren updated SPARK-13013:

Description: 
The example code in the user guide is embedded in the markdown and hence it is 
not easy to test. It would be nice to test it automatically. This JIRA is to 
discuss options to automate example code testing and see what we can do in 
Spark 1.6.

The goal is to move the actual example code to spark/examples and test 
compilation in Jenkins builds. Then, in the markdown, we can reference part of 
the code to show in the user guide. This requires adding a Jekyll tag similar to 
https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., 
called include_example.
{code}{% include_example scala ml.KMeansExample guide %}{code}
Jekyll will find 
`examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` and 
pick the code blocks marked "example" and put them under 
{code}{% highlight %}{code}
in the markdown. 

See more sub-tasks in parent ticket: 
https://issues.apache.org/jira/browse/SPARK-11337

  was:See examples in other finished sub-JIRAs.


> Replace example code in mllib-clustering.md using include_example
> -
>
> Key: SPARK-13013
> URL: https://issues.apache.org/jira/browse/SPARK-13013
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to test it automatically. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> The goal is to move the actual example code to spark/examples and test 
> compilation in Jenkins builds. Then, in the markdown, we can reference part of 
> the code to show in the user guide. This requires adding a Jekyll tag similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}{% include_example scala ml.KMeansExample guide %}{code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick the code blocks marked "example" and put them under 
> {code}{% highlight %}{code}
> in the markdown. 
> See more sub-tasks in parent ticket: 
> https://issues.apache.org/jira/browse/SPARK-11337






[jira] [Updated] (SPARK-13295) ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new instances of arrays/vectors for each record

2016-02-11 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-13295:
--
Description: 
As also mentioned/marked by a TODO in AFTAggregator.add(data: AFTPoint), a new 
array is created for the intercept value and is concatenated with another array 
which contains the betas; the resulting Array is converted into a dense vector, 
which in turn is converted into a Breeze vector. 
This is expensive and not necessarily beautiful.
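
For illustration only, a simplified sketch (hypothetical names, not the actual 
AFTAggregator code) of the per-record allocation pattern described above and one 
way to avoid it by building the combined coefficient vector once:
{code}
import breeze.linalg.{DenseVector => BDV}

class AFTAggregatorSketch(betas: Array[Double], intercept: Double) {

  // Pattern described in this ticket: every add() call allocates a new array
  // (intercept ++ betas), then wraps it in a fresh Breeze vector.
  def addAllocating(features: Array[Double]): Unit = {
    val coefficients = BDV(Array(intercept) ++ betas)  // new allocations per record
    // ... use coefficients and features to update the gradient/loss ...
  }

  // Proposed direction: build the combined vector once and reuse it per record.
  private val coefficients = BDV(Array(intercept) ++ betas)

  def addReusing(features: Array[Double]): Unit = {
    // ... use the preallocated coefficients with features; no per-record allocation ...
  }
}
{code}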



  was:
As also mentioned/marked by TODO in AFTAggregator.AFTAggregator.add(data: 
AFTPoint) a new array is being created for intercept value and it is being 
concatenated
with another array whith contains the betas, the resulted Array is being 
converted into a Dense vector which in it's turn is being converted into breeze 
vector. 
This is expensive and not necessarily beautiful.




> ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new 
> instances of arrays/vectors for each record
> ---
>
> Key: SPARK-13295
> URL: https://issues.apache.org/jira/browse/SPARK-13295
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Narine Kokhlikyan
>
> As also mentioned/marked by a TODO in AFTAggregator.add(data: AFTPoint), a new 
> array is created for the intercept value and is concatenated with another 
> array which contains the betas; the resulting Array is converted into a dense 
> vector, which in turn is converted into a Breeze vector. 
> This is expensive and not necessarily beautiful.






[jira] [Assigned] (SPARK-13295) ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new instances of arrays/vectors for each record

2016-02-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13295:


Assignee: Apache Spark

> ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new 
> instances of arrays/vectors for each record
> ---
>
> Key: SPARK-13295
> URL: https://issues.apache.org/jira/browse/SPARK-13295
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Narine Kokhlikyan
>Assignee: Apache Spark
>
> As also mentioned/marked by a TODO in AFTAggregator.add(data: AFTPoint), a new 
> array is created for the intercept value and is concatenated with another 
> array which contains the betas; the resulting Array is converted into a dense 
> vector, which in turn is converted into a Breeze vector. 
> This is expensive and not necessarily beautiful.






[jira] [Assigned] (SPARK-13295) ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new instances of arrays/vectors for each record

2016-02-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13295:


Assignee: (was: Apache Spark)

> ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new 
> instances of arrays/vectors for each record
> ---
>
> Key: SPARK-13295
> URL: https://issues.apache.org/jira/browse/SPARK-13295
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Narine Kokhlikyan
>
> As also mentioned/marked by a TODO in AFTAggregator.add(data: AFTPoint), a new 
> array is created for the intercept value and is concatenated with another 
> array which contains the betas; the resulting Array is converted into a dense 
> vector, which in turn is converted into a Breeze vector. 
> This is expensive and not necessarily beautiful.






[jira] [Commented] (SPARK-13295) ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new instances of arrays/vectors for each record

2016-02-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144067#comment-15144067
 ] 

Apache Spark commented on SPARK-13295:
--

User 'NarineK' has created a pull request for this issue:
https://github.com/apache/spark/pull/11179

> ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new 
> instances of arrays/vectors for each record
> ---
>
> Key: SPARK-13295
> URL: https://issues.apache.org/jira/browse/SPARK-13295
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Narine Kokhlikyan
>
> As also mentioned/marked by a TODO in AFTAggregator.add(data: AFTPoint), a new 
> array is created for the intercept value and is concatenated with another 
> array which contains the betas; the resulting Array is converted into a dense 
> vector, which in turn is converted into a Breeze vector. 
> This is expensive and not necessarily beautiful.






[jira] [Updated] (SPARK-13295) ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new instances of arrays/vectors for each record

2016-02-11 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-13295:
--
Description: 
As also mentioned/marked by TODO in AFTAggregator.AFTAggregator.add(data: 
AFTPoint) a new array is being created for intercept value and it is being 
concatenated
with another array whith contains the betas, the resulted Array is being 
converted into a Dense vector which in it's turn is being converted into breeze 
vector. 
This is expensive and not necessarily beautiful.



  was:
As also mentioned/marked by TODO in AFTAggregator.AFTAggregator.add(data: 
AFTPoint) a new array is being created for intercept value and it is being 
concatenated
with another array with contains the betas, the resulted Array is being 
converted into a Dense vector which in it's turn is being converted into breeze 
vector. 
This is expensive and not necessarily beautiful.




> ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new 
> instances of arrays/vectors for each record
> ---
>
> Key: SPARK-13295
> URL: https://issues.apache.org/jira/browse/SPARK-13295
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Narine Kokhlikyan
>
> As also mentioned/marked by TODO in AFTAggregator.AFTAggregator.add(data: 
> AFTPoint) a new array is being created for intercept value and it is being 
> concatenated
> with another array whith contains the betas, the resulted Array is being 
> converted into a Dense vector which in it's turn is being converted into 
> breeze vector. 
> This is expensive and not necessarily beautiful.






[jira] [Created] (SPARK-13295) ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new instances of arrays/vectors for each record

2016-02-11 Thread Narine Kokhlikyan (JIRA)
Narine Kokhlikyan created SPARK-13295:
-

 Summary: ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - 
Avoid creating new instances of arrays/vectors for each record
 Key: SPARK-13295
 URL: https://issues.apache.org/jira/browse/SPARK-13295
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: Narine Kokhlikyan


As also mentioned/marked by TODO in AFTAggregator.AFTAggregator.add(data: 
AFTPoint) a new array is being created for intercept value and it is being 
concatenated
with another array with contains the betas, the resulted Array is being 
converted into a Dense vector which in it's turn is being converted into breeze 
vector. 
This is expensive and not necessarily beautiful.








[jira] [Updated] (SPARK-7889) Jobs progress of apps on complete page of HistoryServer shows uncompleted

2016-02-11 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-7889:

Assignee: Steve Loughran

> Jobs progress of apps on complete page of HistoryServer shows uncompleted
> -
>
> Key: SPARK-7889
> URL: https://issues.apache.org/jira/browse/SPARK-7889
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: meiyoula
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 2.0.0
>
>
> When running SparkPi with 2000 tasks, clicking into the app on the incomplete 
> page shows a job progress of 400/2000. After the app is completed and moves 
> from the incomplete page to the complete page, clicking into the app still 
> shows a job progress of 400/2000.






[jira] [Resolved] (SPARK-7889) Jobs progress of apps on complete page of HistoryServer shows uncompleted

2016-02-11 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-7889.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 8
[https://github.com/apache/spark/pull/8]

> Jobs progress of apps on complete page of HistoryServer shows uncompleted
> -
>
> Key: SPARK-7889
> URL: https://issues.apache.org/jira/browse/SPARK-7889
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: meiyoula
>Priority: Minor
> Fix For: 2.0.0
>
>
> When running SparkPi with 2000 tasks, clicking into the app on the incomplete 
> page shows a job progress of 400/2000. After the app is completed and moves 
> from the incomplete page to the complete page, clicking into the app still 
> shows a job progress of 400/2000.






[jira] [Assigned] (SPARK-13097) Extend Binarizer to allow Double AND Vector inputs

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-13097:
-

Assignee: Xiangrui Meng

> Extend Binarizer to allow Double AND Vector inputs
> --
>
> Key: SPARK-13097
> URL: https://issues.apache.org/jira/browse/SPARK-13097
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Mike Seddon
>Assignee: Xiangrui Meng
>Priority: Minor
>
> Enhance the existing Spark ML Binarizer [SPARK-5891] to allow Vector input in 
> addition to the existing Double input column type.
> https://github.com/apache/spark/pull/10976
> A use case for this enhancement is when a user wants to binarize many 
> similar feature columns at once using the same threshold value.
> A real-world example would be where the authors of one of the 
> leading MNIST handwriting character recognition entries convert 784 
> grayscale (0-255) pixels (28x28 pixel images) to binary if the pixel's 
> grayscale value exceeds 127.5 (http://arxiv.org/abs/1003.0358). With this 
> modification the user is able to run VectorAssembler(784 
> columns)->Binarizer(127.5)->Classifier, as all the pixels are of a similar 
> type. 
> This approach also makes it much easier to use the ParamGridBuilder to test 
> multiple threshold values.
> I have already written the code and unit tests and have tested them in a 
> multilayer perceptron classifier workflow.
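
For illustration only, a sketch of the pipeline shape described above, assuming 
the proposed Vector support in Binarizer; column names and layer sizes are made 
up for the example:
{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.feature.{Binarizer, VectorAssembler}

// Assemble 784 grayscale pixel columns into one vector column.
val pixelCols = (0 until 784).map(i => s"pixel$i").toArray
val assembler = new VectorAssembler()
  .setInputCols(pixelCols)
  .setOutputCol("rawPixels")

// With this enhancement, Binarizer can take the assembled Vector column directly.
val binarizer = new Binarizer()
  .setInputCol("rawPixels")
  .setOutputCol("features")
  .setThreshold(127.5)

val classifier = new MultilayerPerceptronClassifier()
  .setLayers(Array(784, 100, 10))  // example layer sizes, not from the ticket
  .setFeaturesCol("features")
  .setLabelCol("label")

val pipeline = new Pipeline().setStages(Array(assembler, binarizer, classifier))
{code}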






[jira] [Updated] (SPARK-13097) Extend Binarizer to allow Double AND Vector inputs

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13097:
--
Assignee: Mike Seddon  (was: Xiangrui Meng)

> Extend Binarizer to allow Double AND Vector inputs
> --
>
> Key: SPARK-13097
> URL: https://issues.apache.org/jira/browse/SPARK-13097
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Mike Seddon
>Assignee: Mike Seddon
>Priority: Minor
>
> Enhance the existing Spark ML Binarizer [SPARK-5891] to allow Vector input in 
> addition to the existing Double input column type.
> https://github.com/apache/spark/pull/10976
> A use case for this enhancement is when a user wants to binarize many 
> similar feature columns at once using the same threshold value.
> A real-world example would be where the authors of one of the 
> leading MNIST handwriting character recognition entries convert 784 
> grayscale (0-255) pixels (28x28 pixel images) to binary if the pixel's 
> grayscale value exceeds 127.5 (http://arxiv.org/abs/1003.0358). With this 
> modification the user is able to run VectorAssembler(784 
> columns)->Binarizer(127.5)->Classifier, as all the pixels are of a similar 
> type. 
> This approach also makes it much easier to use the ParamGridBuilder to test 
> multiple threshold values.
> I have already written the code and unit tests and have tested them in a 
> multilayer perceptron classifier workflow.






[jira] [Updated] (SPARK-13097) Extend Binarizer to allow Double AND Vector inputs

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13097:
--
Shepherd: Xiangrui Meng  (was: Liang-Chi Hsieh)

> Extend Binarizer to allow Double AND Vector inputs
> --
>
> Key: SPARK-13097
> URL: https://issues.apache.org/jira/browse/SPARK-13097
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Mike Seddon
>Assignee: Mike Seddon
>Priority: Minor
>
> Enhance the existing Spark ML Binarizer [SPARK-5891] to allow Vector input in 
> addition to the existing Double input column type.
> https://github.com/apache/spark/pull/10976
> A use case for this enhancement is when a user wants to binarize many 
> similar feature columns at once using the same threshold value.
> A real-world example would be where the authors of one of the 
> leading MNIST handwriting character recognition entries convert 784 
> grayscale (0-255) pixels (28x28 pixel images) to binary if the pixel's 
> grayscale value exceeds 127.5 (http://arxiv.org/abs/1003.0358). With this 
> modification the user is able to run VectorAssembler(784 
> columns)->Binarizer(127.5)->Classifier, as all the pixels are of a similar 
> type. 
> This approach also makes it much easier to use the ParamGridBuilder to test 
> multiple threshold values.
> I have already written the code and unit tests and have tested them in a 
> multilayer perceptron classifier workflow.






[jira] [Updated] (SPARK-13097) Extend Binarizer to allow Double AND Vector inputs

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13097:
--
Target Version/s: 2.0.0

> Extend Binarizer to allow Double AND Vector inputs
> --
>
> Key: SPARK-13097
> URL: https://issues.apache.org/jira/browse/SPARK-13097
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Mike Seddon
>Assignee: Mike Seddon
>Priority: Minor
>
> Enhance the existing Spark ML Binarizer [SPARK-5891] to allow Vector input in 
> addition to the existing Double input column type.
> https://github.com/apache/spark/pull/10976
> A use case for this enhancement is when a user wants to binarize many 
> similar feature columns at once using the same threshold value.
> A real-world example would be where the authors of one of the 
> leading MNIST handwriting character recognition entries convert 784 
> grayscale (0-255) pixels (28x28 pixel images) to binary if the pixel's 
> grayscale value exceeds 127.5 (http://arxiv.org/abs/1003.0358). With this 
> modification the user is able to run VectorAssembler(784 
> columns)->Binarizer(127.5)->Classifier, as all the pixels are of a similar 
> type. 
> This approach also makes it much easier to use the ParamGridBuilder to test 
> multiple threshold values.
> I have already written the code and unit tests and have tested them in a 
> multilayer perceptron classifier workflow.






[jira] [Resolved] (SPARK-13153) PySpark ML persistence failed when handle no default value parameter

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-13153.
---
   Resolution: Fixed
Fix Version/s: 1.6.1
   2.0.0

Issue resolved by pull request 11043
[https://github.com/apache/spark/pull/11043]

> PySpark ML persistence failed when handle no default value parameter
> 
>
> Key: SPARK-13153
> URL: https://issues.apache.org/jira/browse/SPARK-13153
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Tommy Yu
>Assignee: Tommy Yu
>Priority: Minor
> Fix For: 2.0.0, 1.6.1
>
>
> This defect was found when implementing task SPARK-13033, after adding the code 
> below to a doctest. 
> It looks like _transfer_params_from_java did not consider params which do 
> not have a default value, and we should handle them. 
> >>> import os, tempfile
> >>> path = tempfile.mkdtemp()
> >>> aftsr_path = path + "/aftsr"
> >>> aftsr.save(aftsr_path)
> >>> aftsr2 = AFTSurvivalRegression.load(aftsr_path)
> Exception detail.
> ir2 = IsotonicRegression.load(ir_path)
> Exception raised:
> Traceback (most recent call last):
> File "C:\Python27\lib\doctest.py", line 1289, in run
> compileflags, 1) in test.globs
> File "", line 1, in
> ir2 = IsotonicRegression.load(ir_path)
> File 
> "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\ml\util.py",
>  line 194, in load
> return cls.read().load(path)
> File 
> "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\ml\util.py",
>  line 148, in load
> instance._transfer_params_from_java()
> File 
> "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\ml\wrapper.py",
>  line 82, in _transfer_params_from_java
> value = _java2py(sc, self._java_obj.getOrDefault(java_param))
> File 
> "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py",
>  line 813, in
> _call
> answer, self.gateway_client, self.target_id, self.name)
> File 
> "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\sql\utils.py",
>  line 45, in deco
> return f(*a, **kw)
> File 
> "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\protocol.py",
>  line 308, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o351.getOrDefault.
> : java.util.NoSuchElementException: Failed to find a default value for 
> weightCol
> at 
> org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:647)
> at 
> org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:647)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:646)
> at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:43)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
> at py4j.Gateway.invoke(Gateway.java:259)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:209)
> at java.lang.Thread.run(Thread.java:745)






[jira] [Resolved] (SPARK-12746) ArrayType(_, true) should also accept ArrayType(_, false)

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-12746.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10697
[https://github.com/apache/spark/pull/10697]

> ArrayType(_, true) should also accept ArrayType(_, false)
> -
>
> Key: SPARK-12746
> URL: https://issues.apache.org/jira/browse/SPARK-12746
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 1.6.0
>Reporter: Earthson Lu
>Assignee: Earthson Lu
> Fix For: 2.0.0
>
>
> I see CountVectorizer has a schema check for its input column which requires 
> ArrayType(StringType, true). 
> ArrayType(StringType, false) is just a special case of ArrayType(StringType, true), 
> but it will not pass this type check.
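
For illustration only, a sketch (hypothetical helper, not the actual Spark method) 
of the compatibility rule the ticket asks for: an array whose elements cannot be 
null should be accepted wherever a nullable-element array is required:
{code}
import org.apache.spark.sql.types._

def isCompatibleArray(actual: DataType, required: ArrayType): Boolean = actual match {
  // containsNull = false is strictly more restrictive, so it should satisfy
  // a requirement for containsNull = true.
  case ArrayType(elementType, containsNull) =>
    elementType == required.elementType && (required.containsNull || !containsNull)
  case _ => false
}

// ArrayType(StringType, false) passes a check expecting ArrayType(StringType, true).
assert(isCompatibleArray(ArrayType(StringType, false), ArrayType(StringType, true)))
{code}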






[jira] [Resolved] (SPARK-12915) SQL metrics for generated operators

2016-02-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12915.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> SQL metrics for generated operators
> ---
>
> Key: SPARK-12915
> URL: https://issues.apache.org/jira/browse/SPARK-12915
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> The metrics should be very efficient. 






[jira] [Updated] (SPARK-12375) VectorIndexer: allow unknown categories

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-12375:
--
Assignee: yuhao yang  (was: Apache Spark)

> VectorIndexer: allow unknown categories
> ---
>
> Key: SPARK-12375
> URL: https://issues.apache.org/jira/browse/SPARK-12375
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>
> Add an option for allowing unknown categories, probably via a parameter like 
> "allowUnknownCategories."
> If true, handle unknown categories during transform by assigning them to 
> an extra category index.
> The API should resemble the one used for StringIndexer.
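A plain-Python sketch of the proposed behavior (no Spark API involved; the extra-index convention and the parameter name are assumptions, not the final design):

{code}
# Categories observed for one vector slot during fit: raw value -> category index.
known = {0.0: 0, 1.0: 1, 3.0: 2}
unknown_index = len(known)  # one extra index reserved for unseen categories

def index_category(value, allow_unknown_categories=True):
    """Map a raw value to its category index; unknowns go to the extra index."""
    if value in known:
        return known[value]
    if allow_unknown_categories:
        return unknown_index
    raise ValueError("Unknown category: %r" % value)

print(index_category(3.0))  # 2
print(index_category(7.0))  # 3, the extra "unknown" bucket
{code}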



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12746) ArrayType(_, true) should also accept ArrayType(_, false)

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-12746:
--
Shepherd: Xiangrui Meng

> ArrayType(_, true) should also accept ArrayType(_, false)
> -
>
> Key: SPARK-12746
> URL: https://issues.apache.org/jira/browse/SPARK-12746
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 1.6.0
>Reporter: Earthson Lu
>Assignee: Earthson Lu
>
> I see that CountVectorizer has a schema check that requires 
> ArrayType(StringType, true). 
> ArrayType(StringType, false) is just a special case of ArrayType(StringType, true), 
> but it does not pass this type check.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11940) Python API for ml.clustering.LDA

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11940:
--
Shepherd: Yanbo Liang

> Python API for ml.clustering.LDA
> 
>
> Key: SPARK-11940
> URL: https://issues.apache.org/jira/browse/SPARK-11940
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Jeff Zhang
>
> Add Python API for ml.clustering.LDA.
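Assuming the Python wrapper simply mirrors the Scala ml.clustering.LDA API, usage might look roughly like this (a sketch, not the final API; df is assumed to be a DataFrame with a vector-valued "features" column):

{code}
from pyspark.ml.clustering import LDA

lda = LDA(k=10, maxIter=20, featuresCol="features")
model = lda.fit(df)

topics = model.describeTopics(maxTermsPerTopic=5)  # top terms per topic
result = model.transform(df)                       # adds a topic-distribution column
{code}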



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11940) Python API for ml.clustering.LDA

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11940:
--
Assignee: Jeff Zhang

> Python API for ml.clustering.LDA
> 
>
> Key: SPARK-11940
> URL: https://issues.apache.org/jira/browse/SPARK-11940
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Jeff Zhang
>
> Add Python API for ml.clustering.LDA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12765) CountVectorizerModel.transform lost the transformSchema

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-12765:
--
Target Version/s: 2.0.0

> CountVectorizerModel.transform lost the transformSchema
> ---
>
> Key: SPARK-12765
> URL: https://issues.apache.org/jira/browse/SPARK-12765
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0, 1.6.1
>Reporter: sloth
>Assignee: sloth
>  Labels: patch
> Fix For: 2.0.0
>
>
> In the ml package, CountVectorizerModel forgets to call transformSchema in 
> its transform function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12765) CountVectorizerModel.transform lost the transformSchema

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-12765:
--
Assignee: sloth

> CountVectorizerModel.transform lost the transformSchema
> ---
>
> Key: SPARK-12765
> URL: https://issues.apache.org/jira/browse/SPARK-12765
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0, 1.6.1
>Reporter: sloth
>Assignee: sloth
>  Labels: patch
> Fix For: 2.0.0
>
>
> In the ml package, CountVectorizerModel forgets to call transformSchema in 
> its transform function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12765) CountVectorizerModel.transform lost the transformSchema

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-12765.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10720
[https://github.com/apache/spark/pull/10720]

> CountVectorizerModel.transform lost the transformSchema
> ---
>
> Key: SPARK-12765
> URL: https://issues.apache.org/jira/browse/SPARK-12765
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0, 1.6.1
>Reporter: sloth
>  Labels: patch
> Fix For: 2.0.0
>
>
> In the ml package, CountVectorizerModel forgets to call transformSchema in 
> its transform function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12746) ArrayType(_, true) should also accept ArrayType(_, false)

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-12746:
--
Assignee: Earthson Lu

> ArrayType(_, true) should also accept ArrayType(_, false)
> -
>
> Key: SPARK-12746
> URL: https://issues.apache.org/jira/browse/SPARK-12746
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 1.6.0
>Reporter: Earthson Lu
>Assignee: Earthson Lu
>
> I see that CountVectorizer has a schema check that requires 
> ArrayType(StringType, true). 
> ArrayType(StringType, false) is just a special case of ArrayType(StringType, true), 
> but it does not pass this type check.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12746) ArrayType(_, true) should also accept ArrayType(_, false)

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-12746:
--
Target Version/s: 1.6.1, 2.0.0

> ArrayType(_, true) should also accept ArrayType(_, false)
> -
>
> Key: SPARK-12746
> URL: https://issues.apache.org/jira/browse/SPARK-12746
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 1.6.0
>Reporter: Earthson Lu
>Assignee: Earthson Lu
>
> I see that CountVectorizer has a schema check that requires 
> ArrayType(StringType, true). 
> ArrayType(StringType, false) is just a special case of ArrayType(StringType, true), 
> but it does not pass this type check.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13294) Don't build assembly in dev/run-tests

2016-02-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13294:


Assignee: (was: Apache Spark)

> Don't build assembly in dev/run-tests
> -
>
> Key: SPARK-13294
> URL: https://issues.apache.org/jira/browse/SPARK-13294
> Project: Spark
>  Issue Type: Improvement
>Reporter: Josh Rosen
>
> As of SPARK-9284 we should no longer need to build the full Spark assembly 
> JAR in order to run tests. Therefore, we should remove the assembly step from 
> {{dev/run-tests}} in order to reduce build + test time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13294) Don't build assembly in dev/run-tests

2016-02-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143846#comment-15143846
 ] 

Apache Spark commented on SPARK-13294:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/11178

> Don't build assembly in dev/run-tests
> -
>
> Key: SPARK-13294
> URL: https://issues.apache.org/jira/browse/SPARK-13294
> Project: Spark
>  Issue Type: Improvement
>Reporter: Josh Rosen
>
> As of SPARK-9284 we should no longer need to build the full Spark assembly 
> JAR in order to run tests. Therefore, we should remove the assembly step from 
> {{dev/run-tests}} in order to reduce build + test time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13294) Don't build assembly in dev/run-tests

2016-02-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13294:


Assignee: Apache Spark

> Don't build assembly in dev/run-tests
> -
>
> Key: SPARK-13294
> URL: https://issues.apache.org/jira/browse/SPARK-13294
> Project: Spark
>  Issue Type: Improvement
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> As of SPARK-9284 we should no longer need to build the full Spark assembly 
> JAR in order to run tests. Therefore, we should remove the assembly step from 
> {{dev/run-tests}} in order to reduce build + test time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13294) Don't build assembly in dev/run-tests

2016-02-11 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-13294:
--

 Summary: Don't build assembly in dev/run-tests
 Key: SPARK-13294
 URL: https://issues.apache.org/jira/browse/SPARK-13294
 Project: Spark
  Issue Type: Improvement
Reporter: Josh Rosen


As of SPARK-9284 we should no longer need to build the full Spark assembly JAR 
in order to run tests. Therefore, we should remove the assembly step from 
{{dev/run-tests}} in order to reduce build + test time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13033) PySpark ml.regression support export/import

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13033:
--
Shepherd: Yanbo Liang
Target Version/s: 2.0.0

> PySpark ml.regression support export/import
> ---
>
> Key: SPARK-13033
> URL: https://issues.apache.org/jira/browse/SPARK-13033
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Tommy Yu
>Priority: Minor
>
> Add export/import for all estimators and transformers (which have a Scala 
> implementation) under pyspark/ml/regression.py. Please refer to the 
> implementation in SPARK-13032.
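Assuming the Python writers/readers follow the pattern established in SPARK-13032, the resulting usage would presumably look like this (a sketch; the estimator, paths, and train_df are placeholders):

{code}
import tempfile
from pyspark.ml.regression import LinearRegression

path = tempfile.mkdtemp()

lr = LinearRegression(maxIter=5, regParam=0.1)
lr.save(path + "/lr")                       # export the estimator
lr2 = LinearRegression.load(path + "/lr")   # import it back

model = lr.fit(train_df)                    # train_df is assumed to exist
model.save(path + "/lr_model")              # models get the same treatment
{code}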



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13033) PySpark ml.regression support export/import

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13033:
--
Assignee: Tommy Yu

> PySpark ml.regression support export/import
> ---
>
> Key: SPARK-13033
> URL: https://issues.apache.org/jira/browse/SPARK-13033
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Tommy Yu
>Priority: Minor
>
> Add export/import for all estimators and transformers (which have a Scala 
> implementation) under pyspark/ml/regression.py. Please refer to the 
> implementation in SPARK-13032.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13153) PySpark ML persistence failed when handle no default value parameter

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13153:
--
Shepherd: Yanbo Liang
Target Version/s: 1.6.1, 2.0.0

> PySpark ML persistence failed when handle no default value parameter
> 
>
> Key: SPARK-13153
> URL: https://issues.apache.org/jira/browse/SPARK-13153
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Tommy Yu
>Assignee: Tommy Yu
>Priority: Minor
>
> This defect was found while implementing task SPARK-13033, after adding the 
> code below to a doctest. 
> It looks like _transfer_params_from_java does not consider params that have 
> no default value, and we should handle them. 
> >>> import os, tempfile
> >>> path = tempfile.mkdtemp()
> >>> aftsr_path = path + "/aftsr"
> >>> aftsr.save(aftsr_path)
> >>> aftsr2 = AFTSurvivalRegression.load(aftsr_path)
> Exception detail.
> ir2 = IsotonicRegression.load(ir_path)
> Exception raised:
> Traceback (most recent call last):
> File "C:\Python27\lib\doctest.py", line 1289, in run
> compileflags, 1) in test.globs
> File "", line 1, in
> ir2 = IsotonicRegression.load(ir_path)
> File 
> "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\ml\util.py",
>  line 194, in load
> return cls.read().load(path)
> File 
> "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\ml\util.py",
>  line 148, in load
> instance._transfer_params_from_java()
> File 
> "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\ml\wrapper.py",
>  line 82, in _transfer_params_from_java
> value = _java2py(sc, self._java_obj.getOrDefault(java_param))
> File 
> "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py",
>  line 813, in __call__
> answer, self.gateway_client, self.target_id, self.name)
> File 
> "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\sql\utils.py",
>  line 45, in deco
> return f(*a, **kw)
> File 
> "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\protocol.py",
>  line 308, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o351.getOrDefault.
> : java.util.NoSuchElementException: Failed to find a default value for 
> weightCol
> at 
> org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:647)
> at 
> org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:647)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:646)
> at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:43)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
> at py4j.Gateway.invoke(Gateway.java:259)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:209)
> at java.lang.Thread.run(Thread.java:745)
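One possible defensive fix is to skip Java params that are neither explicitly set nor backed by a default, instead of calling getOrDefault unconditionally. A sketch of that guard (method names follow org.apache.spark.ml.param.Params; this is a sketch, not necessarily the merged patch):

{code}
from pyspark import SparkContext
from pyspark.mllib.common import _java2py  # helper already used by the ml wrapper

def copy_params_from_java(py_instance):
    """Guarded copy of params from the companion Java object."""
    sc = SparkContext._active_spark_context
    java_obj = py_instance._java_obj
    for param in py_instance.params:
        if java_obj.hasParam(param.name):
            java_param = java_obj.getParam(param.name)
            # isDefined is true only when the param is explicitly set or has a
            # default, so getOrDefault cannot raise NoSuchElementException here.
            if java_obj.isDefined(java_param):
                py_instance._paramMap[param] = _java2py(sc, java_obj.getOrDefault(java_param))
{code}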



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13153) PySpark ML persistence failed when handle no default value parameter

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13153:
--
Assignee: Tommy Yu

> PySpark ML persistence failed when handle no default value parameter
> 
>
> Key: SPARK-13153
> URL: https://issues.apache.org/jira/browse/SPARK-13153
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Tommy Yu
>Assignee: Tommy Yu
>Priority: Minor
>
> This defect was found while implementing task SPARK-13033, after adding the 
> code below to a doctest. 
> It looks like _transfer_params_from_java does not consider params that have 
> no default value, and we should handle them. 
> >>> import os, tempfile
> >>> path = tempfile.mkdtemp()
> >>> aftsr_path = path + "/aftsr"
> >>> aftsr.save(aftsr_path)
> >>> aftsr2 = AFTSurvivalRegression.load(aftsr_path)
> Exception detail.
> ir2 = IsotonicRegression.load(ir_path)
> Exception raised:
> Traceback (most recent call last):
> File "C:\Python27\lib\doctest.py", line 1289, in run
> compileflags, 1) in test.globs
> File "", line 1, in
> ir2 = IsotonicRegression.load(ir_path)
> File 
> "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\ml\util.py",
>  line 194, in load
> return cls.read().load(path)
> File 
> "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\ml\util.py",
>  line 148, in load
> instance._transfer_params_from_java()
> File 
> "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\ml\wrapper.py",
>  line 82, in _transfer_params_from_java
> value = _java2py(sc, self._java_obj.getOrDefault(java_param))
> File 
> "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py",
>  line 813, in __call__
> answer, self.gateway_client, self.target_id, self.name)
> File 
> "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\sql\utils.py",
>  line 45, in deco
> return f(*a, **kw)
> File 
> "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\protocol.py",
>  line 308, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o351.getOrDefault.
> : java.util.NoSuchElementException: Failed to find a default value for 
> weightCol
> at 
> org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:647)
> at 
> org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:647)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:646)
> at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:43)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
> at py4j.Gateway.invoke(Gateway.java:259)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:209)
> at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13011) K-means wrapper in SparkR

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13011:
--
Shepherd: Xiangrui Meng

> K-means wrapper in SparkR
> -
>
> Key: SPARK-13011
> URL: https://issues.apache.org/jira/browse/SPARK-13011
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>
> Implement a simple wrapper in SparkR to support k-means.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13293) Generate code for Expand

2016-02-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13293:


Assignee: Davies Liu  (was: Apache Spark)

> Generate code for Expand
> 
>
> Key: SPARK-13293
> URL: https://issues.apache.org/jira/browse/SPARK-13293
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13293) Generate code for Expand

2016-02-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143802#comment-15143802
 ] 

Apache Spark commented on SPARK-13293:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/11177

> Generate code for Expand
> 
>
> Key: SPARK-13293
> URL: https://issues.apache.org/jira/browse/SPARK-13293
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13293) Generate code for Expand

2016-02-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13293:


Assignee: Apache Spark  (was: Davies Liu)

> Generate code for Expand
> 
>
> Key: SPARK-13293
> URL: https://issues.apache.org/jira/browse/SPARK-13293
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13047) Pyspark Params.hasParam should not throw an error

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-13047.
---
   Resolution: Fixed
Fix Version/s: 1.6.1
   2.0.0

Issue resolved by pull request 10962
[https://github.com/apache/spark/pull/10962]

> Pyspark Params.hasParam should not throw an error
> -
>
> Key: SPARK-13047
> URL: https://issues.apache.org/jira/browse/SPARK-13047
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Minor
> Fix For: 2.0.0, 1.6.1
>
>
> The Pyspark {{Params}} class has a method {{hasParam(paramName)}} which returns 
> True if the class has a parameter by that name, but throws an 
> {{AttributeError}} otherwise. There is currently no way to get a 
> Boolean indicating whether a class has a parameter. With Spark 2.0 we could 
> modify the existing behavior of {{hasParam}} or add an additional method with 
> this functionality.
> In Python:
> {code}
> from pyspark.ml.classification import NaiveBayes
> nb = NaiveBayes(smoothing=0.5)
> print nb.hasParam("smoothing")
> print nb.hasParam("notAParam")
> {code}
> produces:
> > True
> > AttributeError: 'NaiveBayes' object has no attribute 'notAParam'
> However, in Scala:
> {code}
> import org.apache.spark.ml.classification.NaiveBayes
> val nb  = new NaiveBayes()
> nb.hasParam("smoothing")
> nb.hasParam("notAParam")
> {code}
> produces:
> > true
> > false
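A non-throwing variant could resolve the attribute defensively and return a Boolean, for example (a sketch of one option; the merged change may differ):

{code}
from pyspark.ml.param import Param
from pyspark.ml.classification import NaiveBayes  # requires an active SparkContext

def has_param(instance, param_name):
    """Return True if `instance` has an ml Param named `param_name`, else False."""
    if not isinstance(param_name, str):
        raise TypeError("param_name should be a string, got %s" % type(param_name))
    candidate = getattr(instance, param_name, None)  # no AttributeError on misses
    return isinstance(candidate, Param)

nb = NaiveBayes(smoothing=0.5)
print(has_param(nb, "smoothing"))   # True
print(has_param(nb, "notAParam"))   # False
{code}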



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12949) Support common expression elimination

2016-02-11 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143787#comment-15143787
 ] 

Davies Liu commented on SPARK-12949:


After some prototyping, enabling common expression elimination could give a 10+% 
improvement on stddev, but a 50% regression on Kurtosis. I have not figured out why; 
maybe the JIT can already eliminate the common expressions (given that 
Kurtosis is only 20% slower than stddev). If so, we may not want to do this. 
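For reference, the transformation in question hoists a repeated subexpression so it is evaluated once per row; a toy Python sketch of the idea (not Catalyst code):

{code}
rows = [2.0, 3.0, 5.0, 7.0]
avg = sum(rows) / len(rows)

# Without elimination: (x - avg) is evaluated twice per row.
var_naive = sum((x - avg) * (x - avg) for x in rows) / len(rows)

# With elimination: the common subexpression is computed once and reused.
def variance_cse(values, mean):
    total = 0.0
    for x in values:
        delta = x - mean   # common subexpression, evaluated once per row
        total += delta * delta
    return total / len(values)

print(var_naive, variance_cse(rows, avg))  # identical results, fewer evaluations
{code}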

> Support common expression elimination
> -
>
> Key: SPARK-12949
> URL: https://issues.apache.org/jira/browse/SPARK-12949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13293) Generate code for Expand

2016-02-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13293:
--

 Summary: Generate code for Expand
 Key: SPARK-13293
 URL: https://issues.apache.org/jira/browse/SPARK-13293
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13069) ActorHelper is not throttled by rate limiter

2016-02-11 Thread Lin Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143768#comment-15143768
 ] 

Lin Zhao commented on SPARK-13069:
--

[~zsxwing] I created the PR; please review it at your convenience. We are running 
a patched server, but it would be very helpful for us if this could get into 2.0.0.

> ActorHelper is not throttled by rate limiter
> 
>
> Key: SPARK-13069
> URL: https://issues.apache.org/jira/browse/SPARK-13069
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Lin Zhao
>
> The rate at which an actor receiver sends data to Spark is not limited by maxRate 
> or back pressure. Spark controls how fast it writes the data to the block 
> manager, but the receiver actor sends events asynchronously and can fill 
> up the Akka mailbox with millions of events until memory runs out.
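In effect the proposal makes the hand-off from the receiver actor blocking once a limit is hit; a toy sketch of that back-pressure idea using a bounded queue (plain Python, not the actual Akka/receiver code):

{code}
import queue
import threading
import time

buffer = queue.Queue(maxsize=1000)  # bounded: the producer blocks instead of growing unbounded

def receiver():
    for event in range(10000):
        buffer.put(event)           # blocks when full, throttling the producer

def writer(rate_per_sec=5000):
    while True:
        buffer.get()
        time.sleep(1.0 / rate_per_sec)  # simulate the rate-limited write to the block manager
        buffer.task_done()

threading.Thread(target=writer, daemon=True).start()
receiver()
buffer.join()
{code}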



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13069) ActorHelper is not throttled by rate limiter

2016-02-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143760#comment-15143760
 ] 

Apache Spark commented on SPARK-13069:
--

User 'lin-zhao' has created a pull request for this issue:
https://github.com/apache/spark/pull/11176

> ActorHelper is not throttled by rate limiter
> 
>
> Key: SPARK-13069
> URL: https://issues.apache.org/jira/browse/SPARK-13069
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Lin Zhao
>
> The rate at which an actor receiver sends data to Spark is not limited by maxRate 
> or back pressure. Spark controls how fast it writes the data to the block 
> manager, but the receiver actor sends events asynchronously and can fill 
> up the Akka mailbox with millions of events until memory runs out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13069) ActorHelper is not throttled by rate limiter

2016-02-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13069:


Assignee: (was: Apache Spark)

> ActorHelper is not throttled by rate limiter
> 
>
> Key: SPARK-13069
> URL: https://issues.apache.org/jira/browse/SPARK-13069
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Lin Zhao
>
> The rate at which an actor receiver sends data to Spark is not limited by maxRate 
> or back pressure. Spark controls how fast it writes the data to the block 
> manager, but the receiver actor sends events asynchronously and can fill 
> up the Akka mailbox with millions of events until memory runs out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13069) ActorHelper is not throttled by rate limiter

2016-02-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13069:


Assignee: Apache Spark

> ActorHelper is not throttled by rate limiter
> 
>
> Key: SPARK-13069
> URL: https://issues.apache.org/jira/browse/SPARK-13069
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Lin Zhao
>Assignee: Apache Spark
>
> The rate at which an actor receiver sends data to Spark is not limited by maxRate 
> or back pressure. Spark controls how fast it writes the data to the block 
> manager, but the receiver actor sends events asynchronously and can fill 
> up the Akka mailbox with millions of events until memory runs out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7483) [MLLib] Using Kryo with FPGrowth fails with an exception

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7483:
-
Shepherd: Sean Owen

> [MLLib] Using Kryo with FPGrowth fails with an exception
> 
>
> Key: SPARK-7483
> URL: https://issues.apache.org/jira/browse/SPARK-7483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> When using the FPGrowth algorithm with KryoSerializer, Spark fails with
> {code}
> Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 9.0 (TID 16, localhost): 
> com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: 
> Can not set final scala.collection.mutable.ListBuffer field 
> org.apache.spark.mllib.fpm.FPTree$Summary.nodes to 
> scala.collection.mutable.ArrayBuffer
> Serialization trace:
> nodes (org.apache.spark.mllib.fpm.FPTree$Summary)
> org$apache$spark$mllib$fpm$FPTree$$summaries 
> (org.apache.spark.mllib.fpm.FPTree)
> {code}
> This can easily be reproduced in the Spark codebase by setting 
> {code}
> conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> {code} and running FPGrowthSuite.
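For reference, the same serializer setting from PySpark (a sketch; the FPGrowth reproduction itself lives in the Scala FPGrowthSuite):

{code}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("kryo-fpgrowth-repro")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        # flip to "true" to fail fast on any unregistered class Kryo has to serialize
        .set("spark.kryo.registrationRequired", "false"))

sc = SparkContext(conf=conf)
{code}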



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13035) PySpark ml.clustering support export/import

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13035:
--
Assignee: Yanbo Liang

> PySpark ml.clustering support export/import
> ---
>
> Key: SPARK-13035
> URL: https://issues.apache.org/jira/browse/SPARK-13035
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.0.0
>
>
> Add export/import for all estimators and transformers (which have a Scala 
> implementation) under pyspark/ml/clustering.py. Please refer to the 
> implementation in SPARK-13032.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13035) PySpark ml.clustering support export/import

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-13035.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10999
[https://github.com/apache/spark/pull/10999]

> PySpark ml.clustering support export/import
> ---
>
> Key: SPARK-13035
> URL: https://issues.apache.org/jira/browse/SPARK-13035
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
> Fix For: 2.0.0
>
>
> Add export/import for all estimators and transformers (which have a Scala 
> implementation) under pyspark/ml/clustering.py. Please refer to the 
> implementation in SPARK-13032.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13035) PySpark ml.clustering support export/import

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13035:
--
Target Version/s: 2.0.0

> PySpark ml.clustering support export/import
> ---
>
> Key: SPARK-13035
> URL: https://issues.apache.org/jira/browse/SPARK-13035
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.0.0
>
>
> Add export/import for all estimators and transformers (which have a Scala 
> implementation) under pyspark/ml/clustering.py. Please refer to the 
> implementation in SPARK-13032.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13037) PySpark ml.recommendation support export/import

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13037:
--
Target Version/s: 2.0.0

> PySpark ml.recommendation support export/import
> ---
>
> Key: SPARK-13037
> URL: https://issues.apache.org/jira/browse/SPARK-13037
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Kai Jiang
> Fix For: 2.0.0
>
>
> Add export/import for all estimators and transformers (which have a Scala 
> implementation) under pyspark/ml/recommendation.py. Please refer to the 
> implementation in SPARK-13032.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13037) PySpark ml.recommendation support export/import

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-13037.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11044
[https://github.com/apache/spark/pull/11044]

> PySpark ml.recommendation support export/import
> ---
>
> Key: SPARK-13037
> URL: https://issues.apache.org/jira/browse/SPARK-13037
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
> Fix For: 2.0.0
>
>
> Add export/import for all estimators and transformers (which have a Scala 
> implementation) under pyspark/ml/recommendation.py. Please refer to the 
> implementation in SPARK-13032.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13037) PySpark ml.recommendation support export/import

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13037:
--
Assignee: Kai Jiang

> PySpark ml.recommendation support export/import
> ---
>
> Key: SPARK-13037
> URL: https://issues.apache.org/jira/browse/SPARK-13037
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Kai Jiang
> Fix For: 2.0.0
>
>
> Add export/import for all estimators and transformers (which have a Scala 
> implementation) under pyspark/ml/recommendation.py. Please refer to the 
> implementation in SPARK-13032.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13047) Pyspark Params.hasParam should not throw an error

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13047:
--
Target Version/s: 1.6.1, 2.0.0  (was: 1.6.1)

> Pyspark Params.hasParam should not throw an error
> -
>
> Key: SPARK-13047
> URL: https://issues.apache.org/jira/browse/SPARK-13047
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Seth Hendrickson
>Priority: Minor
>
> The Pyspark {{Params}} class has a method {{hasParam(paramName)}} which returns 
> True if the class has a parameter by that name, but throws an 
> {{AttributeError}} otherwise. There is currently no way to get a 
> Boolean indicating whether a class has a parameter. With Spark 2.0 we could 
> modify the existing behavior of {{hasParam}} or add an additional method with 
> this functionality.
> In Python:
> {code}
> from pyspark.ml.classification import NaiveBayes
> nb = NaiveBayes(smoothing=0.5)
> print nb.hasParam("smoothing")
> print nb.hasParam("notAParam")
> {code}
> produces:
> > True
> > AttributeError: 'NaiveBayes' object has no attribute 'notAParam'
> However, in Scala:
> {code}
> import org.apache.spark.ml.classification.NaiveBayes
> val nb  = new NaiveBayes()
> nb.hasParam("smoothing")
> nb.hasParam("notAParam")
> {code}
> produces:
> > true
> > false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13047) Pyspark Params.hasParam should not throw an error

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13047:
--
Assignee: Seth Hendrickson

> Pyspark Params.hasParam should not throw an error
> -
>
> Key: SPARK-13047
> URL: https://issues.apache.org/jira/browse/SPARK-13047
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Minor
>
> The Pyspark {{Params}} class has a method {{hasParam(paramName)}} which returns 
> True if the class has a parameter by that name, but throws an 
> {{AttributeError}} otherwise. There is currently no way to get a 
> Boolean indicating whether a class has a parameter. With Spark 2.0 we could 
> modify the existing behavior of {{hasParam}} or add an additional method with 
> this functionality.
> In Python:
> {code}
> from pyspark.ml.classification import NaiveBayes
> nb = NaiveBayes(smoothing=0.5)
> print nb.hasParam("smoothing")
> print nb.hasParam("notAParam")
> {code}
> produces:
> > True
> > AttributeError: 'NaiveBayes' object has no attribute 'notAParam'
> However, in Scala:
> {code}
> import org.apache.spark.ml.classification.NaiveBayes
> val nb  = new NaiveBayes()
> nb.hasParam("smoothing")
> nb.hasParam("notAParam")
> {code}
> produces:
> > true
> > false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13047) Pyspark Params.hasParam should not throw an error

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13047:
--
Target Version/s: 1.6.1

> Pyspark Params.hasParam should not throw an error
> -
>
> Key: SPARK-13047
> URL: https://issues.apache.org/jira/browse/SPARK-13047
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Seth Hendrickson
>Priority: Minor
>
> The Pyspark {{Params}} class has a method {{hasParam(paramName)}} which returns 
> True if the class has a parameter by that name, but throws an 
> {{AttributeError}} otherwise. There is currently no way to get a 
> Boolean indicating whether a class has a parameter. With Spark 2.0 we could 
> modify the existing behavior of {{hasParam}} or add an additional method with 
> this functionality.
> In Python:
> {code}
> from pyspark.ml.classification import NaiveBayes
> nb = NaiveBayes(smoothing=0.5)
> print nb.hasParam("smoothing")
> print nb.hasParam("notAParam")
> {code}
> produces:
> > True
> > AttributeError: 'NaiveBayes' object has no attribute 'notAParam'
> However, in Scala:
> {code}
> import org.apache.spark.ml.classification.NaiveBayes
> val nb  = new NaiveBayes()
> nb.hasParam("smoothing")
> nb.hasParam("notAParam")
> {code}
> produces:
> > true
> > false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13019) Replace example code in mllib-statistics.md using include_example

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13019:
--
Shepherd: Xusen Yin

> Replace example code in mllib-statistics.md using include_example
> -
>
> Key: SPARK-13019
> URL: https://issues.apache.org/jira/browse/SPARK-13019
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
>
> See examples in other finished sub-JIRAs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13019) Replace example code in mllib-statistics.md using include_example

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13019:
--
Assignee: Xin Ren

> Replace example code in mllib-statistics.md using include_example
> -
>
> Key: SPARK-13019
> URL: https://issues.apache.org/jira/browse/SPARK-13019
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
>
> See examples in other finished sub-JIRAs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13013) Replace example code in mllib-clustering.md using include_example

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13013:
--
Assignee: Xin Ren

> Replace example code in mllib-clustering.md using include_example
> -
>
> Key: SPARK-13013
> URL: https://issues.apache.org/jira/browse/SPARK-13013
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
>
> See examples in other finished sub-JIRAs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13013) Replace example code in mllib-clustering.md using include_example

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13013:
--
Shepherd: Xusen Yin

> Replace example code in mllib-clustering.md using include_example
> -
>
> Key: SPARK-13013
> URL: https://issues.apache.org/jira/browse/SPARK-13013
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
>
> See examples in other finished sub-JIRAs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13015) Replace example code in mllib-data-types.md using include_example

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13015:
--
Assignee: Xin Ren

> Replace example code in mllib-data-types.md using include_example
> -
>
> Key: SPARK-13015
> URL: https://issues.apache.org/jira/browse/SPARK-13015
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
>
> See examples in other finished sub-JIRAs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13018) Replace example code in mllib-pmml-model-export.md using include_example

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13018:
--
Assignee: Xin Ren

> Replace example code in mllib-pmml-model-export.md using include_example
> 
>
> Key: SPARK-13018
> URL: https://issues.apache.org/jira/browse/SPARK-13018
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
>
> See examples in other finished sub-JIRAs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13018) Replace example code in mllib-pmml-model-export.md using include_example

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13018:
--
Shepherd: Xusen Yin  (was: Xusen Yin)

> Replace example code in mllib-pmml-model-export.md using include_example
> 
>
> Key: SPARK-13018
> URL: https://issues.apache.org/jira/browse/SPARK-13018
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
>
> See examples in other finished sub-JIRAs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13014) Replace example code in mllib-collaborative-filtering.md using include_example

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13014:
--
Assignee: Xin Ren

> Replace example code in mllib-collaborative-filtering.md using include_example
> --
>
> Key: SPARK-13014
> URL: https://issues.apache.org/jira/browse/SPARK-13014
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
>
> See examples in other finished sub-JIRAs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13015) Replace example code in mllib-data-types.md using include_example

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13015:
--
Shepherd: Xusen Yin

> Replace example code in mllib-data-types.md using include_example
> -
>
> Key: SPARK-13015
> URL: https://issues.apache.org/jira/browse/SPARK-13015
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
>
> See examples in other finished sub-JIRAs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13018) Replace example code in mllib-pmml-model-export.md using include_example

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13018:
--
Shepherd: Xusen Yin

> Replace example code in mllib-pmml-model-export.md using include_example
> 
>
> Key: SPARK-13018
> URL: https://issues.apache.org/jira/browse/SPARK-13018
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
>
> See examples in other finished sub-JIRAs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13014) Replace example code in mllib-collaborative-filtering.md using include_example

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13014:
--
Shepherd: Xusen Yin

> Replace example code in mllib-collaborative-filtering.md using include_example
> --
>
> Key: SPARK-13014
> URL: https://issues.apache.org/jira/browse/SPARK-13014
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
>
> See examples in other finished sub-JIRAs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13016) Replace example code in mllib-dimensionality-reduction.md using include_example

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13016:
--
Assignee: Devaraj K

> Replace example code in mllib-dimensionality-reduction.md using 
> include_example
> ---
>
> Key: SPARK-13016
> URL: https://issues.apache.org/jira/browse/SPARK-13016
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Devaraj K
>Priority: Minor
>  Labels: starter
>
> See examples in other finished sub-JIRAs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13016) Replace example code in mllib-dimensionality-reduction.md using include_example

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13016:
--
Shepherd: Xusen Yin

> Replace example code in mllib-dimensionality-reduction.md using 
> include_example
> ---
>
> Key: SPARK-13016
> URL: https://issues.apache.org/jira/browse/SPARK-13016
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> See examples in other finished sub-JIRAs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13017) Replace example code in mllib-feature-extraction.md using include_example

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13017:
--
Shepherd: Xusen Yin
Assignee: Xin Ren

> Replace example code in mllib-feature-extraction.md using include_example
> -
>
> Key: SPARK-13017
> URL: https://issues.apache.org/jira/browse/SPARK-13017
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
>
> See examples in other finished sub-JIRAs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13292) QuantileDiscretizer should take random seed in PySpark

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13292:
--
Description: SPARK-11515 for the Python API.

> QuantileDiscretizer should take random seed in PySpark
> --
>
> Key: SPARK-13292
> URL: https://issues.apache.org/jira/browse/SPARK-13292
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Yu Ishikawa
>Priority: Minor
>
> SPARK-11515 for the Python API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13292) QuantileDiscretizer should take random seed in PySpark

2016-02-11 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-13292:
-

 Summary: QuantileDiscretizer should take random seed in PySpark
 Key: SPARK-13292
 URL: https://issues.apache.org/jira/browse/SPARK-13292
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Affects Versions: 2.0.0
Reporter: Xiangrui Meng
Assignee: Yu Ishikawa
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11515) QuantileDiscretizer should take random seed

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11515:
--
Target Version/s: 2.0.0

> QuantileDiscretizer should take random seed
> ---
>
> Key: SPARK-11515
> URL: https://issues.apache.org/jira/browse/SPARK-11515
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yu Ishikawa
>Priority: Minor
> Fix For: 2.0.0
>
>
> QuantileDiscretizer takes a random sample to select bins.  It currently does 
> not specify a seed for the XORShiftRandom, but it should take a seed by 
> extending the HasSeed Param.
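
A minimal sketch of what the proposed usage could look like once
QuantileDiscretizer mixes in HasSeed; the column names, bucket count, and seed
below are made up for illustration and this is not the actual patch:

{code}
import org.apache.spark.ml.feature.QuantileDiscretizer

// Hypothetical usage: with a seed Param available, the random sample used to
// pick the bin boundaries becomes reproducible across runs.
val discretizer = new QuantileDiscretizer()
  .setInputCol("hour")         // assumed input column
  .setOutputCol("hourBucket")  // assumed output column
  .setNumBuckets(4)
  .setSeed(12345L)             // the Param this issue proposes to add

val bucketed = discretizer.fit(df).transform(df)  // df: an assumed DataFrame
{code}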



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11515) QuantileDiscretizer should take random seed

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11515:
--
Assignee: Yu Ishikawa

> QuantileDiscretizer should take random seed
> ---
>
> Key: SPARK-11515
> URL: https://issues.apache.org/jira/browse/SPARK-11515
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yu Ishikawa
>Priority: Minor
> Fix For: 2.0.0
>
>
> QuantileDiscretizer takes a random sample to select bins.  It currently does 
> not specify a seed for the XORShiftRandom, but it should take a seed by 
> extending the HasSeed Param.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11515) QuantileDiscretizer should take random seed

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11515.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 9535
[https://github.com/apache/spark/pull/9535]

> QuantileDiscretizer should take random seed
> ---
>
> Key: SPARK-11515
> URL: https://issues.apache.org/jira/browse/SPARK-11515
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
> Fix For: 2.0.0
>
>
> QuantileDiscretizer takes a random sample to select bins.  It currently does 
> not specify a seed for the XORShiftRandom, but it should take a seed by 
> extending the HasSeed Param.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13265) Refactoring of basic ML import/export for other file system besides HDFS

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-13265.
---
   Resolution: Fixed
Fix Version/s: 1.6.1
   2.0.0

Issue resolved by pull request 11151
[https://github.com/apache/spark/pull/11151]

> Refactoring of basic ML import/export for other file system besides HDFS
> 
>
> Key: SPARK-13265
> URL: https://issues.apache.org/jira/browse/SPARK-13265
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yu Ishikawa
>Assignee: Yu Ishikawa
> Fix For: 2.0.0, 1.6.1
>
>
> We can't save a model into a file system other than HDFS, for example Amazon 
> S3, because the file system is fixed in Spark 1.6:
> https://github.com/apache/spark/blob/v1.6.0/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L78
> When I tried to export a KMeans model to Amazon S3, I got the following error.
> {noformat}
> scala> val kmeans = new KMeans().setK(2)
> scala> val model = kmeans.fit(train)
> scala> model.write.overwrite().save("s3n://test-bucket/tmp/test-kmeans/")
> java.lang.IllegalArgumentException: Wrong FS: 
> s3n://test-bucket/tmp/test-kmeans, expected: 
> hdfs://ec2-54-248-42-97.ap-northeast-1.compute.amazonaws.c
> om:9000
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:590)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:170)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:803)
> at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1332)
> at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:80)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:36)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:41)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:45)
> at $iwC$$iwC$$iwC$$iwC.(:47)
> at $iwC$$iwC$$iwC.(:49)
> at $iwC$$iwC.(:51)
> at $iwC.(:53)
> at (:55)
> at .(:59)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at 
> org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
> at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at
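
A minimal sketch of the idea behind the refactoring, assuming it amounts to
resolving the FileSystem from the output path itself rather than from the
default (HDFS) configuration; the helper names below are illustrative, not the
actual patch:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Before (roughly the v1.6.0 behavior linked above): the default FileSystem is
// always used, so a non-HDFS scheme such as s3n:// fails the checkPath call.
def existsViaDefaultFs(path: String, hadoopConf: Configuration): Boolean = {
  val fs = FileSystem.get(hadoopConf)  // bound to fs.defaultFS, i.e. HDFS here
  fs.exists(new Path(path))            // throws "Wrong FS" for s3n:// paths
}

// After (the direction of this refactoring): let the path pick its own
// FileSystem, so s3n://, file://, etc. resolve to the matching implementation.
def existsViaPathFs(path: String, hadoopConf: Configuration): Boolean = {
  val p = new Path(path)
  val fs = p.getFileSystem(hadoopConf)  // scheme-aware FileSystem lookup
  fs.exists(p)
}
{code}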

[jira] [Updated] (SPARK-13265) Refactoring of basic ML import/export for other file system besides HDFS

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13265:
--
Assignee: Yu Ishikawa

> Refactoring of basic ML import/export for other file system besides HDFS
> 
>
> Key: SPARK-13265
> URL: https://issues.apache.org/jira/browse/SPARK-13265
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yu Ishikawa
>Assignee: Yu Ishikawa
>
> We can't save a model into a file system other than HDFS, for example Amazon 
> S3, because the file system is fixed in Spark 1.6:
> https://github.com/apache/spark/blob/v1.6.0/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L78
> When I tried to export a KMeans model to Amazon S3, I got the following error.
> {noformat}
> scala> val kmeans = new KMeans().setK(2)
> scala> val model = kmeans.fit(train)
> scala> model.write.overwrite().save("s3n://test-bucket/tmp/test-kmeans/")
> java.lang.IllegalArgumentException: Wrong FS: 
> s3n://test-bucket/tmp/test-kmeans, expected: 
> hdfs://ec2-54-248-42-97.ap-northeast-1.compute.amazonaws.c
> om:9000
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:590)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:170)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:803)
> at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1332)
> at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:80)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:36)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:41)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:45)
> at $iwC$$iwC$$iwC$$iwC.(:47)
> at $iwC$$iwC$$iwC.(:49)
> at $iwC$$iwC.(:51)
> at $iwC.(:53)
> at (:55)
> at .(:59)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at 
> org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
> at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.apache.sp

[jira] [Updated] (SPARK-13265) Refactoring of basic ML import/export for other file system besides HDFS

2016-02-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13265:
--
Target Version/s: 1.6.1, 2.0.0

> Refactoring of basic ML import/export for other file system besides HDFS
> 
>
> Key: SPARK-13265
> URL: https://issues.apache.org/jira/browse/SPARK-13265
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yu Ishikawa
>Assignee: Yu Ishikawa
>
> We can't save a model into a file system other than HDFS, for example Amazon 
> S3, because the file system is fixed in Spark 1.6:
> https://github.com/apache/spark/blob/v1.6.0/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L78
> When I tried to export a KMeans model to Amazon S3, I got the following error.
> {noformat}
> scala> val kmeans = new KMeans().setK(2)
> scala> val model = kmeans.fit(train)
> scala> model.write.overwrite().save("s3n://test-bucket/tmp/test-kmeans/")
> java.lang.IllegalArgumentException: Wrong FS: 
> s3n://test-bucket/tmp/test-kmeans, expected: 
> hdfs://ec2-54-248-42-97.ap-northeast-1.compute.amazonaws.c
> om:9000
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:590)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:170)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:803)
> at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1332)
> at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:80)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:36)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:41)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:45)
> at $iwC$$iwC$$iwC$$iwC.(:47)
> at $iwC$$iwC$$iwC.(:49)
> at $iwC$$iwC.(:51)
> at $iwC.(:53)
> at (:55)
> at .(:59)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at 
> org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
> at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.

[jira] [Commented] (SPARK-13279) Scheduler does O(N^2) operation when adding a new task set (making it prohibitively slow for scheduling 200K tasks)

2016-02-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143629#comment-15143629
 ] 

Apache Spark commented on SPARK-13279:
--

User 'sitalkedia' has created a pull request for this issue:
https://github.com/apache/spark/pull/11175

> Scheduler does O(N^2) operation when adding a new task set (making it 
> prohibitively slow for scheduling 200K tasks)
> ---
>
> Key: SPARK-13279
> URL: https://issues.apache.org/jira/browse/SPARK-13279
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 1.6.0
>Reporter: Sital Kedia
>
> For each task that the TaskSetManager adds, it iterates through the entire 
> list of existing tasks to check if it's there.  As a result, scheduling a new 
> task set is O(N^2), which can be slow for large task sets.
> This is a bug that was introduced by 
> https://github.com/apache/spark/commit/3535b91: that commit removed the 
> "!readding" condition from the if-statement, but since the re-adding 
> parameter defaulted to false, that commit should have removed the condition 
> check in the if-statement altogether.
> -
> We discovered this bug while running a large pipeline with 200k tasks, when 
> we found that the executors were not able to register with the driver because 
> the driver was stuck holding a global lock in TaskSchedulerImpl.submitTasks 
> function for a long time (it wasn't deadlocked -- just taking a long time). 
> jstack of the driver - http://pastebin.com/m8CP6VMv
> executor log - http://pastebin.com/2NPS1mXC
> From the jstack I see that the thread handling the resource offer from 
> executors (dispatcher-event-loop-9) is blocked on a lock held by the thread 
> "dag-scheduler-event-loop", which is iterating over an entire ArrayBuffer 
> when adding pending tasks. So when we have 200k pending tasks, because of 
> this O(N^2) operation, the driver is just hung for more than 5 minutes. 
> Solution: in the addPendingTask function, we don't really need a duplicate 
> check. It's okay if we add a task to the same queue twice because 
> dequeueTaskFromList will skip already-running tasks. 
> Please note that this is a regression from Spark 1.5.
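
A minimal sketch of the complexity difference being described, not Spark's
actual TaskSetManager code; the buffer below is a simplified stand-in for the
per-host pending-task queues:

{code}
import scala.collection.mutable.ArrayBuffer

val pendingTasksForHost = ArrayBuffer.empty[Int]

// O(N) per insert: scanning the whole buffer for a duplicate makes adding N
// tasks O(N^2) overall -- the behavior this issue describes.
def addPendingTaskWithDupCheck(taskIndex: Int): Unit = {
  if (!pendingTasksForHost.contains(taskIndex)) {
    pendingTasksForHost += taskIndex
  }
}

// O(1) per insert: drop the duplicate check. Duplicates are harmless because
// the dequeue side (dequeueTaskFromList in Spark) skips tasks that are already
// running.
def addPendingTask(taskIndex: Int): Unit = {
  pendingTasksForHost += taskIndex
}
{code}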



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13277) ANTLR ignores other rule using the USING keyword

2016-02-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143571#comment-15143571
 ] 

Apache Spark commented on SPARK-13277:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/11174

> ANTLR ignores other rule using the USING keyword
> 
>
> Key: SPARK-13277
> URL: https://issues.apache.org/jira/browse/SPARK-13277
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Herman van Hovell
>Assignee: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 2.0.0
>
>
> ANTLR currently emits the following warning during compilation:
> {noformat}
> warning(200): org/apache/spark/sql/catalyst/parser/SparkSqlParser.g:938:7: 
> Decision can match input such as "KW_USING Identifier" using multiple 
> alternatives: 2, 3
> As a result, alternative(s) 3 were disabled for that input
> {noformat}
> This means that some of the functionality of the parser is disabled. This was 
> introduced by the migration of the DDLParsers 
> (https://github.com/apache/spark/pull/10723). We should be able to fix this 
> by introducing a syntactic predicate for USING.
> cc [~viirya]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13291) Numerical models should preserve label attributes

2016-02-11 Thread Piotr Smolinski (JIRA)
Piotr Smolinski created SPARK-13291:
---

 Summary: Numerical models should preserve label attributes
 Key: SPARK-13291
 URL: https://issues.apache.org/jira/browse/SPARK-13291
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.6.0
Reporter: Piotr Smolinski
Priority: Minor


I tried building a simple pipeline for Random Forest classification. The 
predictors are a mix of doubles, ints, and strings; the response is a string. 
RFormula seems to be a perfect candidate: RFormulaModel nicely produces a 
*labelCol* column with StringIndexer-derived metadata, and 
RandomForestClassificationModel converts the *featuresCol* into a *predictionCol*. 
The problem is that there is no way to convert the *predictionCol* (which is a 
factor index) back to the label, because the metadata created by StringIndexer 
is lost.

The numerical models should create the *predictionCol* column with the metadata 
seen on the *labelCol* column during model fitting.

Preserving the metadata would, for example, allow pipelining RFormula, 
RandomForestClassifier, and IndexToString, as sketched below.
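
A minimal sketch of the pipeline this would enable, assuming prediction-column
metadata is preserved; the formula, column names, and DataFrame below are
illustrative only:

{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, RFormula}

val formula = new RFormula()
  .setFormula("response ~ .")  // assumed: string response plus mixed predictors
  .setFeaturesCol("features")
  .setLabelCol("label")

val rf = new RandomForestClassifier()
  .setFeaturesCol("features")
  .setLabelCol("label")

// Today this stage needs the label strings supplied explicitly (via setLabels),
// because the StringIndexer metadata from the label column is not carried over
// to the prediction column. With this improvement it could read them from the
// column metadata instead.
val labelConverter = new IndexToString()
  .setInputCol(rf.getPredictionCol)
  .setOutputCol("predictedLabel")

val pipeline = new Pipeline().setStages(Array(formula, rf, labelConverter))
val model = pipeline.fit(trainingDf)  // trainingDf: an assumed input DataFrame
{code}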



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13290) wholeTextFile and binaryFiles are really slow

2016-02-11 Thread mathieu longtin (JIRA)
mathieu longtin created SPARK-13290:
---

 Summary: wholeTextFile and binaryFiles are really slow
 Key: SPARK-13290
 URL: https://issues.apache.org/jira/browse/SPARK-13290
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.6.0
 Environment: Linux stand-alone
Reporter: mathieu longtin


Reading biggish files (175MB) with wholeTextFile or binaryFiles is extremely 
slow. It takes about 3.5 minutes through Spark versus 2.5 seconds reading the 
same file directly in Python.

The Java process balloons to 4.3GB of memory and uses 100% CPU the whole time. 
I suspect Spark reads the file in small chunks and assembles them at the end, 
hence the heavy CPU usage.

{code}
In [49]: rdd = sc.binaryFiles(pathToOneFile)
In [50]: %time path, text = rdd.first()
CPU times: user 1.91 s, sys: 1.13 s, total: 3.04 s
Wall time: 3min 32s
In [51]: len(text)
Out[51]: 191376122
In [52]: %time text = open(pathToOneFile).read()
CPU times: user 8 ms, sys: 691 ms, total: 699 ms
Wall time: 2.43 s
In [53]: len(text)
Out[53]: 191376122
{code}





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8162) Run spark-shell cause NullPointerException

2016-02-11 Thread Matthew Campbell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143513#comment-15143513
 ] 

Matthew Campbell edited comment on SPARK-8162 at 2/11/16 9:14 PM:
--

I'm also running into this problem with the latest spark-1.6.0-bin-hadoop2.6 on 
a Windows 7 machine.


was (Author: mtthwcmpbll):
I'm also running into this problem with the latest spark-1.6.0-bin-hadoop2.6.

> Run spark-shell cause NullPointerException
> --
>
> Key: SPARK-8162
> URL: https://issues.apache.org/jira/browse/SPARK-8162
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Weizhong
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.4.1, 1.5.0
>
>
> Running spark-shell on the latest master branch fails; details follow:
> {noformat}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
>   /_/
> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40)
> Type in expressions to have them evaluated.
> Type :help for more information.
> error: error while loading JobProgressListener, Missing dependency 'bad 
> symbolic reference. A signature in JobProgressListener.class refers to term 
> annotations
> in package com.google.common which is not available.
> It may be completely missing from the current classpath, or the version on
> the classpath might be incompatible with the version used when compiling 
> JobProgressListener.class.', required by 
> /opt/apache/spark/lib/spark-assembly-1.5.0-SNAPSHOT-hadoop2.7.0.jar(org/apache/spark/ui/jobs/JobProgressListener.class)
> java.lang.NullPointerException
>   at org.apache.spark.sql.SQLContext.(SQLContext.scala:193)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:68)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at 
> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
>   at $iwC$$iwC.(:9)
>   at $iwC.(:18)
>   at (:20)
>   at .(:24)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
>   at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:130)
>   at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122)
>   at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
>   at 
> org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122)
>   at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974)
>   at 
> org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:157)
>   at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64)
>   at 
> org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106)
>   at 
> org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>   at 
> org.apa

[jira] [Commented] (SPARK-8162) Run spark-shell cause NullPointerException

2016-02-11 Thread Matthew Campbell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143513#comment-15143513
 ] 

Matthew Campbell commented on SPARK-8162:
-

I'm also running into this problem with the latest spark-1.6.0-bin-hadoop2.6.

> Run spark-shell cause NullPointerException
> --
>
> Key: SPARK-8162
> URL: https://issues.apache.org/jira/browse/SPARK-8162
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Weizhong
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.4.1, 1.5.0
>
>
> Running spark-shell on the latest master branch fails; details follow:
> {noformat}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
>   /_/
> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40)
> Type in expressions to have them evaluated.
> Type :help for more information.
> error: error while loading JobProgressListener, Missing dependency 'bad 
> symbolic reference. A signature in JobProgressListener.class refers to term 
> annotations
> in package com.google.common which is not available.
> It may be completely missing from the current classpath, or the version on
> the classpath might be incompatible with the version used when compiling 
> JobProgressListener.class.', required by 
> /opt/apache/spark/lib/spark-assembly-1.5.0-SNAPSHOT-hadoop2.7.0.jar(org/apache/spark/ui/jobs/JobProgressListener.class)
> java.lang.NullPointerException
>   at org.apache.spark.sql.SQLContext.(SQLContext.scala:193)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:68)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at 
> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
>   at $iwC$$iwC.(:9)
>   at $iwC.(:18)
>   at (:20)
>   at .(:24)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
>   at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:130)
>   at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122)
>   at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
>   at 
> org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122)
>   at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974)
>   at 
> org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:157)
>   at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64)
>   at 
> org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106)
>   at 
> org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>   at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(

[jira] [Created] (SPARK-13289) Word2Vec generate infinite distances when numIterations>5

2016-02-11 Thread Qi Dai (JIRA)
Qi Dai created SPARK-13289:
--

 Summary: Word2Vec generate infinite distances when numIterations>5
 Key: SPARK-13289
 URL: https://issues.apache.org/jira/browse/SPARK-13289
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.6.0
 Environment: Linux, Scala
Reporter: Qi Dai


I recently ran some word2vec experiments on a cluster with 50 executors on a 
large text dataset, and found that when the number of iterations is larger than 
5, the distances between words are all infinite. My code looks like this:

{code}
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

val text = sc.textFile("/project/NLP/1_biliion_words/train").map(_.split(" ").toSeq)
val word2vec = new Word2Vec()
  .setMinCount(25)
  .setVectorSize(96)
  .setNumPartitions(99)
  .setNumIterations(10)
  .setWindowSize(5)
val model = word2vec.fit(text)
val synonyms = model.findSynonyms("who", 40)
for ((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}
{code}

The results are: 
to Infinity
and Infinity
that Infinity
with Infinity
said Infinity
it Infinity
by Infinity
be Infinity
have Infinity
he Infinity
has Infinity
his Infinity
an Infinity
) Infinity
not Infinity
who Infinity
I Infinity
had Infinity
their Infinity
were Infinity
they Infinity
but Infinity
been Infinity

I tried many different datasets and different words for finding synonyms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13288) [1.6.0] Memory leak in Spark streaming

2016-02-11 Thread JESSE CHEN (JIRA)
JESSE CHEN created SPARK-13288:
--

 Summary: [1.6.0] Memory leak in Spark streaming
 Key: SPARK-13288
 URL: https://issues.apache.org/jira/browse/SPARK-13288
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.6.0
 Environment: Bare metal cluster
RHEL 6.6


Reporter: JESSE CHEN


Streaming in 1.6 seems to have a memory leak.

Running the same streaming app in Spark 1.5.1 and 1.6, all things equal, 1.6 
showed gradually increasing processing time. 

The app is simple: 1 Kafka receiver of tweet stream and 20 executors processing 
the tweets in 5-second batches. 

Spark 1.5.1 handled this smoothly and did not show increasing processing time 
in the 40-minute test, but 1.6 showed increasing time about 8 minutes into the 
test. Please see the chart here:

https://ibm.box.com/s/7q4ulik70iwtvyfhoj1dcl4nc469b116

I captured heap dumps in the two versions and did a comparison. I noticed that 
byte arrays (class [B) are using about 50X more space in 1.6.0.

Here are some of the top classes in the heap histogram and their references. 

Heap Histogram (All Classes, excluding platform)

1.6.0 Streaming:
Class                              Instance Count    Total Size
class [B                           8453              3,227,649,599
class [C                           44682             4,255,502
class java.lang.reflect.Method     9059              1,177,670

1.5.1 Streaming:
Class                              Instance Count    Total Size
class [B                           5095              62,938,466
class [C                           130482            12,844,182
class java.lang.String             130171            1,562,052

References by Type:
1.6.0: class [B [0x640039e38]      1.5.1: class [B [0x6c020bb08]

Referrers by Type (1.6.0):
Class                                Count
java.nio.HeapByteBuffer              3239
sun.security.util.DerInputBuffer     1233
sun.security.util.ObjectIdentifier   620
[Ljava.lang.Object;                  408

Referrers by Type (1.5.1):
Class                                Count
sun.security.util.DerInputBuffer     1233
sun.security.util.ObjectIdentifier   620
[[B                                  397
java.lang.reflect.Method             326




The total size of class [B is 3GB in 1.6.0 and only 60MB in 1.5.1.
The java.nio.HeapByteBuffer referrer class did not show up near the top in 1.5.1. 

I have also placed jstack output for 1.5.1 and 1.6.0 online. You can get them 
here:

https://ibm.box.com/sparkstreaming-jstack160
https://ibm.box.com/sparkstreaming-jstack151

Jesse 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13287) Standalone REST API throttling?

2016-02-11 Thread Rares Vernica (JIRA)
Rares Vernica created SPARK-13287:
-

 Summary: Standalone REST API throttling?
 Key: SPARK-13287
 URL: https://issues.apache.org/jira/browse/SPARK-13287
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.6.0
Reporter: Rares Vernica
Priority: Minor


I am using the REST API provided by Spark Standalone mode to check on jobs. It 
turns out that if I don't pause between requests, the server will redirect me to 
the server homepage instead of returning the requested information.

Here is a simple test to prove this:
{code:JavaScript}
$ curl --silent 
http://localhost:8080/api/v1/applications/app-20160211003526-0037/jobs | head 
-2 ; curl --silent 
http://localhost:8080/api/v1/applications/app-20160211003526-0037/jobs | head -2
[ {
  "jobId" : 0,

  
{code}

I am requesting the same information about one application twice using 
{{curl}}. I print the first two lines from each response. The requests are made 
immediately one after another. The first two lines are from the first request, 
the last two lines are from the second request. Again, the request URLs are 
identical. The response from the second request is identical to the homepage 
you get from http://localhost:8080/

If I insert a {{sleep 1}} between the two {{curl}} commands, both work fine. 
For smaller delays, like {{sleep .8}}, it does not work correctly.

I am not sure if this is intentional or a bug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12982) SQLContext: temporary table registration does not accept valid identifier

2016-02-11 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-12982.
---
   Resolution: Resolved
 Assignee: Jayadevan M
Fix Version/s: 2.0.0

> SQLContext: temporary table registration does not accept valid identifier
> -
>
> Key: SPARK-12982
> URL: https://issues.apache.org/jira/browse/SPARK-12982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Grzegorz Chilkiewicz
>Assignee: Jayadevan M
>Priority: Minor
>  Labels: sql
> Fix For: 2.0.0
>
>
> We have encountered very strange behavior of SparkSQL temporary table 
> registration.
> Which identifiers should be valid for a temporary table?
> Alphanumeric + '_' with at least one non-digit?
> Valid identifiers:
> df
> 674123a
> 674123_
> a0e97c59_4445_479d_a7ef_d770e3874123
> 1ae97c59_4445_479d_a7ef_d770e3874123
> Invalid identifier:
> 10e97c59_4445_479d_a7ef_d770e3874123
> Stack trace:
> {code:xml}
> java.lang.RuntimeException: [1.1] failure: identifier expected
> 10e97c59_4445_479d_a7ef_d770e3874123
> ^
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58)
>   at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827)
>   at org.apache.spark.sql.SQLContext.dropTempTable(SQLContext.scala:763)
>   at 
> SparkSqlContextTempTableIdentifier$.identifierCheck(SparkSqlContextTempTableIdentifier.scala:9)
>   at 
> SparkSqlContextTempTableIdentifier$.main(SparkSqlContextTempTableIdentifier.scala:42)
>   at 
> SparkSqlContextTempTableIdentifier.main(SparkSqlContextTempTableIdentifier.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at sbt.Run.invokeMain(Run.scala:67)
>   at sbt.Run.run0(Run.scala:61)
>   at sbt.Run.sbt$Run$$execute$1(Run.scala:51)
>   at sbt.Run$$anonfun$run$1.apply$mcV$sp(Run.scala:55)
>   at sbt.Run$$anonfun$run$1.apply(Run.scala:55)
>   at sbt.Run$$anonfun$run$1.apply(Run.scala:55)
>   at sbt.Logger$$anon$4.apply(Logger.scala:85)
>   at sbt.TrapExit$App.run(TrapExit.scala:248)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Code to reproduce this bug:
> https://github.com/grzegorz-chilkiewicz/SparkSqlContextTempTableIdentifier
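
A minimal sketch of the kind of check the linked repro performs, using the
public 1.6 API in the spark-shell; the DataFrame contents are made up:

{code}
// Assumes an existing SparkContext `sc`, as in the spark-shell.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// Works: the identifier contains a non-digit character in its first segment.
df.registerTempTable("a0e97c59_4445_479d_a7ef_d770e3874123")
sqlContext.dropTempTable("a0e97c59_4445_479d_a7ef_d770e3874123")

// Registration succeeds, but dropTempTable fails with
// "failure: identifier expected", matching the stack trace above.
df.registerTempTable("10e97c59_4445_479d_a7ef_d770e3874123")
sqlContext.dropTempTable("10e97c59_4445_479d_a7ef_d770e3874123")
{code}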



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


