[jira] [Updated] (SPARK-6166) Limit number of in flight outbound requests for shuffle fetch
[ https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-6166: Description: spark.reducer.maxMbInFlight puts a bound on the in-flight data in terms of size. But this is not always sufficient: when the number of hosts in the cluster increases, this can lead to a very large number of inbound connections to one or more nodes, causing workers to fail under the load. I propose we also add a spark.reducer.maxReqsInFlight, which puts a bound on the number of outstanding outbound requests. This might still cause hotspots in the cluster, but in our tests this has significantly reduced the occurrence of worker failures. was: spark.reducer.maxMbInFlight puts a bound on the in-flight data in terms of size. But this is not always sufficient: when the number of hosts in the cluster increases, this can lead to a very large number of inbound connections to one or more nodes, causing workers to fail under the load. I propose we also add a spark.reducer.maxReqsInFlight, which puts a bound on the number of outstanding outbound connections. This might still cause hotspots in the cluster, but in our tests this has significantly reduced the occurrence of worker failures. > Limit number of in flight outbound requests for shuffle fetch > - > > Key: SPARK-6166 > URL: https://issues.apache.org/jira/browse/SPARK-6166 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Mridul Muralidharan >Priority: Minor > Fix For: 2.0.0 > > > spark.reducer.maxMbInFlight puts a bound on the in-flight data in terms of > size. > But this is not always sufficient: when the number of hosts in the cluster > increases, this can lead to a very large number of inbound connections to one > or more nodes, causing workers to fail under the load. > I propose we also add a spark.reducer.maxReqsInFlight, which puts a bound on > the number of outstanding outbound requests. > This might still cause hotspots in the cluster, but in our tests this has > significantly reduced the occurrence of worker failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
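For readers who want to apply the two limits discussed above, a minimal sketch follows. It uses spark.reducer.maxReqsInFlight, the setting added by this ticket, alongside spark.reducer.maxSizeInFlight, the successor of maxMbInFlight; the throttle check itself is only an illustration of the idea, not the actual shuffle fetcher code.

{code}
import org.apache.spark.SparkConf

// Size-based cap (successor of spark.reducer.maxMbInFlight) plus the
// request-count cap proposed in this ticket.
val conf = new SparkConf()
  .set("spark.reducer.maxSizeInFlight", "48m")
  .set("spark.reducer.maxReqsInFlight", "64")

// Illustrative throttle condition: a new fetch request is issued only while
// BOTH bounds hold, so a reducer cannot open an unbounded number of
// simultaneous connections to the nodes serving its map outputs.
def canIssueRequest(
    bytesInFlight: Long, maxBytesInFlight: Long,
    reqsInFlight: Int, maxReqsInFlight: Int,
    nextReqSize: Long): Boolean = {
  reqsInFlight + 1 <= maxReqsInFlight &&
    bytesInFlight + nextReqSize <= maxBytesInFlight
}
{code}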
[jira] [Commented] (SPARK-6166) Limit number of in flight outbound requests for shuffle fetch
[ https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144148#comment-15144148 ] Shixiong Zhu commented on SPARK-6166: - Sure. Done > Limit number of in flight outbound requests for shuffle fetch > - > > Key: SPARK-6166 > URL: https://issues.apache.org/jira/browse/SPARK-6166 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Mridul Muralidharan >Priority: Minor > Fix For: 2.0.0 > > > spark.reducer.maxMbInFlight puts a bound on the in flight data in terms of > size. > But this is not always sufficient : when the number of hosts in the cluster > increase, this can lead to very large number of in-bound connections to one > more nodes - causing workers to fail under the load. > I propose we also add a spark.reducer.maxReqsInFlight - which puts a bound on > number of outstanding outbound requests. > This might still cause hotspots in the cluster, but in our tests this has > significantly reduced the occurance of worker failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6166) Limit number of in flight outbound requests for shuffle fetch
[ https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144147#comment-15144147 ] Reynold Xin commented on SPARK-6166: Thanks! > Limit number of in flight outbound requests for shuffle fetch > - > > Key: SPARK-6166 > URL: https://issues.apache.org/jira/browse/SPARK-6166 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Mridul Muralidharan >Priority: Minor > Fix For: 2.0.0 > > > spark.reducer.maxMbInFlight puts a bound on the in flight data in terms of > size. > But this is not always sufficient : when the number of hosts in the cluster > increase, this can lead to very large number of in-bound connections to one > more nodes - causing workers to fail under the load. > I propose we also add a spark.reducer.maxReqsInFlight - which puts a bound on > number of outstanding outbound requests. > This might still cause hotspots in the cluster, but in our tests this has > significantly reduced the occurance of worker failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6166) Limit number of in flight outbound requests for shuffle fetch
[ https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-6166: Summary: Limit number of in flight outbound requests for shuffle fetch (was: Add config to limit number of concurrent outbound connections for shuffle fetch) > Limit number of in flight outbound requests for shuffle fetch > - > > Key: SPARK-6166 > URL: https://issues.apache.org/jira/browse/SPARK-6166 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Mridul Muralidharan >Priority: Minor > Fix For: 2.0.0 > > > spark.reducer.maxMbInFlight puts a bound on the in flight data in terms of > size. > But this is not always sufficient : when the number of hosts in the cluster > increase, this can lead to very large number of in-bound connections to one > more nodes - causing workers to fail under the load. > I propose we also add a spark.reducer.maxReqsInFlight - which puts a bound on > number of outstanding outbound connections. > This might still cause hotspots in the cluster, but in our tests this has > significantly reduced the occurance of worker failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6166) Add config to limit number of concurrent outbound connections for shuffle fetch
[ https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144146#comment-15144146 ] Reynold Xin commented on SPARK-6166: [~zsxwing] can you update the title of this ticket to something that reflects what's been resolved? Thanks. > Add config to limit number of concurrent outbound connections for shuffle > fetch > --- > > Key: SPARK-6166 > URL: https://issues.apache.org/jira/browse/SPARK-6166 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Mridul Muralidharan >Priority: Minor > Fix For: 2.0.0 > > > spark.reducer.maxMbInFlight puts a bound on the in flight data in terms of > size. > But this is not always sufficient : when the number of hosts in the cluster > increase, this can lead to very large number of in-bound connections to one > more nodes - causing workers to fail under the load. > I propose we also add a spark.reducer.maxReqsInFlight - which puts a bound on > number of outstanding outbound connections. > This might still cause hotspots in the cluster, but in our tests this has > significantly reduced the occurance of worker failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6166) Add config to limit number of concurrent outbound connections for shuffle fetch
[ https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-6166. - Resolution: Fixed Fix Version/s: 2.0.0 > Add config to limit number of concurrent outbound connections for shuffle > fetch > --- > > Key: SPARK-6166 > URL: https://issues.apache.org/jira/browse/SPARK-6166 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Mridul Muralidharan >Priority: Minor > Fix For: 2.0.0 > > > spark.reducer.maxMbInFlight puts a bound on the in flight data in terms of > size. > But this is not always sufficient : when the number of hosts in the cluster > increase, this can lead to very large number of in-bound connections to one > more nodes - causing workers to fail under the load. > I propose we also add a spark.reducer.maxReqsInFlight - which puts a bound on > number of outstanding outbound connections. > This might still cause hotspots in the cluster, but in our tests this has > significantly reduced the occurance of worker failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13012) Replace example code in ml-guide.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13012: -- Assignee: Devaraj K > Replace example code in ml-guide.md using include_example > - > > Key: SPARK-13012 > URL: https://issues.apache.org/jira/browse/SPARK-13012 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Devaraj K >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13012) Replace example code in ml-guide.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13012: -- Shepherd: Xusen Yin > Replace example code in ml-guide.md using include_example > - > > Key: SPARK-13012 > URL: https://issues.apache.org/jira/browse/SPARK-13012 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Devaraj K >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13019) Replace example code in mllib-statistics.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Ren updated SPARK-13019: Description: The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6. Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example. {code}{% include_example scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}{code} Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala` and pick code blocks marked "example" and replace code block in {code}{% highlight %}{code} in the markdown. See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337 was:See examples in other finished sub-JIRAs. > Replace example code in mllib-statistics.md using include_example > - > > Key: SPARK-13019 > URL: https://issues.apache.org/jira/browse/SPARK-13019 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xin Ren >Priority: Minor > Labels: starter > > The example code in the user guide is embedded in the markdown and hence it > is not easy to test. It would be nice to automatically test them. This JIRA > is to discuss options to automate example code testing and see what we can do > in Spark 1.6. > Goal is to move actual example code to spark/examples and test compilation in > Jenkins builds. Then in the markdown, we can reference part of the code to > show in the user guide. This requires adding a Jekyll tag that is similar to > https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, > e.g., called include_example. > {code}{% include_example > scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}{code} > Jekyll will find > `examples/src/main/scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala` > and pick code blocks marked "example" and replace code block in > {code}{% highlight %}{code} > in the markdown. > See more sub-tasks in parent ticket: > https://issues.apache.org/jira/browse/SPARK-11337 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
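As a concrete illustration of the mechanism described above, here is a sketch of what such an example file looks like; the // $example on$ and // $example off$ markers are the convention used in the spark/examples tree to mark the code blocks that the include_example tag extracts. The file contents are abridged and illustrative.

{code}
// examples/src/main/scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala
// Only the regions between the markers are pulled into the user guide by the
// include_example tag; the surrounding boilerplate is not.
package org.apache.spark.examples.mllib

import org.apache.spark.{SparkConf, SparkContext}
// $example on$
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
// $example off$

object SummaryStatisticsExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SummaryStatisticsExample"))
    // $example on$
    val observations = sc.parallelize(Seq(
      Vectors.dense(1.0, 10.0, 100.0),
      Vectors.dense(2.0, 20.0, 200.0),
      Vectors.dense(3.0, 30.0, 300.0)))

    // Compute column-wise summary statistics for the input vectors.
    val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
    println(summary.mean)
    println(summary.variance)
    println(summary.numNonzeros)
    // $example off$
    sc.stop()
  }
}
{code}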
[jira] [Updated] (SPARK-13018) Replace example code in mllib-pmml-model-export.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Ren updated SPARK-13018: Description: The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6. Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example. {code}{% include_example scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala %}{code} Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala` and pick code blocks marked "example" and replace code block in {code}{% highlight %}{code} in the markdown. See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337 was:See examples in other finished sub-JIRAs. > Replace example code in mllib-pmml-model-export.md using include_example > > > Key: SPARK-13018 > URL: https://issues.apache.org/jira/browse/SPARK-13018 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xin Ren >Priority: Minor > Labels: starter > > The example code in the user guide is embedded in the markdown and hence it > is not easy to test. It would be nice to automatically test them. This JIRA > is to discuss options to automate example code testing and see what we can do > in Spark 1.6. > Goal is to move actual example code to spark/examples and test compilation in > Jenkins builds. Then in the markdown, we can reference part of the code to > show in the user guide. This requires adding a Jekyll tag that is similar to > https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, > e.g., called include_example. > {code}{% include_example > scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala %}{code} > Jekyll will find > `examples/src/main/scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala` > and pick code blocks marked "example" and replace code block in > {code}{% highlight %}{code} > in the markdown. > See more sub-tasks in parent ticket: > https://issues.apache.org/jira/browse/SPARK-11337 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13017) Replace example code in mllib-feature-extraction.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Ren updated SPARK-13017: Description: The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6. Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example. {code}{% include_example scala/org/apache/spark/examples/mllib/TFIDFExample.scala %}{code} Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/TFIDFExample.scala` and pick code blocks marked "example" and replace code block in {code}{% highlight %}{code} in the markdown. See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337 was:See examples in other finished sub-JIRAs. > Replace example code in mllib-feature-extraction.md using include_example > - > > Key: SPARK-13017 > URL: https://issues.apache.org/jira/browse/SPARK-13017 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xin Ren >Priority: Minor > Labels: starter > > The example code in the user guide is embedded in the markdown and hence it > is not easy to test. It would be nice to automatically test them. This JIRA > is to discuss options to automate example code testing and see what we can do > in Spark 1.6. > Goal is to move actual example code to spark/examples and test compilation in > Jenkins builds. Then in the markdown, we can reference part of the code to > show in the user guide. This requires adding a Jekyll tag that is similar to > https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, > e.g., called include_example. > {code}{% include_example > scala/org/apache/spark/examples/mllib/TFIDFExample.scala %}{code} > Jekyll will find > `examples/src/main/scala/org/apache/spark/examples/mllib/TFIDFExample.scala` > and pick code blocks marked "example" and replace code block in > {code}{% highlight %}{code} > in the markdown. > See more sub-tasks in parent ticket: > https://issues.apache.org/jira/browse/SPARK-11337 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13015) Replace example code in mllib-data-types.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Ren updated SPARK-13015: Description: The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6. Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example. {code}{% include_example scala/org/apache/spark/examples/mllib/LocalVectorExample.scala %}{code} Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/LocalVectorExample.scala` and pick code blocks marked "example" and replace code block in {code}{% highlight %}{code} in the markdown. See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337 was:See examples in other finished sub-JIRAs. > Replace example code in mllib-data-types.md using include_example > - > > Key: SPARK-13015 > URL: https://issues.apache.org/jira/browse/SPARK-13015 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xin Ren >Priority: Minor > Labels: starter > > The example code in the user guide is embedded in the markdown and hence it > is not easy to test. It would be nice to automatically test them. This JIRA > is to discuss options to automate example code testing and see what we can do > in Spark 1.6. > Goal is to move actual example code to spark/examples and test compilation in > Jenkins builds. Then in the markdown, we can reference part of the code to > show in the user guide. This requires adding a Jekyll tag that is similar to > https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, > e.g., called include_example. > {code}{% include_example > scala/org/apache/spark/examples/mllib/LocalVectorExample.scala %}{code} > Jekyll will find > `examples/src/main/scala/org/apache/spark/examples/mllib/LocalVectorExample.scala` > and pick code blocks marked "example" and replace code block in > {code}{% highlight %}{code} > in the markdown. > See more sub-tasks in parent ticket: > https://issues.apache.org/jira/browse/SPARK-11337 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13014) Replace example code in mllib-collaborative-filtering.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Ren updated SPARK-13014: Description: The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6. Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example. {code}{% include_example scala/org/apache/spark/examples/mllib/RecommendationExample.scala %}{code} Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/RecommendationExample.scala` and pick code blocks marked "example" and replace code block in {code}{% highlight %}{code} in the markdown. See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337 was:See examples in other finished sub-JIRAs. > Replace example code in mllib-collaborative-filtering.md using include_example > -- > > Key: SPARK-13014 > URL: https://issues.apache.org/jira/browse/SPARK-13014 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xin Ren >Priority: Minor > Labels: starter > > The example code in the user guide is embedded in the markdown and hence it > is not easy to test. It would be nice to automatically test them. This JIRA > is to discuss options to automate example code testing and see what we can do > in Spark 1.6. > Goal is to move actual example code to spark/examples and test compilation in > Jenkins builds. Then in the markdown, we can reference part of the code to > show in the user guide. This requires adding a Jekyll tag that is similar to > https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, > e.g., called include_example. > {code}{% include_example > scala/org/apache/spark/examples/mllib/RecommendationExample.scala %}{code} > Jekyll will find > `examples/src/main/scala/org/apache/spark/examples/mllib/RecommendationExample.scala` > and pick code blocks marked "example" and replace code block in > {code}{% highlight %}{code} > in the markdown. > See more sub-tasks in parent ticket: > https://issues.apache.org/jira/browse/SPARK-11337 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13013) Replace example code in mllib-clustering.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Ren updated SPARK-13013: Description: The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6. Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example. {code}{% include_example scala/org/apache/spark/examples/mllib/KMeansExample.scala %}{code} Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/KMeansExample.scala` and pick code blocks marked "example" and replace code block in {code}{% highlight %}{code} in the markdown. See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337 was: The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6. Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example. {code}{% include_example scala ml.KMeansExample guide %}{code} Jekyll will find `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` and pick code blocks marked "example" and put them under {code}{% highlight %}{code} in the markdown. See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337 > Replace example code in mllib-clustering.md using include_example > - > > Key: SPARK-13013 > URL: https://issues.apache.org/jira/browse/SPARK-13013 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xin Ren >Priority: Minor > Labels: starter > > The example code in the user guide is embedded in the markdown and hence it > is not easy to test. It would be nice to automatically test them. This JIRA > is to discuss options to automate example code testing and see what we can do > in Spark 1.6. > Goal is to move actual example code to spark/examples and test compilation in > Jenkins builds. Then in the markdown, we can reference part of the code to > show in the user guide. This requires adding a Jekyll tag that is similar to > https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, > e.g., called include_example. > {code}{% include_example > scala/org/apache/spark/examples/mllib/KMeansExample.scala %}{code} > Jekyll will find > `examples/src/main/scala/org/apache/spark/examples/mllib/KMeansExample.scala` > and pick code blocks marked "example" and replace code block in > {code}{% highlight %}{code} > in the markdown. > See more sub-tasks in parent ticket: > https://issues.apache.org/jira/browse/SPARK-11337 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13013) Replace example code in mllib-clustering.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Ren updated SPARK-13013: Description: The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6. Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example. {code}{% include_example scala ml.KMeansExample guide %}{code} Jekyll will find `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` and pick code blocks marked "example" and put them under {code}{% highlight %}{code} in the markdown. See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337 was:See examples in other finished sub-JIRAs. > Replace example code in mllib-clustering.md using include_example > - > > Key: SPARK-13013 > URL: https://issues.apache.org/jira/browse/SPARK-13013 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xin Ren >Priority: Minor > Labels: starter > > The example code in the user guide is embedded in the markdown and hence it > is not easy to test. It would be nice to automatically test them. This JIRA > is to discuss options to automate example code testing and see what we can do > in Spark 1.6. > Goal is to move actual example code to spark/examples and test compilation in > Jenkins builds. Then in the markdown, we can reference part of the code to > show in the user guide. This requires adding a Jekyll tag that is similar to > https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, > e.g., called include_example. > {code}{% include_example scala ml.KMeansExample guide %}{code} > Jekyll will find > `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` > and pick code blocks marked "example" and put them under > {code}{% highlight %}{code} > in the markdown. > See more sub-tasks in parent ticket: > https://issues.apache.org/jira/browse/SPARK-11337 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13295) ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new instances of arrays/vectors for each record
[ https://issues.apache.org/jira/browse/SPARK-13295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Narine Kokhlikyan updated SPARK-13295: -- Description: As also mentioned/marked by TODO in AFTAggregator.AFTAggregator.add(data: AFTPoint), a new array is being created for the intercept value and it is being concatenated with another array which contains the betas; the resulting Array is being converted into a Dense vector, which in its turn is being converted into a breeze vector. This is expensive and not necessarily beautiful. was: As also mentioned/marked by TODO in AFTAggregator.AFTAggregator.add(data: AFTPoint), a new array is being created for the intercept value and it is being concatenated with another array whith contains the betas; the resulting Array is being converted into a Dense vector, which in its turn is being converted into a breeze vector. This is expensive and not necessarily beautiful. > ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new > instances of arrays/vectors for each record > --- > > Key: SPARK-13295 > URL: https://issues.apache.org/jira/browse/SPARK-13295 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Narine Kokhlikyan > > As also mentioned/marked by TODO in AFTAggregator.AFTAggregator.add(data: > AFTPoint), a new array is being created for the intercept value and it is being > concatenated > with another array which contains the betas; the resulting Array is being > converted into a Dense vector, which in its turn is being converted into > a breeze vector. > This is expensive and not necessarily beautiful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
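A schematic of the per-record allocation pattern being criticized, and one way to avoid it; this is an illustration of the idea only, not the actual AFTAggregator code or the eventual patch.

{code}
import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.mllib.linalg.Vector

// Pattern described above: for every record, the intercept entry is prepended
// to a fresh copy of the feature array, and the result is wrapped in yet
// another vector before the dot product is taken.
def marginSlow(beta: BDV[Double], features: Vector): Double = {
  val xi = BDV(Array(1.0) ++ features.toArray) // new arrays/vectors per record
  beta dot xi
}

// Allocation-free alternative: handle the intercept term separately and read
// the existing feature vector in place, so no per-record concatenation occurs.
def marginFast(intercept: Double, coefficients: Array[Double], features: Vector): Double = {
  var margin = intercept
  var i = 0
  while (i < coefficients.length) {
    margin += coefficients(i) * features(i)
    i += 1
  }
  margin
}
{code}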
[jira] [Assigned] (SPARK-13295) ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new instances of arrays/vectors for each record
[ https://issues.apache.org/jira/browse/SPARK-13295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13295: Assignee: Apache Spark > ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new > instances of arrays/vectors for each record > --- > > Key: SPARK-13295 > URL: https://issues.apache.org/jira/browse/SPARK-13295 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Narine Kokhlikyan >Assignee: Apache Spark > > As also mentioned/marked by TODO in AFTAggregator.AFTAggregator.add(data: > AFTPoint) a new array is being created for intercept value and it is being > concatenated > with another array which contains the betas, the resulted Array is being > converted into a Dense vector which in it's turn is being converted into > breeze vector. > This is expensive and not necessarily beautiful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13295) ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new instances of arrays/vectors for each record
[ https://issues.apache.org/jira/browse/SPARK-13295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13295: Assignee: (was: Apache Spark) > ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new > instances of arrays/vectors for each record > --- > > Key: SPARK-13295 > URL: https://issues.apache.org/jira/browse/SPARK-13295 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Narine Kokhlikyan > > As also mentioned/marked by TODO in AFTAggregator.AFTAggregator.add(data: > AFTPoint) a new array is being created for intercept value and it is being > concatenated > with another array which contains the betas, the resulted Array is being > converted into a Dense vector which in it's turn is being converted into > breeze vector. > This is expensive and not necessarily beautiful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13295) ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new instances of arrays/vectors for each record
[ https://issues.apache.org/jira/browse/SPARK-13295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144067#comment-15144067 ] Apache Spark commented on SPARK-13295: -- User 'NarineK' has created a pull request for this issue: https://github.com/apache/spark/pull/11179 > ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new > instances of arrays/vectors for each record > --- > > Key: SPARK-13295 > URL: https://issues.apache.org/jira/browse/SPARK-13295 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Narine Kokhlikyan > > As also mentioned/marked by TODO in AFTAggregator.AFTAggregator.add(data: > AFTPoint) a new array is being created for intercept value and it is being > concatenated > with another array which contains the betas, the resulted Array is being > converted into a Dense vector which in it's turn is being converted into > breeze vector. > This is expensive and not necessarily beautiful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13295) ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new instances of arrays/vectors for each record
[ https://issues.apache.org/jira/browse/SPARK-13295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Narine Kokhlikyan updated SPARK-13295: -- Description: As also mentioned/marked by TODO in AFTAggregator.AFTAggregator.add(data: AFTPoint) a new array is being created for intercept value and it is being concatenated with another array whith contains the betas, the resulted Array is being converted into a Dense vector which in it's turn is being converted into breeze vector. This is expensive and not necessarily beautiful. was: As also mentioned/marked by TODO in AFTAggregator.AFTAggregator.add(data: AFTPoint) a new array is being created for intercept value and it is being concatenated with another array with contains the betas, the resulted Array is being converted into a Dense vector which in it's turn is being converted into breeze vector. This is expensive and not necessarily beautiful. > ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new > instances of arrays/vectors for each record > --- > > Key: SPARK-13295 > URL: https://issues.apache.org/jira/browse/SPARK-13295 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Narine Kokhlikyan > > As also mentioned/marked by TODO in AFTAggregator.AFTAggregator.add(data: > AFTPoint) a new array is being created for intercept value and it is being > concatenated > with another array whith contains the betas, the resulted Array is being > converted into a Dense vector which in it's turn is being converted into > breeze vector. > This is expensive and not necessarily beautiful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13295) ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new instances of arrays/vectors for each record
Narine Kokhlikyan created SPARK-13295: - Summary: ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new instances of arrays/vectors for each record Key: SPARK-13295 URL: https://issues.apache.org/jira/browse/SPARK-13295 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Narine Kokhlikyan As also mentioned/marked by TODO in AFTAggregator.AFTAggregator.add(data: AFTPoint), a new array is being created for the intercept value and it is being concatenated with another array which contains the betas; the resulting Array is being converted into a Dense vector, which in its turn is being converted into a breeze vector. This is expensive and not necessarily beautiful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7889) Jobs progress of apps on complete page of HistoryServer shows uncompleted
[ https://issues.apache.org/jira/browse/SPARK-7889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-7889: Assignee: Steve Loughran > Jobs progress of apps on complete page of HistoryServer shows uncompleted > - > > Key: SPARK-7889 > URL: https://issues.apache.org/jira/browse/SPARK-7889 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: meiyoula >Assignee: Steve Loughran >Priority: Minor > Fix For: 2.0.0 > > > When running a SparkPi with 2000 tasks, clicking into the app on the incomplete > page, the job progress shows 400/2000. After the app is completed, the app > moves from the incomplete page to the complete page, and now, clicking into the app, the job > progress still shows 400/2000. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7889) Jobs progress of apps on complete page of HistoryServer shows uncompleted
[ https://issues.apache.org/jira/browse/SPARK-7889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-7889. - Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 8 [https://github.com/apache/spark/pull/8] > Jobs progress of apps on complete page of HistoryServer shows uncompleted > - > > Key: SPARK-7889 > URL: https://issues.apache.org/jira/browse/SPARK-7889 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: meiyoula >Priority: Minor > Fix For: 2.0.0 > > > When running a SparkPi with 2000 tasks, clicking into the app on the incomplete > page, the job progress shows 400/2000. After the app is completed, the app > moves from the incomplete page to the complete page, and now, clicking into the app, the job > progress still shows 400/2000. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13097) Extend Binarizer to allow Double AND Vector inputs
[ https://issues.apache.org/jira/browse/SPARK-13097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-13097: - Assignee: Xiangrui Meng > Extend Binarizer to allow Double AND Vector inputs > -- > > Key: SPARK-13097 > URL: https://issues.apache.org/jira/browse/SPARK-13097 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Mike Seddon >Assignee: Xiangrui Meng >Priority: Minor > > To enhance the existing SparkML Binarizer [SPARK-5891] to allow Vector in > addition to the existing Double input column type. > https://github.com/apache/spark/pull/10976 > A use case for this enhancement is when a user wants to Binarize many > similar feature columns at once using the same threshold value. > A real-world example for this would be where the authors of one of the > leading MNIST handwriting character recognition entries convert 784 > grayscale (0-255) pixels (28x28 pixel images) to binary if the pixel's > grayscale exceeds 127.5: (http://arxiv.org/abs/1003.0358). With this > modification the user is able to: VectorAssembler(784 > columns)->Binarizer(127.5)->Classifier as all the pixels are of a similar > type. > This approach also allows much easier use of the ParamGridBuilder to test > multiple threshold values. > I have already written the code and unit tests and have tested it in a > multilayer perceptron classifier workflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
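To make the MNIST use case concrete, here is a sketch of the intended workflow with the extended Binarizer. The column names and the training DataFrame are hypothetical, and accepting a Vector input column is exactly what this ticket proposes.

{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.feature.{Binarizer, VectorAssembler}

// Assemble the 784 per-pixel columns (hypothetical names pixel0..pixel783)
// into a single vector column.
val assembler = new VectorAssembler()
  .setInputCols((0 until 784).map(i => s"pixel$i").toArray)
  .setOutputCol("pixels")

// With the proposed change, Binarizer accepts the Vector column directly and
// applies one threshold to every element.
val binarizer = new Binarizer()
  .setInputCol("pixels")
  .setOutputCol("binaryPixels")
  .setThreshold(127.5)

val classifier = new MultilayerPerceptronClassifier()
  .setFeaturesCol("binaryPixels")
  .setLabelCol("label")
  .setLayers(Array(784, 100, 10))

val pipeline = new Pipeline().setStages(Array(assembler, binarizer, classifier))
// val model = pipeline.fit(trainingDF)  // trainingDF: pixel0..pixel783 plus label
{code}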
[jira] [Updated] (SPARK-13097) Extend Binarizer to allow Double AND Vector inputs
[ https://issues.apache.org/jira/browse/SPARK-13097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13097: -- Assignee: Mike Seddon (was: Xiangrui Meng) > Extend Binarizer to allow Double AND Vector inputs > -- > > Key: SPARK-13097 > URL: https://issues.apache.org/jira/browse/SPARK-13097 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Mike Seddon >Assignee: Mike Seddon >Priority: Minor > > To enhance the existing SparkML Binarizer [SPARK-5891] to allow Vector in > addition to the existing Double input column type. > https://github.com/apache/spark/pull/10976 > A use case for this enhancement is for when a user wants to Binarize many > similar feature columns at once using the same threshold value. > A real-world example for this would be where the authors of one of the > leading MNIST handwriting character recognition entries converts 784 > grayscale (0-255) pixels (28x28 pixel images) to binary if the pixel's > grayscale exceeds 127.5: (http://arxiv.org/abs/1003.0358). With this > modification the user is able to: VectorAssembler(784 > columns)->Binarizer(127.5)->Classifier as all the pixels are of a similar > type. > This approach also allows much easier use of the ParamGridBuilder to test > multiple theshold values. > I have already written the code and unit tests and have tested in a > Multilayer perceptron classifier workflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13097) Extend Binarizer to allow Double AND Vector inputs
[ https://issues.apache.org/jira/browse/SPARK-13097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13097: -- Shepherd: Xiangrui Meng (was: Liang-Chi Hsieh) > Extend Binarizer to allow Double AND Vector inputs > -- > > Key: SPARK-13097 > URL: https://issues.apache.org/jira/browse/SPARK-13097 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Mike Seddon >Assignee: Mike Seddon >Priority: Minor > > To enhance the existing SparkML Binarizer [SPARK-5891] to allow Vector in > addition to the existing Double input column type. > https://github.com/apache/spark/pull/10976 > A use case for this enhancement is for when a user wants to Binarize many > similar feature columns at once using the same threshold value. > A real-world example for this would be where the authors of one of the > leading MNIST handwriting character recognition entries converts 784 > grayscale (0-255) pixels (28x28 pixel images) to binary if the pixel's > grayscale exceeds 127.5: (http://arxiv.org/abs/1003.0358). With this > modification the user is able to: VectorAssembler(784 > columns)->Binarizer(127.5)->Classifier as all the pixels are of a similar > type. > This approach also allows much easier use of the ParamGridBuilder to test > multiple theshold values. > I have already written the code and unit tests and have tested in a > Multilayer perceptron classifier workflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13097) Extend Binarizer to allow Double AND Vector inputs
[ https://issues.apache.org/jira/browse/SPARK-13097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13097: -- Target Version/s: 2.0.0 > Extend Binarizer to allow Double AND Vector inputs > -- > > Key: SPARK-13097 > URL: https://issues.apache.org/jira/browse/SPARK-13097 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Mike Seddon >Assignee: Mike Seddon >Priority: Minor > > To enhance the existing SparkML Binarizer [SPARK-5891] to allow Vector in > addition to the existing Double input column type. > https://github.com/apache/spark/pull/10976 > A use case for this enhancement is for when a user wants to Binarize many > similar feature columns at once using the same threshold value. > A real-world example for this would be where the authors of one of the > leading MNIST handwriting character recognition entries converts 784 > grayscale (0-255) pixels (28x28 pixel images) to binary if the pixel's > grayscale exceeds 127.5: (http://arxiv.org/abs/1003.0358). With this > modification the user is able to: VectorAssembler(784 > columns)->Binarizer(127.5)->Classifier as all the pixels are of a similar > type. > This approach also allows much easier use of the ParamGridBuilder to test > multiple theshold values. > I have already written the code and unit tests and have tested in a > Multilayer perceptron classifier workflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13153) PySpark ML persistence failed when handle no default value parameter
[ https://issues.apache.org/jira/browse/SPARK-13153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-13153. --- Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 11043 [https://github.com/apache/spark/pull/11043] > PySpark ML persistence failed when handle no default value parameter > > > Key: SPARK-13153 > URL: https://issues.apache.org/jira/browse/SPARK-13153 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: Tommy Yu >Assignee: Tommy Yu >Priority: Minor > Fix For: 2.0.0, 1.6.1 > > > This defect was found when implementing task SPARK-13033, when adding the code below to a > doctest. > It looks like _transfer_params_from_java did not consider params which do > not have a default value; we should handle them. > >>> import os, tempfile > >>> path = tempfile.mkdtemp() > >>> aftsr_path = path + "/aftsr" > >>> aftsr.save(aftsr_path) > >>> aftsr2 = AFTSurvivalRegression.load(aftsr_path) > Exception detail. > ir2 = IsotonicRegression.load(ir_path) > Exception raised: > Traceback (most recent call last): > File "C:\Python27\lib\doctest.py", line 1289, in run > compileflags, 1) in test.globs > File "", line 1, in > ir2 = IsotonicRegression.load(ir_path) > File > "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\ml\util.py", > line 194, in load > return cls.read().load(path) > File > "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\ml\util.py", > line 148, in load > instance._transfer_params_from_java() > File > "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\ml\wrapper.py", > line 82, in _transfer_params_from_java > value = _java2py(sc, self._java_obj.getOrDefault(java_param)) > File > "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py", > line 813, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File > "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\sql\utils.py", > line 45, in deco > return f(*a, **kw) > File > "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\protocol.py", > line 308, in get_return_value > format(target_id, ".", name), value) > Py4JJavaError: An error occurred while calling o351.getOrDefault.
> : java.util.NoSuchElementException: Failed to find a default value for > weightCol > at > org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:647) > at > org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:647) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:646) > at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:43) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:209) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
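The actual fix lands on the PySpark side of _transfer_params_from_java, but the guard it needs can be stated against the JVM Params API; the helper below is a schematic of that guard under that assumption, not the merged change itself.

{code}
import org.apache.spark.ml.param.{Param, Params}

// Only read a param value when it is explicitly set or has a default;
// calling getOrDefault unconditionally throws NoSuchElementException for
// params such as weightCol that have neither (the failure shown above).
def safeGet[T](stage: Params, param: Param[T]): Option[T] = {
  if (stage.isSet(param) || stage.hasDefault(param)) {
    Some(stage.getOrDefault(param))
  } else {
    None
  }
}
{code}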
[jira] [Resolved] (SPARK-12746) ArrayType(_, true) should also accept ArrayType(_, false)
[ https://issues.apache.org/jira/browse/SPARK-12746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-12746. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10697 [https://github.com/apache/spark/pull/10697] > ArrayType(_, true) should also accept ArrayType(_, false) > - > > Key: SPARK-12746 > URL: https://issues.apache.org/jira/browse/SPARK-12746 > Project: Spark > Issue Type: Bug > Components: ML, SQL >Affects Versions: 1.6.0 >Reporter: Earthson Lu >Assignee: Earthson Lu > Fix For: 2.0.0 > > > I see CountVectorizer has schema check for ArrayType which has > ArrayType(StringType, true). > ArrayType(String, false) is just a special case of ArrayType(String, true), > but it will not pass this type check. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
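A schematic of the relaxed schema check being described: match on the element type and ignore containsNull, instead of requiring an exact ArrayType(StringType, true). This illustrates the idea only; it is not the actual CountVectorizer code.

{code}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

// Accept ArrayType(StringType, true) and ArrayType(StringType, false) alike:
// only the element type matters for this transformer.
def validateInputType(schema: StructType, inputCol: String): Unit = {
  schema(inputCol).dataType match {
    case ArrayType(StringType, _) => // ok, containsNull may be either value
    case other => throw new IllegalArgumentException(
      s"Column $inputCol must be an array of strings but was $other.")
  }
}
{code}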
[jira] [Resolved] (SPARK-12915) SQL metrics for generated operators
[ https://issues.apache.org/jira/browse/SPARK-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-12915. - Resolution: Fixed Fix Version/s: 2.0.0 > SQL metrics for generated operators > --- > > Key: SPARK-12915 > URL: https://issues.apache.org/jira/browse/SPARK-12915 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > The metrics should be very efficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12375) VectorIndexer: allow unknown categories
[ https://issues.apache.org/jira/browse/SPARK-12375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-12375: -- Assignee: yuhao yang (was: Apache Spark) > VectorIndexer: allow unknown categories > --- > > Key: SPARK-12375 > URL: https://issues.apache.org/jira/browse/SPARK-12375 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: yuhao yang > > Add option for allowing unknown categories, probably via a parameter like > "allowUnknownCategories." > If true, then handle unknown categories during transform by assigning them to > an extra category index. > The API should resemble the API used for StringIndexer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
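For comparison, here is the existing StringIndexer option the ticket points to, and the rough shape the proposed parameter could take; setAllowUnknownCategories does not exist, it is only the ticket's suggestion written out as a sketch.

{code}
import org.apache.spark.ml.feature.{StringIndexer, VectorIndexer}

// Existing analogue: StringIndexer can skip labels unseen during fitting
// instead of failing at transform time.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("skip")

// Hypothetical shape of the proposed VectorIndexer option: unknown categories
// would be mapped to an extra category index rather than raising an error.
val vectorIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(10)
  // .setAllowUnknownCategories(true)  // proposed in this ticket; not an existing API
{code}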
[jira] [Updated] (SPARK-12746) ArrayType(_, true) should also accept ArrayType(_, false)
[ https://issues.apache.org/jira/browse/SPARK-12746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-12746: -- Shepherd: Xiangrui Meng > ArrayType(_, true) should also accept ArrayType(_, false) > - > > Key: SPARK-12746 > URL: https://issues.apache.org/jira/browse/SPARK-12746 > Project: Spark > Issue Type: Bug > Components: ML, SQL >Affects Versions: 1.6.0 >Reporter: Earthson Lu >Assignee: Earthson Lu > > I see CountVectorizer has schema check for ArrayType which has > ArrayType(StringType, true). > ArrayType(String, false) is just a special case of ArrayType(String, true), > but it will not pass this type check. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11940) Python API for ml.clustering.LDA
[ https://issues.apache.org/jira/browse/SPARK-11940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11940: -- Shepherd: Yanbo Liang > Python API for ml.clustering.LDA > > > Key: SPARK-11940 > URL: https://issues.apache.org/jira/browse/SPARK-11940 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Jeff Zhang > > Add Python API for ml.clustering.LDA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11940) Python API for ml.clustering.LDA
[ https://issues.apache.org/jira/browse/SPARK-11940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11940: -- Assignee: Jeff Zhang > Python API for ml.clustering.LDA > > > Key: SPARK-11940 > URL: https://issues.apache.org/jira/browse/SPARK-11940 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Jeff Zhang > > Add Python API for ml.clustering.LDA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12765) CountVectorizerModel.transform lost the transformSchema
[ https://issues.apache.org/jira/browse/SPARK-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-12765: -- Target Version/s: 2.0.0 > CountVectorizerModel.transform lost the transformSchema > --- > > Key: SPARK-12765 > URL: https://issues.apache.org/jira/browse/SPARK-12765 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0, 1.6.1 >Reporter: sloth >Assignee: sloth > Labels: patch > Fix For: 2.0.0 > > > In ml package , CountVectorizerModel forgot to do transformSchema in > transform function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
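As context for the fix: transformSchema is the public schema-validation hook on every PipelineStage, and transform is expected to invoke it before producing output. A quick way to see what the missing check does, runnable in a Spark 2.0 spark-shell where `spark` is predefined:

{code}
// Runnable in a Spark 2.0 spark-shell; shows the schema check that
// CountVectorizerModel.transform was skipping.
import org.apache.spark.ml.feature.CountVectorizer
import spark.implicits._

val df = Seq((0, Array("a", "b", "a")), (1, Array("b", "c"))).toDF("id", "words")
val model = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .fit(df)

println(model.transformSchema(df.schema))  // validates input, describes the output schema
model.transform(df).show()                 // after the fix, performs the same validation
{code}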
[jira] [Updated] (SPARK-12765) CountVectorizerModel.transform lost the transformSchema
[ https://issues.apache.org/jira/browse/SPARK-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-12765: -- Assignee: sloth > CountVectorizerModel.transform lost the transformSchema > --- > > Key: SPARK-12765 > URL: https://issues.apache.org/jira/browse/SPARK-12765 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0, 1.6.1 >Reporter: sloth >Assignee: sloth > Labels: patch > Fix For: 2.0.0 > > > In ml package , CountVectorizerModel forgot to do transformSchema in > transform function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12765) CountVectorizerModel.transform lost the transformSchema
[ https://issues.apache.org/jira/browse/SPARK-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-12765. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10720 [https://github.com/apache/spark/pull/10720] > CountVectorizerModel.transform lost the transformSchema > --- > > Key: SPARK-12765 > URL: https://issues.apache.org/jira/browse/SPARK-12765 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0, 1.6.1 >Reporter: sloth > Labels: patch > Fix For: 2.0.0 > > > In ml package , CountVectorizerModel forgot to do transformSchema in > transform function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12746) ArrayType(_, true) should also accept ArrayType(_, false)
[ https://issues.apache.org/jira/browse/SPARK-12746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-12746: -- Assignee: Earthson Lu > ArrayType(_, true) should also accept ArrayType(_, false) > - > > Key: SPARK-12746 > URL: https://issues.apache.org/jira/browse/SPARK-12746 > Project: Spark > Issue Type: Bug > Components: ML, SQL >Affects Versions: 1.6.0 >Reporter: Earthson Lu >Assignee: Earthson Lu > > I see CountVectorizer has schema check for ArrayType which has > ArrayType(StringType, true). > ArrayType(String, false) is just a special case of ArrayType(String, true), > but it will not pass this type check. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12746) ArrayType(_, true) should also accept ArrayType(_, false)
[ https://issues.apache.org/jira/browse/SPARK-12746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-12746: -- Target Version/s: 1.6.1, 2.0.0 > ArrayType(_, true) should also accept ArrayType(_, false) > - > > Key: SPARK-12746 > URL: https://issues.apache.org/jira/browse/SPARK-12746 > Project: Spark > Issue Type: Bug > Components: ML, SQL >Affects Versions: 1.6.0 >Reporter: Earthson Lu >Assignee: Earthson Lu > > I see CountVectorizer has schema check for ArrayType which has > ArrayType(StringType, true). > ArrayType(String, false) is just a special case of ArrayType(String, true), > but it will not pass this type check. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13294) Don't build assembly in dev/run-tests
[ https://issues.apache.org/jira/browse/SPARK-13294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13294: Assignee: (was: Apache Spark) > Don't build assembly in dev/run-tests > - > > Key: SPARK-13294 > URL: https://issues.apache.org/jira/browse/SPARK-13294 > Project: Spark > Issue Type: Improvement >Reporter: Josh Rosen > > As of SPARK-9284 we should no longer need to build the full Spark assembly > JAR in order to run tests. Therefore, we should remove the assembly step from > {{dev/run-tests}} in order to reduce build + test time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13294) Don't build assembly in dev/run-tests
[ https://issues.apache.org/jira/browse/SPARK-13294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143846#comment-15143846 ] Apache Spark commented on SPARK-13294: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/11178 > Don't build assembly in dev/run-tests > - > > Key: SPARK-13294 > URL: https://issues.apache.org/jira/browse/SPARK-13294 > Project: Spark > Issue Type: Improvement >Reporter: Josh Rosen > > As of SPARK-9284 we should no longer need to build the full Spark assembly > JAR in order to run tests. Therefore, we should remove the assembly step from > {{dev/run-tests}} in order to reduce build + test time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13294) Don't build assembly in dev/run-tests
[ https://issues.apache.org/jira/browse/SPARK-13294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13294: Assignee: Apache Spark > Don't build assembly in dev/run-tests > - > > Key: SPARK-13294 > URL: https://issues.apache.org/jira/browse/SPARK-13294 > Project: Spark > Issue Type: Improvement >Reporter: Josh Rosen >Assignee: Apache Spark > > As of SPARK-9284 we should no longer need to build the full Spark assembly > JAR in order to run tests. Therefore, we should remove the assembly step from > {{dev/run-tests}} in order to reduce build + test time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13294) Don't build assembly in dev/run-tests
Josh Rosen created SPARK-13294: -- Summary: Don't build assembly in dev/run-tests Key: SPARK-13294 URL: https://issues.apache.org/jira/browse/SPARK-13294 Project: Spark Issue Type: Improvement Reporter: Josh Rosen As of SPARK-9284 we should no longer need to build the full Spark assembly JAR in order to run tests. Therefore, we should remove the assembly step from {{dev/run-tests}} in order to reduce build + test time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13033) PySpark ml.regression support export/import
[ https://issues.apache.org/jira/browse/SPARK-13033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13033: -- Shepherd: Yanbo Liang Target Version/s: 2.0.0 > PySpark ml.regression support export/import > --- > > Key: SPARK-13033 > URL: https://issues.apache.org/jira/browse/SPARK-13033 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Tommy Yu >Priority: Minor > > Add export/import for all estimators and transformers(which have Scala > implementation) under pyspark/ml/regression.py. Please refer the > implementation at SPARK-13032. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13033) PySpark ml.regression support export/import
[ https://issues.apache.org/jira/browse/SPARK-13033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13033: -- Assignee: Tommy Yu > PySpark ml.regression support export/import > --- > > Key: SPARK-13033 > URL: https://issues.apache.org/jira/browse/SPARK-13033 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Tommy Yu >Priority: Minor > > Add export/import for all estimators and transformers(which have Scala > implementation) under pyspark/ml/regression.py. Please refer the > implementation at SPARK-13032. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13153) PySpark ML persistence failed when handle no default value parameter
[ https://issues.apache.org/jira/browse/SPARK-13153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13153: -- Shepherd: Yanbo Liang Target Version/s: 1.6.1, 2.0.0 > PySpark ML persistence failed when handle no default value parameter > > > Key: SPARK-13153 > URL: https://issues.apache.org/jira/browse/SPARK-13153 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: Tommy Yu >Assignee: Tommy Yu >Priority: Minor > > This defect find when implement task spark-13033. When add below code to > doctest. > It looks like _transfer_params_from_java did not consider the params which do > not have default value and we should handle them. > >>> import os, tempfile > >>> path = tempfile.mkdtemp() > >>> aftsr_path = path + "/aftsr" > >>> aftsr.save(aftsr_path) > >>> aftsr2 = AFTSurvivalRegression.load(aftsr_path) > Exception detail. > ir2 = IsotonicRegression.load(ir_path) > Exception raised: > Traceback (most recent call last): > File "C:\Python27\lib\doctest.py", line 1289, in run > compileflags, 1) in test.globs > File "", line 1, in > ir2 = IsotonicRegression.load(ir_path) > File > "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\ml\util.py", > line 194, in load > return cls.read().load(path) > File > "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\ml\util.py", > line 148, in load > instance.transfer_params_from_java() > File > "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\ml\wrapper.py", > line 82, in tran > fer_params_from_java > value = _java2py(sc, self._java_obj.getOrDefault(java_param)) > File > "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py", > line 813, in > _call > answer, self.gateway_client, self.target_id, self.name) > File > "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\sql\utils.py", > line 45, in deco > return f(a, *kw) > File > "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\protocol.py", > line 308, in get_ > eturn_value > format(target_id, ".", name), value) > Py4JJavaError: An error occurred while calling o351.getOrDefault. 
> : java.util.NoSuchElementException: Failed to find a default value for > weightCol > at > org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:647) > at > org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:647) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:646) > at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:43) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:209) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13153) PySpark ML persistence failed when handle no default value parameter
[ https://issues.apache.org/jira/browse/SPARK-13153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13153: -- Assignee: Tommy Yu > PySpark ML persistence failed when handle no default value parameter > > > Key: SPARK-13153 > URL: https://issues.apache.org/jira/browse/SPARK-13153 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: Tommy Yu >Assignee: Tommy Yu >Priority: Minor > > This defect find when implement task spark-13033. When add below code to > doctest. > It looks like _transfer_params_from_java did not consider the params which do > not have default value and we should handle them. > >>> import os, tempfile > >>> path = tempfile.mkdtemp() > >>> aftsr_path = path + "/aftsr" > >>> aftsr.save(aftsr_path) > >>> aftsr2 = AFTSurvivalRegression.load(aftsr_path) > Exception detail. > ir2 = IsotonicRegression.load(ir_path) > Exception raised: > Traceback (most recent call last): > File "C:\Python27\lib\doctest.py", line 1289, in run > compileflags, 1) in test.globs > File "", line 1, in > ir2 = IsotonicRegression.load(ir_path) > File > "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\ml\util.py", > line 194, in load > return cls.read().load(path) > File > "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\ml\util.py", > line 148, in load > instance.transfer_params_from_java() > File > "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\ml\wrapper.py", > line 82, in tran > fer_params_from_java > value = _java2py(sc, self._java_obj.getOrDefault(java_param)) > File > "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py", > line 813, in > _call > answer, self.gateway_client, self.target_id, self.name) > File > "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\sql\utils.py", > line 45, in deco > return f(a, *kw) > File > "C:\aWorkFolder\spark\spark-1.6.0-bin-hadoop2.6\spark-1.6.0-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\protocol.py", > line 308, in get_ > eturn_value > format(target_id, ".", name), value) > Py4JJavaError: An error occurred while calling o351.getOrDefault. 
> : java.util.NoSuchElementException: Failed to find a default value for > weightCol > at > org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:647) > at > org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:647) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:646) > at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:43) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:209) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13011) K-means wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-13011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13011: -- Shepherd: Xiangrui Meng > K-means wrapper in SparkR > - > > Key: SPARK-13011 > URL: https://issues.apache.org/jira/browse/SPARK-13011 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Xiangrui Meng >Assignee: Xusen Yin > > Implement a simple wrapper in SparkR to support k-means. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13293) Generate code for Expand
[ https://issues.apache.org/jira/browse/SPARK-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13293: Assignee: Davies Liu (was: Apache Spark) > Generate code for Expand > > > Key: SPARK-13293 > URL: https://issues.apache.org/jira/browse/SPARK-13293 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13293) Generate code for Expand
[ https://issues.apache.org/jira/browse/SPARK-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143802#comment-15143802 ] Apache Spark commented on SPARK-13293: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/11177 > Generate code for Expand > > > Key: SPARK-13293 > URL: https://issues.apache.org/jira/browse/SPARK-13293 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13293) Generate code for Expand
[ https://issues.apache.org/jira/browse/SPARK-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13293: Assignee: Apache Spark (was: Davies Liu) > Generate code for Expand > > > Key: SPARK-13293 > URL: https://issues.apache.org/jira/browse/SPARK-13293 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13047) Pyspark Params.hasParam should not throw an error
[ https://issues.apache.org/jira/browse/SPARK-13047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-13047. --- Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 10962 [https://github.com/apache/spark/pull/10962] > Pyspark Params.hasParam should not throw an error > - > > Key: SPARK-13047 > URL: https://issues.apache.org/jira/browse/SPARK-13047 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson >Priority: Minor > Fix For: 2.0.0, 1.6.1 > > > Pyspark {{Params}} class has a method {{hasParam(paramName)}} which returns > True if the class has a parameter by that name, but throws an > {{AttributeError}} otherwise. There is not currently a way of getting a > Boolean to indicate if a class has a parameter. With Spark 2.0 we could > modify the existing behavior of {{hasParam}} or add an additional method with > this functionality. > In Python: > {code} > from pyspark.ml.classification import NaiveBayes > nb = NaiveBayes(smoothing=0.5) > print nb.hasParam("smoothing") > print nb.hasParam("notAParam") > {code} > produces: > > True > > AttributeError: 'NaiveBayes' object has no attribute 'notAParam' > However, in Scala: > {code} > import org.apache.spark.ml.classification.NaiveBayes > val nb = new NaiveBayes() > nb.hasParam("smoothing") > nb.hasParam("notAParam") > {code} > produces: > > true > > false -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12949) Support common expression elimination
[ https://issues.apache.org/jira/browse/SPARK-12949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143787#comment-15143787 ] Davies Liu commented on SPARK-12949: After some prototyping, enabling common expression elimination could give a 10+% improvement on stddev, but a 50% regression on Kurtosis; I have not figured out why. Maybe the JIT can already eliminate the common expressions (given that Kurtosis is only 20% slower than stddev)? If so, we may not want to do this. > Support common expression elimination > - > > Key: SPARK-12949 > URL: https://issues.apache.org/jira/browse/SPARK-12949 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
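For context on what is being benchmarked: common (sub)expression elimination means evaluating a repeated expression once per row and reusing the result, instead of recomputing it at every occurrence. A hand-written Scala illustration of the idea (not the actual generated code):

{code}
// Without elimination: (a + b) is evaluated twice per call.
def naive(a: Double, b: Double): Double = (a + b) * (a + b)

// With elimination: the shared subexpression is evaluated once and reused,
// which is what the generated operator code would ideally do.
def eliminated(a: Double, b: Double): Double = {
  val sum = a + b
  sum * sum
}
{code}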
[jira] [Created] (SPARK-13293) Generate code for Expand
Davies Liu created SPARK-13293: -- Summary: Generate code for Expand Key: SPARK-13293 URL: https://issues.apache.org/jira/browse/SPARK-13293 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13069) ActorHelper is not throttled by rate limiter
[ https://issues.apache.org/jira/browse/SPARK-13069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143768#comment-15143768 ] Lin Zhao commented on SPARK-13069: -- [~zsxwing] I created the PR, please review at your convenience. We are running a patched server but if this can get into 2.0.0 it would be very helpful for us. > ActorHelper is not throttled by rate limiter > > > Key: SPARK-13069 > URL: https://issues.apache.org/jira/browse/SPARK-13069 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Lin Zhao > > The rate an actor receiver sends data to spark is not limited by maxRate or > back pressure. Spark would control how fast it writes the data to block > manager, but the receiver actor sends events asynchronously and would fill > out akka mailbox with millions of events until memory runs out. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
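The gist of the requested change is that the hand-off from the actor to Spark should block behind a rate limiter instead of being an unbounded asynchronous send. A generic Scala illustration of that idea using Guava's RateLimiter (not the actual Spark patch):

{code}
import com.google.common.util.concurrent.RateLimiter

// Generic illustration, not Spark's implementation: cap how many events per
// second a producer may hand off, so it cannot flood an unbounded mailbox.
val maxRatePerSecond = 1000.0
val limiter = RateLimiter.create(maxRatePerSecond)

def throttledStore[T](event: T)(store: T => Unit): Unit = {
  limiter.acquire()  // blocks the producer once the configured rate is exceeded
  store(event)
}
{code}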
[jira] [Commented] (SPARK-13069) ActorHelper is not throttled by rate limiter
[ https://issues.apache.org/jira/browse/SPARK-13069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143760#comment-15143760 ] Apache Spark commented on SPARK-13069: -- User 'lin-zhao' has created a pull request for this issue: https://github.com/apache/spark/pull/11176 > ActorHelper is not throttled by rate limiter > > > Key: SPARK-13069 > URL: https://issues.apache.org/jira/browse/SPARK-13069 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Lin Zhao > > The rate an actor receiver sends data to spark is not limited by maxRate or > back pressure. Spark would control how fast it writes the data to block > manager, but the receiver actor sends events asynchronously and would fill > out akka mailbox with millions of events until memory runs out. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13069) ActorHelper is not throttled by rate limiter
[ https://issues.apache.org/jira/browse/SPARK-13069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13069: Assignee: (was: Apache Spark) > ActorHelper is not throttled by rate limiter > > > Key: SPARK-13069 > URL: https://issues.apache.org/jira/browse/SPARK-13069 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Lin Zhao > > The rate an actor receiver sends data to spark is not limited by maxRate or > back pressure. Spark would control how fast it writes the data to block > manager, but the receiver actor sends events asynchronously and would fill > out akka mailbox with millions of events until memory runs out. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13069) ActorHelper is not throttled by rate limiter
[ https://issues.apache.org/jira/browse/SPARK-13069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13069: Assignee: Apache Spark > ActorHelper is not throttled by rate limiter > > > Key: SPARK-13069 > URL: https://issues.apache.org/jira/browse/SPARK-13069 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Lin Zhao >Assignee: Apache Spark > > The rate an actor receiver sends data to spark is not limited by maxRate or > back pressure. Spark would control how fast it writes the data to block > manager, but the receiver actor sends events asynchronously and would fill > out akka mailbox with millions of events until memory runs out. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7483) [MLLib] Using Kryo with FPGrowth fails with an exception
[ https://issues.apache.org/jira/browse/SPARK-7483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7483: - Shepherd: Sean Owen > [MLLib] Using Kryo with FPGrowth fails with an exception > > > Key: SPARK-7483 > URL: https://issues.apache.org/jira/browse/SPARK-7483 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.1 >Reporter: Tomasz Bartczak >Priority: Minor > > When using FPGrowth algorithm with KryoSerializer - Spark fails with > {code} > Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 9.0 (TID 16, localhost): > com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: > Can not set final scala.collection.mutable.ListBuffer field > org.apache.spark.mllib.fpm.FPTree$Summary.nodes to > scala.collection.mutable.ArrayBuffer > Serialization trace: > nodes (org.apache.spark.mllib.fpm.FPTree$Summary) > org$apache$spark$mllib$fpm$FPTree$$summaries > (org.apache.spark.mllib.fpm.FPTree) > {code} > This can be easily reproduced in spark codebase by setting > {code} > conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") > {code} and running FPGrowthSuite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
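For reference, switching an application to Kryo, which is what triggers the failure above, typically looks like the following; the FPTree classes involved are private to MLlib, so they appear only in a comment:

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("fpgrowth-kryo")
  .setMaster("local[*]")
  // This setting alone is enough to reproduce the reported FPGrowth failure.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // User classes can be registered like this, but the failing FPTree internals
  // are private[fpm] and cannot simply be registered from user code.
  .registerKryoClasses(Array(classOf[org.apache.spark.mllib.linalg.Vector]))

val sc = new SparkContext(conf)
{code}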
[jira] [Updated] (SPARK-13035) PySpark ml.clustering support export/import
[ https://issues.apache.org/jira/browse/SPARK-13035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13035: -- Assignee: Yanbo Liang > PySpark ml.clustering support export/import > --- > > Key: SPARK-13035 > URL: https://issues.apache.org/jira/browse/SPARK-13035 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > Fix For: 2.0.0 > > > Add export/import for all estimators and transformers(which have Scala > implementation) under pyspark/ml/clustering.py. Please refer the > implementation at SPARK-13032. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13035) PySpark ml.clustering support export/import
[ https://issues.apache.org/jira/browse/SPARK-13035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-13035. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10999 [https://github.com/apache/spark/pull/10999] > PySpark ml.clustering support export/import > --- > > Key: SPARK-13035 > URL: https://issues.apache.org/jira/browse/SPARK-13035 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Yanbo Liang >Priority: Minor > Fix For: 2.0.0 > > > Add export/import for all estimators and transformers(which have Scala > implementation) under pyspark/ml/clustering.py. Please refer the > implementation at SPARK-13032. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13035) PySpark ml.clustering support export/import
[ https://issues.apache.org/jira/browse/SPARK-13035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13035: -- Target Version/s: 2.0.0 > PySpark ml.clustering support export/import > --- > > Key: SPARK-13035 > URL: https://issues.apache.org/jira/browse/SPARK-13035 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > Fix For: 2.0.0 > > > Add export/import for all estimators and transformers(which have Scala > implementation) under pyspark/ml/clustering.py. Please refer the > implementation at SPARK-13032. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13037) PySpark ml.recommendation support export/import
[ https://issues.apache.org/jira/browse/SPARK-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13037: -- Target Version/s: 2.0.0 > PySpark ml.recommendation support export/import > --- > > Key: SPARK-13037 > URL: https://issues.apache.org/jira/browse/SPARK-13037 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Kai Jiang > Fix For: 2.0.0 > > > Add export/import for all estimators and transformers(which have Scala > implementation) under pyspark/ml/recommendation.py. Please refer the > implementation at SPARK-13032. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13037) PySpark ml.recommendation support export/import
[ https://issues.apache.org/jira/browse/SPARK-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-13037. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11044 [https://github.com/apache/spark/pull/11044] > PySpark ml.recommendation support export/import > --- > > Key: SPARK-13037 > URL: https://issues.apache.org/jira/browse/SPARK-13037 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Yanbo Liang > Fix For: 2.0.0 > > > Add export/import for all estimators and transformers(which have Scala > implementation) under pyspark/ml/recommendation.py. Please refer the > implementation at SPARK-13032. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13037) PySpark ml.recommendation support export/import
[ https://issues.apache.org/jira/browse/SPARK-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13037: -- Assignee: Kai Jiang > PySpark ml.recommendation support export/import > --- > > Key: SPARK-13037 > URL: https://issues.apache.org/jira/browse/SPARK-13037 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Kai Jiang > Fix For: 2.0.0 > > > Add export/import for all estimators and transformers(which have Scala > implementation) under pyspark/ml/recommendation.py. Please refer the > implementation at SPARK-13032. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13047) Pyspark Params.hasParam should not throw an error
[ https://issues.apache.org/jira/browse/SPARK-13047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13047: -- Target Version/s: 1.6.1, 2.0.0 (was: 1.6.1) > Pyspark Params.hasParam should not throw an error > - > > Key: SPARK-13047 > URL: https://issues.apache.org/jira/browse/SPARK-13047 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: Seth Hendrickson >Priority: Minor > > Pyspark {{Params}} class has a method {{hasParam(paramName)}} which returns > True if the class has a parameter by that name, but throws an > {{AttributeError}} otherwise. There is not currently a way of getting a > Boolean to indicate if a class has a parameter. With Spark 2.0 we could > modify the existing behavior of {{hasParam}} or add an additional method with > this functionality. > In Python: > {code} > from pyspark.ml.classification import NaiveBayes > nb = NaiveBayes(smoothing=0.5) > print nb.hasParam("smoothing") > print nb.hasParam("notAParam") > {code} > produces: > > True > > AttributeError: 'NaiveBayes' object has no attribute 'notAParam' > However, in Scala: > {code} > import org.apache.spark.ml.classification.NaiveBayes > val nb = new NaiveBayes() > nb.hasParam("smoothing") > nb.hasParam("notAParam") > {code} > produces: > > true > > false -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13047) Pyspark Params.hasParam should not throw an error
[ https://issues.apache.org/jira/browse/SPARK-13047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13047: -- Assignee: Seth Hendrickson > Pyspark Params.hasParam should not throw an error > - > > Key: SPARK-13047 > URL: https://issues.apache.org/jira/browse/SPARK-13047 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson >Priority: Minor > > Pyspark {{Params}} class has a method {{hasParam(paramName)}} which returns > True if the class has a parameter by that name, but throws an > {{AttributeError}} otherwise. There is not currently a way of getting a > Boolean to indicate if a class has a parameter. With Spark 2.0 we could > modify the existing behavior of {{hasParam}} or add an additional method with > this functionality. > In Python: > {code} > from pyspark.ml.classification import NaiveBayes > nb = NaiveBayes(smoothing=0.5) > print nb.hasParam("smoothing") > print nb.hasParam("notAParam") > {code} > produces: > > True > > AttributeError: 'NaiveBayes' object has no attribute 'notAParam' > However, in Scala: > {code} > import org.apache.spark.ml.classification.NaiveBayes > val nb = new NaiveBayes() > nb.hasParam("smoothing") > nb.hasParam("notAParam") > {code} > produces: > > true > > false -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13047) Pyspark Params.hasParam should not throw an error
[ https://issues.apache.org/jira/browse/SPARK-13047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13047: -- Target Version/s: 1.6.1 > Pyspark Params.hasParam should not throw an error > - > > Key: SPARK-13047 > URL: https://issues.apache.org/jira/browse/SPARK-13047 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: Seth Hendrickson >Priority: Minor > > Pyspark {{Params}} class has a method {{hasParam(paramName)}} which returns > True if the class has a parameter by that name, but throws an > {{AttributeError}} otherwise. There is not currently a way of getting a > Boolean to indicate if a class has a parameter. With Spark 2.0 we could > modify the existing behavior of {{hasParam}} or add an additional method with > this functionality. > In Python: > {code} > from pyspark.ml.classification import NaiveBayes > nb = NaiveBayes(smoothing=0.5) > print nb.hasParam("smoothing") > print nb.hasParam("notAParam") > {code} > produces: > > True > > AttributeError: 'NaiveBayes' object has no attribute 'notAParam' > However, in Scala: > {code} > import org.apache.spark.ml.classification.NaiveBayes > val nb = new NaiveBayes() > nb.hasParam("smoothing") > nb.hasParam("notAParam") > {code} > produces: > > true > > false -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13019) Replace example code in mllib-statistics.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13019: -- Shepherd: Xusen Yin > Replace example code in mllib-statistics.md using include_example > - > > Key: SPARK-13019 > URL: https://issues.apache.org/jira/browse/SPARK-13019 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xin Ren >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13019) Replace example code in mllib-statistics.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13019: -- Assignee: Xin Ren > Replace example code in mllib-statistics.md using include_example > - > > Key: SPARK-13019 > URL: https://issues.apache.org/jira/browse/SPARK-13019 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xin Ren >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13013) Replace example code in mllib-clustering.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13013: -- Assignee: Xin Ren > Replace example code in mllib-clustering.md using include_example > - > > Key: SPARK-13013 > URL: https://issues.apache.org/jira/browse/SPARK-13013 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xin Ren >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13013) Replace example code in mllib-clustering.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13013: -- Shepherd: Xusen Yin > Replace example code in mllib-clustering.md using include_example > - > > Key: SPARK-13013 > URL: https://issues.apache.org/jira/browse/SPARK-13013 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xin Ren >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13015) Replace example code in mllib-data-types.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13015: -- Assignee: Xin Ren > Replace example code in mllib-data-types.md using include_example > - > > Key: SPARK-13015 > URL: https://issues.apache.org/jira/browse/SPARK-13015 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xin Ren >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13018) Replace example code in mllib-pmml-model-export.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13018: -- Assignee: Xin Ren > Replace example code in mllib-pmml-model-export.md using include_example > > > Key: SPARK-13018 > URL: https://issues.apache.org/jira/browse/SPARK-13018 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xin Ren >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13018) Replace example code in mllib-pmml-model-export.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13018: -- Shepherd: Xusen Yin (was: Xusen Yin) > Replace example code in mllib-pmml-model-export.md using include_example > > > Key: SPARK-13018 > URL: https://issues.apache.org/jira/browse/SPARK-13018 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xin Ren >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13014) Replace example code in mllib-collaborative-filtering.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13014: -- Assignee: Xin Ren > Replace example code in mllib-collaborative-filtering.md using include_example > -- > > Key: SPARK-13014 > URL: https://issues.apache.org/jira/browse/SPARK-13014 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xin Ren >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13015) Replace example code in mllib-data-types.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13015: -- Shepherd: Xusen Yin > Replace example code in mllib-data-types.md using include_example > - > > Key: SPARK-13015 > URL: https://issues.apache.org/jira/browse/SPARK-13015 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xin Ren >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13018) Replace example code in mllib-pmml-model-export.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13018: -- Shepherd: Xusen Yin > Replace example code in mllib-pmml-model-export.md using include_example > > > Key: SPARK-13018 > URL: https://issues.apache.org/jira/browse/SPARK-13018 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xin Ren >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13014) Replace example code in mllib-collaborative-filtering.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13014: -- Shepherd: Xusen Yin > Replace example code in mllib-collaborative-filtering.md using include_example > -- > > Key: SPARK-13014 > URL: https://issues.apache.org/jira/browse/SPARK-13014 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xin Ren >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13016) Replace example code in mllib-dimensionality-reduction.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13016: -- Assignee: Devaraj K > Replace example code in mllib-dimensionality-reduction.md using > include_example > --- > > Key: SPARK-13016 > URL: https://issues.apache.org/jira/browse/SPARK-13016 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Devaraj K >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13016) Replace example code in mllib-dimensionality-reduction.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13016: -- Shepherd: Xusen Yin > Replace example code in mllib-dimensionality-reduction.md using > include_example > --- > > Key: SPARK-13016 > URL: https://issues.apache.org/jira/browse/SPARK-13016 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13017) Replace example code in mllib-feature-extraction.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13017: -- Shepherd: Xusen Yin Assignee: Xin Ren > Replace example code in mllib-feature-extraction.md using include_example > - > > Key: SPARK-13017 > URL: https://issues.apache.org/jira/browse/SPARK-13017 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xin Ren >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13292) QuantileDiscretizer should take random seed in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13292: -- Description: SPARK-11515 for the Python API. > QuantileDiscretizer should take random seed in PySpark > -- > > Key: SPARK-13292 > URL: https://issues.apache.org/jira/browse/SPARK-13292 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Yu Ishikawa >Priority: Minor > > SPARK-11515 for the Python API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13292) QuantileDiscretizer should take random seed in PySpark
Xiangrui Meng created SPARK-13292: - Summary: QuantileDiscretizer should take random seed in PySpark Key: SPARK-13292 URL: https://issues.apache.org/jira/browse/SPARK-13292 Project: Spark Issue Type: New Feature Components: ML, PySpark Affects Versions: 2.0.0 Reporter: Xiangrui Meng Assignee: Yu Ishikawa Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11515) QuantileDiscretizer should take random seed
[ https://issues.apache.org/jira/browse/SPARK-11515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11515: -- Target Version/s: 2.0.0 > QuantileDiscretizer should take random seed > --- > > Key: SPARK-11515 > URL: https://issues.apache.org/jira/browse/SPARK-11515 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yu Ishikawa >Priority: Minor > Fix For: 2.0.0 > > > QuantileDiscretizer takes a random sample to select bins. It currently does > not specify a seed for the XORShiftRandom, but it should take a seed by > extending the HasSeed Param. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
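Once QuantileDiscretizer extends HasSeed, fixing the seed makes the sampled bin boundaries reproducible across runs. A usage sketch (setSeed is the parameter this issue adds):

{code}
import org.apache.spark.ml.feature.QuantileDiscretizer

// setSeed comes from the HasSeed param added by this issue; with a fixed seed
// the internal random sample, and therefore the bucket splits, are reproducible.
val discretizer = new QuantileDiscretizer()
  .setInputCol("hour")
  .setOutputCol("hourBucket")
  .setNumBuckets(3)
  .setSeed(42L)

// val model = discretizer.fit(df)  // df assumed to contain a numeric "hour" column
{code}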
[jira] [Updated] (SPARK-11515) QuantileDiscretizer should take random seed
[ https://issues.apache.org/jira/browse/SPARK-11515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11515: -- Assignee: Yu Ishikawa > QuantileDiscretizer should take random seed > --- > > Key: SPARK-11515 > URL: https://issues.apache.org/jira/browse/SPARK-11515 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yu Ishikawa >Priority: Minor > Fix For: 2.0.0 > > > QuantileDiscretizer takes a random sample to select bins. It currently does > not specify a seed for the XORShiftRandom, but it should take a seed by > extending the HasSeed Param. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11515) QuantileDiscretizer should take random seed
[ https://issues.apache.org/jira/browse/SPARK-11515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-11515. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 9535 [https://github.com/apache/spark/pull/9535] > QuantileDiscretizer should take random seed > --- > > Key: SPARK-11515 > URL: https://issues.apache.org/jira/browse/SPARK-11515 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > Fix For: 2.0.0 > > > QuantileDiscretizer takes a random sample to select bins. It currently does > not specify a seed for the XORShiftRandom, but it should take a seed by > extending the HasSeed Param. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13265) Refactoring of basic ML import/export for other file system besides HDFS
[ https://issues.apache.org/jira/browse/SPARK-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-13265. --- Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 11151 [https://github.com/apache/spark/pull/11151] > Refactoring of basic ML import/export for other file system besides HDFS > > > Key: SPARK-13265 > URL: https://issues.apache.org/jira/browse/SPARK-13265 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yu Ishikawa >Assignee: Yu Ishikawa > Fix For: 2.0.0, 1.6.1 > > > We can't save a model into other file system besides HDFS, for example Amazon > S3. Because the file system is fixed at Spark 1.6. > https://github.com/apache/spark/blob/v1.6.0/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L78 > When I tried to export a KMeans model into Amazon S3, I got the error. > {noformat} > scala> val kmeans = new KMeans().setK(2) > scala> val model = kmeans.fit(train) > scala> model.write.overwrite().save("s3n://test-bucket/tmp/test-kmeans/") > java.lang.IllegalArgumentException: Wrong FS: > s3n://test-bucket/tmp/test-kmeans, expected: > hdfs://ec2-54-248-42-97.ap-northeast-1.compute.amazonaws.c > om:9000 > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:590) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:170) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:803) > at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1332) > at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:80) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:36) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:41) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:45) > at $iwC$$iwC$$iwC$$iwC.(:47) > at $iwC$$iwC$$iwC.(:49) > at $iwC$$iwC.(:51) > at $iwC.(:53) > at (:55) > at .(:59) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346) > at > org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) > at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) > at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) > at > org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657) > at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665) > at > org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) > at > 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) > at > org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) > at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) > at org.apache.spark.repl.Main$.main(Main.scala:31) > at org.apache.spark.repl.Main.main(Main.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) > at
[jira] [Updated] (SPARK-13265) Refactoring of basic ML import/export for other file system besides HDFS
[ https://issues.apache.org/jira/browse/SPARK-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13265: -- Assignee: Yu Ishikawa > Refactoring of basic ML import/export for other file system besides HDFS > > > Key: SPARK-13265 > URL: https://issues.apache.org/jira/browse/SPARK-13265 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yu Ishikawa >Assignee: Yu Ishikawa > > We can't save a model into other file system besides HDFS, for example Amazon > S3. Because the file system is fixed at Spark 1.6. > https://github.com/apache/spark/blob/v1.6.0/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L78 > When I tried to export a KMeans model into Amazon S3, I got the error. > {noformat} > scala> val kmeans = new KMeans().setK(2) > scala> val model = kmeans.fit(train) > scala> model.write.overwrite().save("s3n://test-bucket/tmp/test-kmeans/") > java.lang.IllegalArgumentException: Wrong FS: > s3n://test-bucket/tmp/test-kmeans, expected: > hdfs://ec2-54-248-42-97.ap-northeast-1.compute.amazonaws.c > om:9000 > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:590) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:170) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:803) > at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1332) > at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:80) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:36) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:41) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:45) > at $iwC$$iwC$$iwC$$iwC.(:47) > at $iwC$$iwC$$iwC.(:49) > at $iwC$$iwC.(:51) > at $iwC.(:53) > at (:55) > at .(:59) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346) > at > org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) > at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) > at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) > at > org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657) > at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665) > at > org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) > at > scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) > at > org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) > at 
org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) > at org.apache.spark.repl.Main$.main(Main.scala:31) > at org.apache.spark.repl.Main.main(Main.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) > at org.apache.sp
[jira] [Updated] (SPARK-13265) Refactoring of basic ML import/export for other file system besides HDFS
[ https://issues.apache.org/jira/browse/SPARK-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13265: -- Target Version/s: 1.6.1, 2.0.0 > Refactoring of basic ML import/export for other file system besides HDFS > > > Key: SPARK-13265 > URL: https://issues.apache.org/jira/browse/SPARK-13265 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yu Ishikawa >Assignee: Yu Ishikawa > > We can't save a model into other file system besides HDFS, for example Amazon > S3. Because the file system is fixed at Spark 1.6. > https://github.com/apache/spark/blob/v1.6.0/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L78 > When I tried to export a KMeans model into Amazon S3, I got the error. > {noformat} > scala> val kmeans = new KMeans().setK(2) > scala> val model = kmeans.fit(train) > scala> model.write.overwrite().save("s3n://test-bucket/tmp/test-kmeans/") > java.lang.IllegalArgumentException: Wrong FS: > s3n://test-bucket/tmp/test-kmeans, expected: > hdfs://ec2-54-248-42-97.ap-northeast-1.compute.amazonaws.c > om:9000 > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:590) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:170) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:803) > at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1332) > at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:80) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:36) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:41) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:45) > at $iwC$$iwC$$iwC$$iwC.(:47) > at $iwC$$iwC$$iwC.(:49) > at $iwC$$iwC.(:51) > at $iwC.(:53) > at (:55) > at .(:59) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346) > at > org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) > at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) > at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) > at > org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657) > at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665) > at > org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) > at > scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) > at > org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) > 
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) > at org.apache.spark.repl.Main$.main(Main.scala:31) > at org.apache.spark.repl.Main.main(Main.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) > at org.
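The "Wrong FS: s3n://... expected: hdfs://..." failure above comes from resolving the save path against the cluster's default (HDFS) FileSystem. A minimal sketch of the general pattern such a fix uses; the helper name and exact flow are illustrative assumptions, not the merged ReadWrite.scala code:
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Resolve the FileSystem from the destination path itself, so that s3n://,
// file:// and other schemes work, instead of always using the default FS
// (which is what FileSystem.get(hadoopConf) effectively does).
def preparePath(pathStr: String, hadoopConf: Configuration, overwrite: Boolean): Path = {
  val path = new Path(pathStr)
  val fs = path.getFileSystem(hadoopConf)
  val qualified = fs.makeQualified(path)
  if (fs.exists(qualified)) {
    if (overwrite) fs.delete(qualified, true)
    else throw new java.io.IOException(s"Path $qualified already exists.")
  }
  qualified
}
{code}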
[jira] [Commented] (SPARK-13279) Scheduler does O(N^2) operation when adding a new task set (making it prohibitively slow for scheduling 200K tasks)
[ https://issues.apache.org/jira/browse/SPARK-13279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143629#comment-15143629 ] Apache Spark commented on SPARK-13279: -- User 'sitalkedia' has created a pull request for this issue: https://github.com/apache/spark/pull/11175 > Scheduler does O(N^2) operation when adding a new task set (making it > prohibitively slow for scheduling 200K tasks) > --- > > Key: SPARK-13279 > URL: https://issues.apache.org/jira/browse/SPARK-13279 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core >Affects Versions: 1.6.0 >Reporter: Sital Kedia > > For each task that the TaskSetManager adds, it iterates through the entire > list of existing tasks to check if it's there. As a result, scheduling a new > task set is O(N^2), which can be slow for large task sets. > This is a bug that was introduced by > https://github.com/apache/spark/commit/3535b91: that commit removed the > "!readding" condition from the if-statement, but since the re-adding > parameter defaulted to false, that commit should have removed the condition > check in the if-statement altogether. > - > We discovered this bug while running a large pipeline with 200k tasks, when > we found that the executors were not able to register with the driver because > the driver was stuck holding a global lock in the TaskSchedulerImpl.submitTasks > function for a long time (it wasn't deadlocked -- just taking a long time). > jstack of the driver - http://pastebin.com/m8CP6VMv > executor log - http://pastebin.com/2NPS1mXC > From the jstack I see that the thread handling the resource offer from > executors (dispatcher-event-loop-9) is blocked on a lock held by the thread > "dag-scheduler-event-loop", which is iterating over an entire ArrayBuffer > when adding pending tasks. So when we have 200k pending tasks, because of > this O(N^2) operation, the driver is just hung for more than 5 minutes. > Solution - In the addPendingTask function, we don't really need a duplicate > check. It's okay if we add a task to the same queue twice because > dequeueTaskFromList will skip already-running tasks. > Please note that this is a regression from Spark 1.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
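To make the complexity issue concrete, here is a simplified illustration of the pattern the ticket describes and the proposed fix; this is a sketch, not the actual TaskSetManager code:
{code}
import scala.collection.mutable.ArrayBuffer

// Old behaviour: a duplicate check that scans the whole buffer, so adding N
// tasks costs O(N^2) in total.
def addPendingTaskQuadratic(pending: ArrayBuffer[Int], index: Int): Unit = {
  if (!pending.contains(index)) {   // linear scan on every insertion
    pending += index
  }
}

// Proposed fix: skip the check. Duplicates are harmless because the dequeue
// side (dequeueTaskFromList) skips tasks that are already running.
def addPendingTask(pending: ArrayBuffer[Int], index: Int): Unit = {
  pending += index
}
{code}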
[jira] [Commented] (SPARK-13277) ANTLR ignores other rule using the USING keyword
[ https://issues.apache.org/jira/browse/SPARK-13277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143571#comment-15143571 ] Apache Spark commented on SPARK-13277: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/11174 > ANTLR ignores other rule using the USING keyword > > > Key: SPARK-13277 > URL: https://issues.apache.org/jira/browse/SPARK-13277 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Herman van Hovell >Assignee: Liang-Chi Hsieh >Priority: Minor > Fix For: 2.0.0 > > > ANTLR currently emits the following warning during compilation: > {noformat} > warning(200): org/apache/spark/sql/catalyst/parser/SparkSqlParser.g:938:7: > Decision can match input such as "KW_USING Identifier" using multiple > alternatives: 2, 3 > As a result, alternative(s) 3 were disabled for that input > {noformat} > This means that some of the functionality of the parser is disabled. This is > introduced by the migration of the DDLParsers > (https://github.com/apache/spark/pull/10723). We should be able to fix this > by introducing a syntactic predicate for USING. > cc [~viirya] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
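For readers unfamiliar with the grammar, the disabled alternative affects the data source USING clause that the DDL migration moved into the main parser. Purely for illustration (the table name and path are made up), this is the kind of statement that exercises that clause:
{code}
// Data source DDL with a USING <identifier> clause, as handled by the
// migrated DDL rules in the new parser.
sqlContext.sql(
  """CREATE TEMPORARY TABLE points
    |USING org.apache.spark.sql.parquet
    |OPTIONS (path '/tmp/points')""".stripMargin)
{code}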
[jira] [Created] (SPARK-13291) Numerical models should preserve label attributes
Piotr Smolinski created SPARK-13291: --- Summary: Numerical models should preserve label attributes Key: SPARK-13291 URL: https://issues.apache.org/jira/browse/SPARK-13291 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.6.0 Reporter: Piotr Smolinski Priority: Minor I tried building a simple pipeline for Random Forest classification. The predictors are a mix of doubles, ints and strings, and the response is a string, so RFormula seems to be a perfect candidate. RFormulaModel produces a *labelCol* column with StringIndexer-derived metadata, and RandomForestClassificationModel converts the *featuresCol* to *predictionCol*. The problem is that there is no way to convert the *predictionCol* (which is a factor index) back to the label: the metadata created by StringIndexer is lost. Numerical models should create the *predictionCol* column with the metadata seen on the *labelCol* column during model fitting. Preserving the metadata would, for example, allow pipelining RFormula, RandomForestClassifier and IndexToString. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
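To make the request concrete, a sketch of the pipeline being described; column names and the {{training}} DataFrame are assumptions for illustration, not code from the report:
{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, RFormula}

// training: a DataFrame with numeric/string predictors and a string "response"
// column (assumed to exist).
val formula = new RFormula()
  .setFormula("response ~ .")
  .setFeaturesCol("features")
  .setLabelCol("label")

val rf = new RandomForestClassifier()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setPredictionCol("prediction")

// IndexToString normally reads labels from the input column's metadata.
// Because "prediction" does not carry the metadata that "label" had, the
// labels currently have to be supplied by hand via setLabels(...); preserving
// the metadata would make this stage work as written.
val toLabel = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedResponse")

val model = new Pipeline()
  .setStages(Array(formula, rf, toLabel))
  .fit(training)
{code}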
[jira] [Created] (SPARK-13290) wholeTextFile and binaryFiles are really slow
mathieu longtin created SPARK-13290: --- Summary: wholeTextFile and binaryFiles are really slow Key: SPARK-13290 URL: https://issues.apache.org/jira/browse/SPARK-13290 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.6.0 Environment: Linux stand-alone Reporter: mathieu longtin Reading biggish files (175MB) with wholeTextFile or binaryFiles is extremely slow. It takes over 3 minutes when Spark's JVM side does the reading versus 2.5 seconds reading the same file directly in Python. The Java process balloons to 4.3GB of memory and uses 100% CPU the whole time. I suspect Spark reads the file in small chunks and assembles them at the end, hence the large amount of CPU.
{code}
In [49]: rdd = sc.binaryFiles(pathToOneFile)

In [50]: %time path, text = rdd.first()
CPU times: user 1.91 s, sys: 1.13 s, total: 3.04 s
Wall time: 3min 32s

In [51]: len(text)
Out[51]: 191376122

In [52]: %time text = open(pathToOneFile).read()
CPU times: user 8 ms, sys: 691 ms, total: 699 ms
Wall time: 2.43 s

In [53]: len(text)
Out[53]: 191376122
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
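As a possible workaround (an untested sketch, not a confirmed fix for this issue), each file can be read in a single streaming pass through the Hadoop FileSystem API inside a task, bypassing sc.binaryFiles; the paths must be visible from the executors and any credentials must be available in the Hadoop configuration:
{code}
import java.io.ByteArrayOutputStream

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.IOUtils
import org.apache.spark.SparkContext

// Read whole files with one streaming copy per task instead of sc.binaryFiles.
def readWholeFiles(sc: SparkContext, paths: Seq[String]) =
  sc.parallelize(paths, paths.size).map { p =>
    val path = new Path(p)
    val fs = path.getFileSystem(new Configuration())
    val in = fs.open(path)
    val out = new ByteArrayOutputStream()
    try {
      IOUtils.copyBytes(in, out, 1 << 20, false) // 1 MB copy buffer, keep streams open
    } finally {
      in.close()
    }
    (p, out.toByteArray)
  }
{code}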
[jira] [Comment Edited] (SPARK-8162) Run spark-shell cause NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143513#comment-15143513 ] Matthew Campbell edited comment on SPARK-8162 at 2/11/16 9:14 PM: -- I'm also running into this problem with the latest spark-1.6.0-bin-hadoop2.6 on a Windows 7 machine. was (Author: mtthwcmpbll): I'm also running into this problem with the latest spark-1.6.0-bin-hadoop2.6. > Run spark-shell cause NullPointerException > -- > > Key: SPARK-8162 > URL: https://issues.apache.org/jira/browse/SPARK-8162 > Project: Spark > Issue Type: Bug > Components: Build, Spark Shell >Affects Versions: 1.4.1, 1.5.0 >Reporter: Weizhong >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.4.1, 1.5.0 > > > run spark-shell on latest master branch, then failed, details are: > {noformat} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.5.0-SNAPSHOT > /_/ > Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40) > Type in expressions to have them evaluated. > Type :help for more information. > error: error while loading JobProgressListener, Missing dependency 'bad > symbolic reference. A signature in JobProgressListener.class refers to term > annotations > in package com.google.common which is not available. > It may be completely missing from the current classpath, or the version on > the classpath might be incompatible with the version used when compiling > JobProgressListener.class.', required by > /opt/apache/spark/lib/spark-assembly-1.5.0-SNAPSHOT-hadoop2.7.0.jar(org/apache/spark/ui/jobs/JobProgressListener.class) > java.lang.NullPointerException > at org.apache.spark.sql.SQLContext.(SQLContext.scala:193) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:68) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at > org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028) > at $iwC$$iwC.(:9) > at $iwC.(:18) > at (:20) > at .(:24) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) > at > org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) > at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) > at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) > at > org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:130) > at > org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122) > at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324) > at > 
org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122) > at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974) > at > org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:157) > at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64) > at > org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106) > at > org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) > at > org.apa
[jira] [Commented] (SPARK-8162) Run spark-shell cause NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143513#comment-15143513 ] Matthew Campbell commented on SPARK-8162: - I'm also running into this problem with the latest spark-1.6.0-bin-hadoop2.6. > Run spark-shell cause NullPointerException > -- > > Key: SPARK-8162 > URL: https://issues.apache.org/jira/browse/SPARK-8162 > Project: Spark > Issue Type: Bug > Components: Build, Spark Shell >Affects Versions: 1.4.1, 1.5.0 >Reporter: Weizhong >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.4.1, 1.5.0 > > > run spark-shell on latest master branch, then failed, details are: > {noformat} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.5.0-SNAPSHOT > /_/ > Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40) > Type in expressions to have them evaluated. > Type :help for more information. > error: error while loading JobProgressListener, Missing dependency 'bad > symbolic reference. A signature in JobProgressListener.class refers to term > annotations > in package com.google.common which is not available. > It may be completely missing from the current classpath, or the version on > the classpath might be incompatible with the version used when compiling > JobProgressListener.class.', required by > /opt/apache/spark/lib/spark-assembly-1.5.0-SNAPSHOT-hadoop2.7.0.jar(org/apache/spark/ui/jobs/JobProgressListener.class) > java.lang.NullPointerException > at org.apache.spark.sql.SQLContext.(SQLContext.scala:193) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:68) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at > org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028) > at $iwC$$iwC.(:9) > at $iwC.(:18) > at (:20) > at .(:24) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) > at > org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) > at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) > at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) > at > org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:130) > at > org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122) > at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324) > at > org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122) > at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64) 
> at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974) > at > org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:157) > at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64) > at > org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106) > at > org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) > at > scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(
[jira] [Created] (SPARK-13289) Word2Vec generates infinite distances when numIterations > 5
Qi Dai created SPARK-13289: -- Summary: Word2Vec generates infinite distances when numIterations > 5 Key: SPARK-13289 URL: https://issues.apache.org/jira/browse/SPARK-13289 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.6.0 Environment: Linux, Scala Reporter: Qi Dai I recently ran some word2vec experiments on a cluster with 50 executors on a large text dataset and found that when the number of iterations is larger than 5, the distances between words are all infinite. My code looks like this:
{code}
val text = sc.textFile("/project/NLP/1_biliion_words/train").map(_.split(" ").toSeq)
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
val word2vec = new Word2Vec().setMinCount(25).setVectorSize(96).setNumPartitions(99).setNumIterations(10).setWindowSize(5)
val model = word2vec.fit(text)
val synonyms = model.findSynonyms("who", 40)
for((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}
{code}
The results are:
{noformat}
to Infinity
and Infinity
that Infinity
with Infinity
said Infinity
it Infinity
by Infinity
be Infinity
have Infinity
he Infinity
has Infinity
his Infinity
an Infinity
) Infinity
not Infinity
who Infinity
I Infinity
had Infinity
their Infinity
were Infinity
they Infinity
but Infinity
been Infinity
{noformat}
I tried many different datasets and different words for finding synonyms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
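No root cause is given in the report. Purely as an experiment (an assumption, not a confirmed fix), one knob worth trying when raising numIterations is the starting learning rate, since the rate schedule stretches over more iterations; 0.025 below is assumed to be the default starting rate:
{code}
import org.apache.spark.mllib.feature.Word2Vec

val word2vec = new Word2Vec()
  .setMinCount(25)
  .setVectorSize(96)
  .setNumPartitions(99)
  .setNumIterations(10)
  .setWindowSize(5)
  .setLearningRate(0.025 * 5.0 / 10.0) // shrink roughly in proportion to the extra iterations

val model = word2vec.fit(text) // text: RDD[Seq[String]], as in the report
{code}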
[jira] [Created] (SPARK-13288) [1.6.0] Memory leak in Spark streaming
JESSE CHEN created SPARK-13288: -- Summary: [1.6.0] Memory leak in Spark streaming Key: SPARK-13288 URL: https://issues.apache.org/jira/browse/SPARK-13288 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.6.0 Environment: Bare metal cluster RHEL 6.6 Reporter: JESSE CHEN Streaming in 1.6 seems to have a memory leak. Running the same streaming app in Spark 1.5.1 and 1.6, all things equal, 1.6 showed gradually increasing processing time. The app is simple: 1 Kafka receiver of a tweet stream and 20 executors processing the tweets in 5-second batches. Spark 1.5.1 handled this smoothly and did not show increasing processing time in the 40-minute test; 1.6 showed increasing time about 8 minutes into the test. Please see the chart here: https://ibm.box.com/s/7q4ulik70iwtvyfhoj1dcl4nc469b116 I captured heap dumps in the two versions and did a comparison. I noticed that byte arrays ([B) use about 50X more space in 1.6.0. Here are some of the top classes in the heap histogram and references:
{noformat}
Heap Histogram - All Classes (excluding platform)

1.6.0 Streaming                                           1.5.1 Streaming
Class                           Count   Total Size        Class                   Count    Total Size
class [B                        8453    3,227,649,599     class [B                5095     62,938,466
class [C                        44682   4,255,502         class [C                130482   12,844,182
class java.lang.reflect.Method  9059    1,177,670         class java.lang.String  130171   1,562,052

References by Type
1.6.0 Streaming: class [B [0x640039e38]                   1.5.1 Streaming: class [B [0x6c020bb08]

Referrers by Type
1.6.0 Streaming                        Count              1.5.1 Streaming                        Count
java.nio.HeapByteBuffer                3239               sun.security.util.DerInputBuffer       1233
sun.security.util.DerInputBuffer       1233               sun.security.util.ObjectIdentifier     620
sun.security.util.ObjectIdentifier     620                [[B                                    397
[Ljava.lang.Object;                    408                java.lang.reflect.Method               326
{noformat}
The total size of class [B is about 3GB in 1.6.0 and only about 60MB in 1.5.1. The java.nio.HeapByteBuffer referrer did not show up near the top in 1.5.1. I have also placed jstack output for 1.5.1 and 1.6.0 online; you can get them here: https://ibm.box.com/sparkstreaming-jstack160 https://ibm.box.com/sparkstreaming-jstack151 Jesse -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
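For reference, a minimal sketch of the kind of receiver-based app described (one Kafka receiver, 5-second batches); the broker, group, topic names and the per-batch work are placeholders, not the reporter's actual job:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object TweetStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("tweet-stream")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second batches, as in the report

    // Single receiver-based Kafka stream (spark-streaming-kafka, Kafka 0.8 API)
    val tweets = KafkaUtils
      .createStream(ssc, "zk-host:2181", "tweet-group", Map("tweets" -> 1))
      .map(_._2)

    // Placeholder per-batch work
    tweets.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}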
[jira] [Created] (SPARK-13287) Standalone REST API throttling?
Rares Vernica created SPARK-13287: - Summary: Standalone REST API throttling? Key: SPARK-13287 URL: https://issues.apache.org/jira/browse/SPARK-13287 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.6.0 Reporter: Rares Vernica Priority: Minor I am using the REST API provided by Spark Standalone mode to check on jobs. It turns out that if I don't pause between requests, the server redirects me to the server homepage instead of returning the requested information. Here is a simple test to prove this:
{code:bash}
$ curl --silent http://localhost:8080/api/v1/applications/app-20160211003526-0037/jobs | head -2 ; curl --silent http://localhost:8080/api/v1/applications/app-20160211003526-0037/jobs | head -2
[ {
  "jobId" : 0,
{code}
I am requesting the same information about one application twice using {{curl}}. I print the first two lines from each response. The requests are made immediately one after another. The first two lines are from the first request, the last two lines are from the second request. Again, the request URLs are identical. The response from the second request is identical to the homepage you get from http://localhost:8080/ If I insert a {{sleep 1}} between the two {{curl}} commands, both work fine. For shorter pauses, like {{sleep .8}}, it does not work correctly. I am not sure if this is intentional or a bug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
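A small sketch of the workaround implied above: polling the REST endpoint with a pause between requests. The application id is the one from the report and serves only as a placeholder:
{code}
import scala.io.Source

object PollJobs {
  def main(args: Array[String]): Unit = {
    val url = "http://localhost:8080/api/v1/applications/app-20160211003526-0037/jobs"
    for (_ <- 1 to 5) {
      val body = Source.fromURL(url).mkString
      println(body.split("\n").take(2).mkString(" "))
      Thread.sleep(1000) // without roughly a 1-second pause the next request returns the UI homepage
    }
  }
}
{code}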
[jira] [Resolved] (SPARK-12982) SQLContext: temporary table registration does not accept valid identifier
[ https://issues.apache.org/jira/browse/SPARK-12982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-12982. --- Resolution: Resolved Assignee: Jayadevan M Fix Version/s: 2.0.0 > SQLContext: temporary table registration does not accept valid identifier > - > > Key: SPARK-12982 > URL: https://issues.apache.org/jira/browse/SPARK-12982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Grzegorz Chilkiewicz >Assignee: Jayadevan M >Priority: Minor > Labels: sql > Fix For: 2.0.0 > > > We have encountered very strange behavior of SparkSQL temporary table > registration. > What identifiers for temporary table should be valid? > Alphanumerical + '_' with at least one non-digit? > Valid identifiers: > df > 674123a > 674123_ > a0e97c59_4445_479d_a7ef_d770e3874123 > 1ae97c59_4445_479d_a7ef_d770e3874123 > Invalid identifier: > 10e97c59_4445_479d_a7ef_d770e3874123 > Stack trace: > {code:xml} > java.lang.RuntimeException: [1.1] failure: identifier expected > 10e97c59_4445_479d_a7ef_d770e3874123 > ^ > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827) > at org.apache.spark.sql.SQLContext.dropTempTable(SQLContext.scala:763) > at > SparkSqlContextTempTableIdentifier$.identifierCheck(SparkSqlContextTempTableIdentifier.scala:9) > at > SparkSqlContextTempTableIdentifier$.main(SparkSqlContextTempTableIdentifier.scala:42) > at > SparkSqlContextTempTableIdentifier.main(SparkSqlContextTempTableIdentifier.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at sbt.Run.invokeMain(Run.scala:67) > at sbt.Run.run0(Run.scala:61) > at sbt.Run.sbt$Run$$execute$1(Run.scala:51) > at sbt.Run$$anonfun$run$1.apply$mcV$sp(Run.scala:55) > at sbt.Run$$anonfun$run$1.apply(Run.scala:55) > at sbt.Run$$anonfun$run$1.apply(Run.scala:55) > at sbt.Logger$$anon$4.apply(Logger.scala:85) > at sbt.TrapExit$App.run(TrapExit.scala:248) > at java.lang.Thread.run(Thread.java:745) > {code} > Code to reproduce this bug: > https://github.com/grzegorz-chilkiewicz/SparkSqlContextTempTableIdentifier -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
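A spark-shell sketch of the failure mode, using the identifiers from the report (the range DataFrame is just a stand-in). Registration itself appears to succeed because the name is stored as given; the error only surfaces when dropTempTable parses the name again, where a leading "10e97..." likely lexes as a numeric literal rather than an identifier:
{code}
// sqlContext is the SQLContext provided by spark-shell
val df = sqlContext.range(10)

// Works: the identifier cannot be read as a number
df.registerTempTable("a0e97c59_4445_479d_a7ef_d770e3874123")
sqlContext.dropTempTable("a0e97c59_4445_479d_a7ef_d770e3874123")

// Registers fine, but dropping it fails with "failure: identifier expected"
df.registerTempTable("10e97c59_4445_479d_a7ef_d770e3874123")
sqlContext.dropTempTable("10e97c59_4445_479d_a7ef_d770e3874123")
{code}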