[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278148#comment-16278148 ] zhengruifeng commented on SPARK-19634: -- I think we can now use the new summarizer in the algs. > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Timothy Hunter > Fix For: 2.3.0 > > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16163106#comment-16163106 ] Seth Hendrickson commented on SPARK-19634: -- Is there a plan for moving the linear algorithms that use the summarizer to this new implementation? > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Timothy Hunter > Fix For: 2.3.0 > > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156936#comment-16156936 ] Apache Spark commented on SPARK-19634: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/19156 > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Timothy Hunter > Fix For: 2.3.0 > > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16111585#comment-16111585 ] Hyukjin Kwon commented on SPARK-19634: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/18798 > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944215#comment-15944215 ] Timothy Hunter commented on SPARK-19634: [~sethah], yes, thanks for bringing up these concerns. Regarding the first points, the UDAF interface does not let you update arrays in place, which is a non-starter in our case. This is why the implementation switches to TIA. I have updated the design doc with these comments. Regarding the performance, I agree that there is a tension between having an API that is compatible with structured streaming and the current, RDD-based implementation. I will provide some test numbers so that we have a basis for discussion. That being said, the RDD API is not going away, so if users care about performance and do not need the additional benefit of integrating with SQL or structured streaming, they can still use it. > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944030#comment-15944030 ] Seth Hendrickson commented on SPARK-19634: -- I'm coming to this a bit late, but I'm finding things a bit hard to follow. Reading the design doc, it seems that the original plan was to implement two interfaces - an RDD one that provides the same performance as current {{MultivariateOnlineSummarizer}} and a data frame interface using UDAF. from design doc: "...In the meantime, there will be a (possibly faster) RDD interface and a (more flexible) Dataframe interface." Now, the PR for this uses {{TypedImperativeAggregate}}. I understand that the it was pivoted away from UDAF, but the design doc does not reflect that. Also, if there is to be an RDD interface, what is the JIRA for it and what will it look like? Also, there are several concerns raised in the design doc about this Catalyst aggregate approach being less efficient, and the consensus seemed to be: provide an initial API with a "slow" implementation that will be improved upon in the future. Is that correct? I'm not that familiar with the Catalyst optimizer, but are we sure there is a good way to implement the tree-reduce type aggregation, and if so could we document that? If this is still targeted at 2.2, why? I'd prefer to get the details hashed out further rather than rushing to provide an API and initial slow implementation, that way we can make sure that we get this correct in the long-term. I really appreciate some clarification and my apologies if I have missed any of the details/discussion. > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944019#comment-15944019 ] Timothy Hunter commented on SPARK-19634: [~dongjin] [~wm624] sorry it looks like I missed your comments... I pushed a PR for this feature. Please feel free to comment on the PR if you have the time. > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941322#comment-15941322 ] Apache Spark commented on SPARK-19634: -- User 'thunterdb' has created a pull request for this issue: https://github.com/apache/spark/pull/17419 > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935304#comment-15935304 ] Miao Wang commented on SPARK-19634: --- Comments never come to email box. [~timhunter] I can continue with your code. Let me check it out. Thanks! > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15923770#comment-15923770 ] Lee Dongjin commented on SPARK-19634: - [~timhunter] // It seems like I can help. Please assign it to me. > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15923119#comment-15923119 ] Timothy Hunter commented on SPARK-19634: I was not able to finish it in time, but the bulk of the code is in this branch: https://github.com/apache/spark/compare/master...thunterdb:19634?expand=1 Note that it currently includes a (non-working) UDAF and an incomplete TypedImperativeAggregate. It turns out that UDAF interface is not suited for this sort of aggregators, which I realized quite late. I started to refactor my code to use TypedImperativeAggregate, but did not have to finish it. If someone wants to pick up this task, he or she is welcome to do it. > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889181#comment-15889181 ] Joseph K. Bradley commented on SPARK-19634: --- I'll assign this to [~timhunter] given the time pressure for 2.2 > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15886900#comment-15886900 ] Timothy Hunter commented on SPARK-19634: [~wm624] were you able to start to work on this task? I have some time now and I can work on it. > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880628#comment-15880628 ] Nick Pentreath commented on SPARK-19634: Ah I see it was discussed in the design doc - will go through it more detail and check the comments. > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880623#comment-15880623 ] Nick Pentreath commented on SPARK-19634: Thanks [~timhunter]. In terms of performance, we expect to gain from (a) not computing unnecessary metrics or values (saving mainly in memory usage for the intermediate arrays created, potentially some computation saving); and (b) using UDAF. Do we expect a large gain from using UDAF? I'm not totally up to date on the current state of UDAF integration into working with Tungsten data, but my last impression was that (a) UDAFs didn't really offer this unless they're internal (like HyperLogLog) and (b) array storage & SerDe in Tungsten was still a bit patchy. Has this changed? Of course in terms of API it is beneficial and we should do it anyway under the assumption that performance is at least the same as the current implementation. I just want to understand the expected performance gains since the implicit assumption is always "DataFrame operations will be so much faster" but in practice this is not always the case for more complex data types & situations, and things that switch into RDDs anyway under the hood such as in the linear models cases... > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870825#comment-15870825 ] Miao Wang commented on SPARK-19634: --- I can give a try. Thanks! Miao > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org