GitHub user thunterdb opened a pull request:

    https://github.com/apache/spark/pull/17419

    [SPARK-19634][ML][WIP] Multivariate summarizer - dataframes API

    ## What changes were proposed in this pull request?
    
    This patch adds the DataFrames API to the multivariate summarizer (mean, 
variance, etc.). In addition to all the features of 
`MultivariateOnlineSummarizer`, it also allows the user to select a subset of 
the metrics. This should resolve some performance issues related to computing 
unrequested metrics.
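The idea of computing only the requested metrics can be illustrated with a minimal sketch. It is written in plain Python rather than against the Spark API, and the class name `SelectiveSummarizer`, its methods, and the metric names are hypothetical; they are not taken from this patch. The running mean and variance use Welford's online update, which is an assumption about a reasonable approach and not a description of the code in the pull request.

```python
from typing import Dict, Iterable, List, Sequence

# Hypothetical metric names, chosen for illustration only.
SUPPORTED = {"count", "mean", "variance", "min", "max"}


class SelectiveSummarizer:
    """Streaming column-wise summary that keeps state only for requested metrics."""

    def __init__(self, metrics: Iterable[str]) -> None:
        self.metrics = set(metrics)
        unknown = self.metrics - SUPPORTED
        if unknown:
            raise ValueError(f"unsupported metrics: {sorted(unknown)}")
        self.n = 0
        # The mean is needed for both "mean" and "variance".
        self._need_mean = bool(self.metrics & {"mean", "variance"})
        self._need_m2 = "variance" in self.metrics
        self._mean: List[float] = []
        self._m2: List[float] = []
        self._min: List[float] = []
        self._max: List[float] = []

    def add(self, row: Sequence[float]) -> "SelectiveSummarizer":
        if self.n == 0:
            dim = len(row)
            self._mean = [0.0] * dim
            self._m2 = [0.0] * dim
            self._min = [float("inf")] * dim
            self._max = [float("-inf")] * dim
        self.n += 1
        for i, x in enumerate(row):
            if self._need_mean:
                # Welford's update: numerically stable running mean and M2.
                delta = x - self._mean[i]
                self._mean[i] += delta / self.n
                if self._need_m2:
                    self._m2[i] += delta * (x - self._mean[i])
            if "min" in self.metrics:
                self._min[i] = min(self._min[i], x)
            if "max" in self.metrics:
                self._max[i] = max(self._max[i], x)
        return self

    def summary(self) -> Dict[str, object]:
        out: Dict[str, object] = {}
        if "count" in self.metrics:
            out["count"] = self.n
        if "mean" in self.metrics:
            out["mean"] = list(self._mean)
        if "variance" in self.metrics:
            # Unbiased sample variance; defined as 0.0 for fewer than two rows.
            d = self.n - 1
            out["variance"] = [m2 / d if d > 0 else 0.0 for m2 in self._m2]
        if "min" in self.metrics:
            out["min"] = list(self._min)
        if "max" in self.metrics:
            out["max"] = list(self._max)
        return out


s = SelectiveSummarizer(["mean", "variance"])
s.add([1.0, 2.0]).add([3.0, 6.0])
print(s.summary())
```

In this sketch, a caller who asks only for `min` and `max` never pays for the moment updates, which is the kind of saving the description refers to.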
    
    Furthermore, it uses the BLAS API to the extent possible, so that the given 
code should be efficient for the dense case.
    
    ## How was this patch tested?
    
    This patch includes most of the tests of the RDD-based summarizer. It compares 
results against the existing `MultivariateOnlineSummarizer` and adds more 
tests.
    
    This patch also includes some documentation for some low-level constructs 
such as `TypedImperativeAggregate`.
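As a rough illustration of the life cycle such an aggregate goes through, here is a minimal Python sketch. The method names mirror those declared by Spark's `TypedImperativeAggregate` (`createAggregationBuffer`, `update`, `merge`, `serialize`, `deserialize`, `eval`), but the `MeanAggregate` class and the `run_partitioned` driver are hypothetical stand-ins and are not part of Spark or of this patch.

```python
import struct
from typing import Iterable, List, Optional


class MeanAggregate:
    """Toy aggregate whose buffer is [count, sum]."""

    def create_aggregation_buffer(self) -> List[float]:
        return [0.0, 0.0]

    def update(self, buffer: List[float], value: float) -> List[float]:
        # Fold one input value into the partition-local buffer.
        buffer[0] += 1.0
        buffer[1] += value
        return buffer

    def merge(self, buffer: List[float], other: List[float]) -> List[float]:
        # Combine two partial buffers (e.g. from different partitions).
        buffer[0] += other[0]
        buffer[1] += other[1]
        return buffer

    def serialize(self, buffer: List[float]) -> bytes:
        # Buffers are turned into bytes before being shipped between stages.
        return struct.pack("<dd", buffer[0], buffer[1])

    def deserialize(self, data: bytes) -> List[float]:
        return list(struct.unpack("<dd", data))

    def eval(self, buffer: List[float]) -> Optional[float]:
        return buffer[1] / buffer[0] if buffer[0] else None


def run_partitioned(agg: MeanAggregate,
                    partitions: Iterable[Iterable[float]]) -> Optional[float]:
    """Hypothetical driver: update per partition, serialize, then merge and eval."""
    shipped = []
    for part in partitions:
        buf = agg.create_aggregation_buffer()
        for v in part:
            buf = agg.update(buf, v)
        shipped.append(agg.serialize(buf))
    final = agg.create_aggregation_buffer()
    for data in shipped:
        final = agg.merge(final, agg.deserialize(data))
    return agg.eval(final)


print(run_partitioned(MeanAggregate(), [[1.0, 2.0], [3.0], []]))
```

The serialize/deserialize round trip is the step that differs most from a plain in-memory fold, since the partial buffers must survive being moved between stages.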
    
    ## Performance
    
    I have not run performance tests against the existing implementation. However, 
this patch uses the recommended low-level SQL APIs, so it should be interesting to 
compare both implementations in that respect.
    
    ## WIP
    
    Marked as WIP because some debugging comments are still present in the code.
    
    Thanks to @hvanhovell and Cheng Liang for suggestions on SparkSQL.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/thunterdb/spark 19634

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17419.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17419
    
----
commit f3fa6580bca70f3307d70e938ef8531c928d958b
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-03T18:36:02Z

    work

commit 7539835dad863a6b73d88d79983342f9ddb7fb9d
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-06T22:38:41Z

    work on the test suite

commit 673943f334b94e5d1ecd8874cb82bbc875d739c6
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-07T00:01:30Z

    last work

commit 202b672afec127f4e0885cf3a58f4dfc97031fc6
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-13T22:48:47Z

    work on using imperative aggregators

commit be019813f241d0ad3559b4d84339f1bb1055cbc4
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-17T21:44:40Z

    Merge remote-tracking branch 'upstream/master' into 19634

commit a983284cfeddabd017792e3991cf99a7d3ab1e16
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-18T00:14:40Z

    more work on summarizer

commit 647a4fecb17d478c3c8cd68d40f2a9456eb10c66
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-21T17:47:30Z

    work

commit 3c4bef772a3cbc759e43223af658a357c5ca6bc2
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-21T18:54:16Z

    changes

commit 56390ccc456c67b2f7a08c1271fa50408518da0f
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-21T18:54:19Z

    Merge remote-tracking branch 'upstream/master' into 19634

commit c3f236c4422031ae818cb6bbec2415b3f1bf7b70
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-21T19:03:07Z

    cleanup

commit ef955c00275705f14342f3e4ed970a78f0f3c141
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-21T22:42:42Z

    debugging

commit a04f923913ca1118a61d66bd53b8514af62594d7
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-21T23:14:23Z

    work

commit 946d490c8b29e55ec0e6d40785122269063894ad
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-22T21:14:29Z

    Merge remote-tracking branch 'upstream/master' into 19634

commit 201eb7712054967cd5093d3a908f4ebbd73f30a8
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-22T21:19:57Z

    debug

commit f4dec88a49d0a20e1b328617fd721633fd8c201a
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-23T18:27:19Z

    trying to debug serialization issue

commit 4af0f47d326ef91d7cf9ccaf6a45ee3f904b191f
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-23T23:16:10Z

    better tests

commit 9f29030f75089884156bdc4ee634857b3730114d
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-24T00:12:28Z

    changes

commit e9877dc2f08d393f079bdf6fbbf1b9b9abaa21da
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-24T21:04:32Z

    debugging

commit 3a11d0265ef665a63cd070eeb1ae4ac25bc89908
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-24T22:14:06Z

    more tests and debugging

commit 6d26c17d0bd4ab18d564ee7f37916780211702d5
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-24T23:12:19Z

    fixed tests

commit 35eaeb0d02ae9cc29ae559231fe4858935315477
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-24T23:23:15Z

    doc

----


