[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

thunterdb Mon, 27 Mar 2017 17:19:06 -0700

Github user thunterdb commented on the issue:

    https://github.com/apache/spark/pull/17419
  
    I have added a small perf test to find the performance bottlenecks. Note
    that this test exercises the worst case in terms of overhead (vectors of
    size 1). Here are the numbers I currently get. I will profile the code to
    see if there are any obvious targets for optimization:
    
    ```
    [min ~ median ~ max], higher is better:
    
    RDD = [2482 ~ 46150 ~ 48354] records / milli
    dataframe (variance only) = [4217 ~ 4557 ~ 4848] records / milli
    dataframe = [2887 ~ 4420 ~ 4717] records / milli
    ```
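
For readers who want to produce numbers in the same `[min ~ median ~ max]` shape, here is a minimal sketch of a throughput harness. It is not the perf test from the pull request: it uses only the Python standard library, and the names `throughput_summary`, `format_summary`, and `variance_only`, together with the 10,000-record workload, are assumptions made for illustration.

```python
import statistics
import time


def throughput_summary(run_once, num_records, num_trials=5):
    """Time run_once() num_trials times and return (min, median, max)
    throughput in records per millisecond."""
    rates = []
    for _ in range(num_trials):
        start = time.perf_counter()
        run_once()
        # Guard against a zero reading on very coarse clocks.
        elapsed_ms = max((time.perf_counter() - start) * 1000.0, 1e-9)
        rates.append(num_records / elapsed_ms)
    return min(rates), statistics.median(rates), max(rates)


def format_summary(label, summary):
    """Render a summary in the '[min ~ median ~ max]' layout used above."""
    lo, med, hi = summary
    return f"{label} = [{lo:.0f} ~ {med:.0f} ~ {hi:.0f}] records / milli"


# Hypothetical workload: one-pass variance over vectors of size 1.
records = [[float(i)] for i in range(10_000)]


def variance_only():
    n = 0
    total = 0.0
    total_sq = 0.0
    for vec in records:
        x = vec[0]
        n += 1
        total += x
        total_sq += x * x
    mean = total / n
    return total_sq / n - mean * mean


if __name__ == "__main__":
    summary = throughput_summary(variance_only, len(records))
    print(format_summary("plain Python (variance only)", summary))
```

Reporting the median next to the minimum and maximum keeps a single slow trial from dominating the headline figure.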



