[jira] [Commented] (MADLIB-1168) Balance datasets

ASF GitHub Bot (JIRA) Thu, 19 Apr 2018 11:42:12 -0700

    [ 
https://issues.apache.org/jira/browse/MADLIB-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444594#comment-16444594
 ]


ASF GitHub Bot commented on MADLIB-1168:
----------------------------------------

Github user jingyimei commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/265#discussion_r182843994
  
    --- Diff: RELEASE_NOTES ---
    @@ -9,6 +9,56 @@ commit history located at https://github.com/apache/madlib/commits/master.
     
     Current list of bugs and issues can be found at https://issues.apache.org/jira/browse/MADLIB.
     --------------------------------------------------------------------------
    +MADlib v1.14:
    +
    +Release Date: 2018-April-28
    +
    +New features:
    +* New module - Balanced datasets: A sampling module to balance classification
    +    datasets by resampling using various techniques including undersampling,
    +    oversampling, uniform sampling or user-defined proportion sampling
    +    (MADLIB-1168)
    +* Mini-batch: Added a mini-batch optimizer for MLP and a preprocessor function
    +    necessary to create batches from the data (MADLIB-1200, MADLIB-1206)
    +* k-NN: Added weighted averaging/voting by distance (MADLIB-1181)
    +* Summary: Added additional stats: number of positive, negative, zero values and
    +    95% confidence intervals for the mean (MADLIB-1167)
    +* Encode categorical: Updated to produce lower-case column names when possible
    +    (MADLIB-1202)
    +* MLP: Added support for already one-hot encoded categorical dependent variable
    +    in a classification task (MADLIB-1222)
    +* Pagerank: Added option for personalized vertices that allows higher weightage
    +    for a subset of vertices which will have a higher jump probability as
    +    compared to other vertices and a random surfer is more likely to
    +    jump to these personalization vertices (MADLIB-1084)
    +
    +Bug fixes:
    +    - Fixed issue with invalid calls of construct_array that led to problems
    +    in Postgresql 10 (MADLIB-1185)
    +    - Added newline between file concatenation during PGXN install (MADLIB-1194)
    +    - Fixed upgrade issues in knn (MADLIB-1197)
    +    - Added fix to ensure RF variable importance are always non-negative
    +    - Fixed inconsistency in LDA output and improved usability
    +        (MADLIB-1160, MADLIB-1201)
    +    - Fixed MLP and RF predict for models trained in earlier versions to
    +        ensure missing optional parameters are given appropriate default values
    +        (MADLIB-1207)
    +    - Fixed a scenario in DT where no features exist due to categorical columns
    +        with single level being dropped led to the database crashing
    +    - Fixed step size initialization in MLP based on learning rate policy
    +        (MADLIB-1212)
    +    - Fixed PCA issue that leads to failure when grouping column is a TEXT type
    +        (MADLIB-1215)
    +    - Fixed cat levels output in DT when grouping is enabled (MADLIB-1218)
    +    - Fixed and simplified initialization of model coefficients in MLP
    +    - Removed source table dependency for predicting regression models in MLP
    +        (MADLIB-1223)
    +    - Print loss of first iteration in MLP (MADLIB-1228)
    +
    --- End diff --
    
    We should also mention the neural-net-related bug fix, MADLIB-1209.


> Balance datasets
> ----------------
>
>                 Key: MADLIB-1168
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1168
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Sampling
>            Reporter: Frank McQuillan
>            Assignee: ssoni
>            Priority: Major
>             Fix For: v1.14
>
>         Attachments: MADlib Balance Datasets Requirements.pdf, 
> MADlib_Balance_Datasets_Requirements_v2.pdf
>
>
> From [1] here is the motivation behind balancing datasets:
> “Most classification algorithms will only perform optimally when the number 
> of samples of each class is roughly the same. Highly skewed datasets, where 
> the minority is heavily outnumbered by one or more classes, have proven to be 
> a challenge while at the same time becoming more and more common.
> One way of addressing this issue is by re-sampling the dataset as to offset 
> this imbalance with the hope of arriving at a more robust and fair decision 
> boundary than you would otherwise.
> Re-sampling techniques can be divided in these categories:
> * Under-sampling the majority class(es).
> * Over-sampling the minority class.
> * Combining over- and under-sampling.
> * Create ensemble balanced sets.”
> There is an extensive literature on balancing datasets.  The plan for MADlib 
> in the initial phase is to offer basic functionality that can be extended in 
> later phases based on feedback from users.  
> Please see attached document for proposed scope of this story.
> References
> [1] imbalanced-learn Python project
> http://contrib.scikit-learn.org/imbalanced-learn/stable/index.html
> https://github.com/scikit-learn-contrib/imbalanced-learn
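
The under-sampling and over-sampling categories listed in the issue description can be illustrated with a minimal sketch. This is plain Python written for illustration only; it is not the MADlib module or its SQL interface, and the function name `balance`, its parameters, and the strategy names are hypothetical. It assumes the whole dataset fits in memory and that every class should end up with the same number of rows.

```python
import random
from collections import defaultdict


def balance(rows, label_of, strategy="undersample", seed=0):
    """Resample rows so that every class has the same number of rows.

    Hypothetical helper for illustration; not part of MADlib.
    rows     -- list of records
    label_of -- function returning the class label of a record
    strategy -- "undersample" shrinks every class to the smallest class size;
                "oversample" grows every class to the largest class size
    """
    if strategy not in ("undersample", "oversample"):
        raise ValueError("unknown strategy: %r" % (strategy,))
    if not rows:
        return []

    rng = random.Random(seed)

    # Group the records by class label.
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    sizes = [len(members) for members in by_class.values()]
    target = min(sizes) if strategy == "undersample" else max(sizes)

    balanced = []
    for members in by_class.values():
        if len(members) >= target:
            # Under-sampling: draw without replacement down to the target.
            balanced.extend(rng.sample(members, target))
        else:
            # Over-sampling: keep every original row, then draw the
            # shortfall with replacement from the same class.
            balanced.extend(members)
            balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced
```

For a dataset with 90 rows of one class and 10 of another, `balance(rows, label_of, "undersample")` returns 20 rows and `balance(rows, label_of, "oversample")` returns 180 rows, with both classes equally represented in each case. The remaining options named in the release notes (uniform sampling and user-defined proportions) would need a per-class target instead of a single shared one.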



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
