[jira] [Commented] (SPARK-6509) MDLP discretizer

Barry Becker (JIRA) Mon, 22 Aug 2016 11:17:42 -0700

    [ https://issues.apache.org/jira/browse/SPARK-6509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431341#comment-15431341 ]


Barry Becker commented on SPARK-6509:
-------------------------------------

I may have missed the reasoning somewhere, but why was this marked wontfix? It 
seems like it would be a good addition.

> MDLP discretizer
> ----------------
>
>                 Key: SPARK-6509
>                 URL: https://issues.apache.org/jira/browse/SPARK-6509
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Sergio Ramírez
>
> Minimum Description Length Discretizer
> This method implements Fayyad's discretizer [1], based on the Minimum 
> Description Length Principle (MDLP), to handle non-discrete datasets from a 
> distributed perspective. We have developed a distributed version of the 
> original, with some important changes.
> -- Improvements on discretizer:
>     Support for sparse data.
>     Multi-attribute processing. The whole process is carried out in a single 
> step when the number of boundary points per attribute fits well in one 
> partition (<= 100K boundary points per attribute).
>     Support for attributes with a huge number of boundary points (> 100K 
> boundary points per attribute), which is a rare situation.
> This software has been tested with two large real-world datasets:
>     A dataset selected for the GECCO-2014 competition (Vancouver, July 13th, 
> 2014), which comes from the Protein Structure Prediction field 
> (http://cruncher.ncl.ac.uk/bdcomp/). The dataset has 32 million instances, 
> 631 attributes, 2 classes, and 98% negative examples, and occupies about 
> 56GB of disk space when uncompressed.
>     Epsilon dataset: 
> http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#epsilon. 
> 400K instances and 2K attributes
> We have demonstrated that our method runs 300 times faster than the 
> sequential version on the first dataset, and also improves the accuracy of 
> Naive Bayes.
> Publication: S. Ramírez-Gallego, S. García, H. Mouriño-Talin, D. 
> Martínez-Rego, V. Bolón, A. Alonso-Betanzos, J.M. Benitez, F. Herrera. "Data 
> Discretization: Taxonomy and Big Data Challenge", WIRES Data Mining and 
> Knowledge Discovery. In press, 2015.
> Design doc: 
> https://docs.google.com/document/d/1HOaPL_HJzTbL2tVdzbTjhr5wxVvPe9e-23S7rc2VcsY/edit?usp=sharing
> References
> [1] Fayyad, U., & Irani, K. (1993).
> "Multi-interval discretization of continuous-valued attributes for 
> classification learning."
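
For readers unfamiliar with the method in [1], here is a minimal sequential sketch of the Fayyad-Irani MDLP criterion for a single attribute. It is not the distributed implementation proposed in this issue, and it is not taken from the design doc. All function names (entropy, best_cut, mdlp_accepts, mdlp_cut_points) are hypothetical. As a simplification, it evaluates every midpoint between distinct adjacent values rather than only the boundary points.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Find the binary cut with the highest information gain.

    Expects values sorted ascending, with labels in the same order.
    Returns (gain, split_index, threshold), or None if all values are equal.
    """
    n = len(values)
    base = entropy(labels)
    best = None
    for i in range(1, n):
        if values[i] == values[i - 1]:
            continue  # a cut is only possible between distinct values
        left, right = labels[:i], labels[i:]
        gain = base - (i / n) * entropy(left) - ((n - i) / n) * entropy(right)
        if best is None or gain > best[0]:
            best = (gain, i, (values[i - 1] + values[i]) / 2)
    return best


def mdlp_accepts(labels, i, gain):
    """MDLP stopping rule: accept the cut only if
    gain > (log2(N - 1) + delta) / N."""
    n = len(labels)
    left, right = labels[:i], labels[i:]
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (
        k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right)
    )
    return gain > (math.log2(n - 1) + delta) / n


def mdlp_cut_points(values, labels):
    """Recursively split one attribute; return the sorted cut thresholds."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    vs = [v for v, _ in pairs]
    ls = [l for _, l in pairs]

    def recurse(vs, ls):
        if len(vs) < 2:
            return []
        best = best_cut(vs, ls)
        if best is None:
            return []
        gain, i, threshold = best
        if not mdlp_accepts(ls, i, gain):
            return []
        return recurse(vs[:i], ls[:i]) + [threshold] + recurse(vs[i:], ls[i:])

    return recurse(vs, ls)


if __name__ == "__main__":
    xs = list(range(20))
    ys = [0] * 10 + [1] * 10
    print(mdlp_cut_points(xs, ys))
```

The recursion stops as soon as the best available cut fails the MDLP test, so an attribute whose classes are cleanly separated gets few intervals and a noisy attribute may get none.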



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
