[ https://issues.apache.org/jira/browse/SPARK-6509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431341#comment-15431341 ]
Barry Becker commented on SPARK-6509:
-------------------------------------

I may have missed the reasoning somewhere, but why was this marked wontfix? It seems like it would be a good addition.

> MDLP discretizer
> ----------------
>
>                 Key: SPARK-6509
>                 URL: https://issues.apache.org/jira/browse/SPARK-6509
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Sergio Ramírez
>
> Minimum Description Length Discretizer
> This method implements Fayyad's discretizer [1], which is based on the
> Minimum Description Length Principle (MDLP), in order to discretize
> continuous datasets in a distributed setting. We have developed a
> distributed version of the original algorithm, with some important changes.
> -- Improvements on discretizer:
> Support for sparse data.
> Multi-attribute processing. The whole process is carried out in a single
> step when the number of boundary points per attribute fits well in one
> partition (<= 100K boundary points per attribute).
> Support for attributes with a huge number of boundary points (> 100K
> boundary points per attribute). This is a rare situation.
> This software has been tested on two large real-world datasets:
> A dataset selected for the GECCO-2014 competition (Vancouver, July 13th,
> 2014), which comes from the Protein Structure Prediction field
> (http://cruncher.ncl.ac.uk/bdcomp/). The dataset has 32 million instances,
> 631 attributes, 2 classes, 98% negative examples, and occupies about 56GB
> of disk space when uncompressed.
> Epsilon dataset:
> http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#epsilon.
> 400K instances and 2K attributes.
> We have demonstrated that our method performs 300 times faster than the
> sequential version on the first dataset, and also improves the accuracy of
> Naive Bayes.
> Publication: S. Ramírez-Gallego, S. García, H. Mouriño-Talin, D.
> Martínez-Rego, V. Bolón, A. Alonso-Betanzos, J.M. Benitez, F. Herrera. "Data
> Discretization: Taxonomy and Big Data Challenge", WIREs Data Mining and
> Knowledge Discovery. In press, 2015.
> Design doc:
> https://docs.google.com/document/d/1HOaPL_HJzTbL2tVdzbTjhr5wxVvPe9e-23S7rc2VcsY/edit?usp=sharing
> References
> [1] Fayyad, U., & Irani, K. (1993).
> "Multi-interval discretization of continuous-valued attributes for
> classification learning."