[ 
https://issues.apache.org/jira/browse/SPARK-6509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Ramírez updated SPARK-6509:
----------------------------------
    Description: 
Minimum Description Length Discretizer

This method implements Fayyad's discretizer [1], based on the Minimum Description 
Length Principle (MDLP), to discretize continuous-valued datasets from a 
distributed perspective. We have developed a distributed version of the original 
algorithm, introducing some important changes.
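
For context, the MDLP stopping criterion from [1] accepts a candidate cut point T 
on attribute A over a set S of N instances only when the information gain pays for 
the cost of encoding the induced partition:

    Gain(A, T; S) > \frac{\log_2(N - 1)}{N} + \frac{\Delta(A, T; S)}{N}

    \Delta(A, T; S) = \log_2(3^k - 2) - [ k \cdot Ent(S) - k_1 \cdot Ent(S_1) - k_2 \cdot Ent(S_2) ]

where N = |S|, k is the number of classes present in S, and k_1, k_2 are the 
numbers of classes in the two subsets S_1, S_2 induced by T.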

-- Improvements on the discretizer:

    Support for sparse data.
    Multi-attribute processing. The whole process is carried out in a single 
step when the number of boundary points per attribute fits in one partition 
(<= 100K boundary points per attribute); a rough sketch of the per-attribute 
boundary-point computation follows this list.
    Support for attributes with a huge number of boundary points (> 100K 
boundary points per attribute), which is a rare situation.
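
The following Scala sketch (not the actual implementation; the function name, the 
nLabels parameter and the dense-vector handling are assumptions made for 
illustration) shows how class counts could be aggregated per (feature, value) pair 
so that boundary points can later be located wherever the class distribution 
changes:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Sketch only (names and structure are assumptions, not the SPARK-6509 code):
// aggregate class counts per (featureIndex, value) pair; boundary points are the
// values where these class distributions change. Dense vectors only, for brevity.
def classDistributionPerPoint(
    data: RDD[LabeledPoint],
    nLabels: Int): RDD[((Int, Double), Array[Long])] = {
  data.flatMap { lp =>
    lp.features.toArray.zipWithIndex.map { case (value, featureIndex) =>
      val counts = Array.fill(nLabels)(0L)
      counts(lp.label.toInt) += 1L
      ((featureIndex, value), counts)
    }
  }.reduceByKey { (a, b) =>
    // merge the class-count vectors of identical (feature, value) keys
    a.zip(b).map { case (x, y) => x + y }
  }
}

Grouping this result by feature index yields, per attribute, the sorted candidate 
points on which the MDLP criterion above is evaluated, either within a single 
partition (<= 100K points) or with the special handling for larger attributes.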

This software has been tested on two large real-world datasets:

    A dataset from the Protein Structure Prediction field, selected for the 
GECCO-2014 competition held in Vancouver on July 13th, 2014 
(http://cruncher.ncl.ac.uk/bdcomp/). The dataset has 32 million instances, 631 
attributes, 2 classes, 98% negative examples, and occupies about 56 GB of disk 
space when uncompressed.
    Epsilon dataset 
(http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#epsilon): 
400K instances and 2K attributes.

We have shown that our method runs 300 times faster than the sequential version 
on the first dataset, and that the discretized features also improve the accuracy 
of Naive Bayes.
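
As a minimal usage sketch of how the discretizer could plug into an MLlib workflow 
before Naive Bayes training (the MDLPDiscretizer object, its parameter names and 
the transform call below are assumptions based on this proposal, not an existing 
MLlib API; MLUtils.loadLibSVMFile and NaiveBayes.train are the standard MLlib 
calls, and sc is an existing SparkContext):

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

// Load the Epsilon dataset in LibSVM format (path is illustrative).
val data = MLUtils.loadLibSVMFile(sc, "data/epsilon_normalized")

// Fit the discretizer on every feature; names and parameters are assumptions
// from this proposal, not a released API.
val model = MDLPDiscretizer.train(
  data,
  continuousFeaturesIndexes = None, // None = treat all features as continuous
  maxBins = 50)                     // illustrative parameter

// Replace continuous values with bin indices and train Naive Bayes on them.
val discretized = data.map(lp => LabeledPoint(lp.label, model.transform(lp.features)))
val nbModel = NaiveBayes.train(discretized, lambda = 1.0)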

Publication: S. Ramírez-Gallego, S. García, H. Mouriño-Talin, D. Martínez-Rego, 
V. Bolón, A. Alonso-Betanzos, J.M. Benitez, F. Herrera. "Distributed Entropy 
Minimization Discretizer for Big Data Analysis under Apache Spark". IEEE 
BigDataSE Conference, Helsinki, August 2015.

Design doc: 
https://docs.google.com/document/d/1HOaPL_HJzTbL2tVdzbTjhr5wxVvPe9e-23S7rc2VcsY/edit?usp=sharing

References

[1] Fayyad, U., & Irani, K. (1993). "Multi-interval discretization of 
continuous-valued attributes for classification learning." In Proceedings of the 
13th International Joint Conference on Artificial Intelligence (IJCAI-93).

  was: (previous description, identical to the above except that it did not yet 
include the Publication reference)


> MDLP discretizer
> ----------------
>
>                 Key: SPARK-6509
>                 URL: https://issues.apache.org/jira/browse/SPARK-6509
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Sergio Ramírez
>
> (full issue description as above)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
