[ https://issues.apache.org/jira/browse/SYSTEMML-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mike Dusenberry updated SYSTEMML-1185:
--------------------------------------
    Attachment: approach.svg

SystemML Breast Cancer Project
------------------------------

                 Key: SYSTEMML-1185
                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1185
             Project: SystemML
          Issue Type: New Feature
            Reporter: Mike Dusenberry
            Assignee: Mike Dusenberry
         Attachments: approach.svg

h1. Predicting Breast Cancer Proliferation Scores with Apache Spark and Apache SystemML

h3. Overview

The [Tumor Proliferation Assessment Challenge 2016 (TUPAC16)|http://tupac.tue-image.nl/] is a "Grand Challenge" created for the [2016 Medical Image Computing and Computer Assisted Intervention (MICCAI 2016)|http://miccai2016.org/en/] conference. The goal of the challenge is to develop state-of-the-art algorithms for automatically predicting tumor proliferation scores from whole-slide histopathology images of breast tumors.

h3. Background

Breast cancer is the leading cause of cancer death in women in less-developed countries, and the second leading cause in developed countries, accounting for 29% of all cancers in women within the U.S. \[1]. Survival rates improve with earlier detection, giving pathologists and the medical world at large a strong incentive to develop methods for even earlier detection \[2]. There are many forms of breast cancer, including Ductal Carcinoma in Situ (DCIS), Invasive Ductal Carcinoma (IDC), Tubular Carcinoma of the Breast, Medullary Carcinoma of the Breast, Invasive Lobular Carcinoma, Inflammatory Breast Cancer, and several others \[3]. Across all of these forms, the rate at which breast cancer cells grow (proliferation) is a strong indicator of a patient's prognosis. Although there are many means of determining the presence of breast cancer, tumor proliferation speed has been proven to help pathologists determine the treatment for the patient.
The most common technique for determining proliferation speed is the mitotic count (mitotic index) estimate, in which a pathologist counts the dividing cell nuclei in hematoxylin and eosin (H&E) stained slide preparations to determine the number of mitotic bodies. From this count, the pathologist produces a proliferation score of 1, 2, or 3, ranging from better to worse prognosis \[4]. Unfortunately, this approach is known to have reproducibility problems due to variability in counting, as well as the difficulty of distinguishing between different grades.

References:
\[1] http://emedicine.medscape.com/article/1947145-overview#a3
\[2] http://emedicine.medscape.com/article/1947145-overview#a7
\[3] http://emedicine.medscape.com/article/1954658-overview
\[4] http://emedicine.medscape.com/article/1947145-workup#c12

h3. Goal & Approach

In an effort to automate the process of classification, this project aims to develop a large-scale deep learning approach for predicting tumor scores directly from the pixels of whole-slide histopathology images. Our proposed approach is based on a recent research paper from Stanford \[1]. Starting with 500 extremely high-resolution tumor slide images with accompanying score labels, we use Apache Spark in a preprocessing step to cut and filter the images into smaller square samples, generating 4.7 million samples for a total of ~7TB of data \[2]. We then use Apache SystemML on top of Spark to develop and train a custom, large-scale, deep convolutional neural network on these samples, making use of the familiar linear algebra syntax and automatically-distributed execution of SystemML \[3]. Our model takes as input the pixel values of the individual samples, and is trained to predict the correct tumor score classification for each one.
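The tiling-and-filtering idea behind the preprocessing step can be sketched in plain NumPy. This is an illustrative sketch only, not the project's actual {{Preprocessing.ipynb}} code; all function names, the 256-pixel tile size, and the near-white background heuristic are assumptions. In the real pipeline, per-slide functions like these would be applied inside Spark transformations across the 500 slides.

```python
import numpy as np

def tile_slide(slide, size=256):
    """Cut a slide image (H x W x 3 uint8 array) into non-overlapping
    square samples of shape (size, size, 3), dropping edge remainders."""
    h, w, _ = slide.shape
    tiles = []
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            tiles.append(slide[y:y + size, x:x + size])
    return tiles

def has_tissue(tile, threshold=0.5):
    """Keep a tile only if at least `threshold` of its pixels are
    non-background (background on H&E slides is near-white)."""
    gray = tile.mean(axis=2)               # rough grayscale
    tissue_fraction = (gray < 220).mean()  # dark-enough pixels
    return tissue_fraction >= threshold

# Example: a fake 1024x1024 "slide" that is half tissue, half background.
slide = np.full((1024, 1024, 3), 255, dtype=np.uint8)
slide[:, :512] = 100                       # left half is "tissue"
samples = [t for t in tile_slide(slide) if has_tissue(t)]
print(len(samples))                        # -> 8 (the 2x4 grid of tissue tiles)
```

Filtering out mostly-background tiles is what keeps the sample count (4.7 million) and dataset size (~7TB) manageable relative to the raw slides.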
In addition to distributed linear algebra, we aim to exploit task parallelism via parallel for-loops for hyperparameter optimization, as well as hardware acceleration for faster training via a GPU-backed runtime. Ultimately, we aim to develop a model that is significantly stronger than existing approaches for the task of breast cancer tumor proliferation score classification.

References:
\[1] https://web.stanford.edu/group/rubinlab/pubs/2243353.pdf
\[2] See [{{Preprocessing.ipynb}}|https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/Preprocessing.ipynb].
\[3] See [{{MachineLearning.ipynb}}|https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/MachineLearning.ipynb], [{{softmax_clf.dml}}|https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/softmax_clf.dml], and [{{convnet.dml}}|https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/convnet.dml].

!approach.svg!

----

h2. Systems Tasks

From a systems perspective, we aim to support multi-node, multi-GPU distributed SGD training in order to run large-scale experiments for the specific breast cancer use case.

To achieve this goal, the following steps are necessary:
# Single-node, CPU mini-batch SGD training (1 mini-batch at a time).
# Single-node, single-GPU mini-batch SGD training (1 mini-batch at a time).
# Single-node, multi-GPU data-parallel mini-batch SGD training ({{n}} parallel mini-batches for {{n}} GPUs at a time).
# Multi-node, CPU data-parallel mini-batch SGD training ({{n}} parallel mini-batches for {{n}} parallel tasks at a time).
# Multi-node, single-GPU data-parallel mini-batch SGD training ({{n}} parallel mini-batches for {{n}} total GPUs across the cluster at a time).
# Multi-node, multi-GPU data-parallel mini-batch SGD training ({{n}} parallel mini-batches for {{n}} total GPUs across the cluster at a time).
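The synchronous data-parallel pattern shared by steps 3-6 above can be sketched in plain NumPy: {{n}} workers each compute a gradient on their own mini-batch, the gradients are averaged, and a single model update is applied. A toy linear model with squared loss stands in for the convnet here; this is an illustrative sketch, not SystemML code, and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy linear model y = X @ w_true, trained with squared loss.
d, n_workers, batch_size = 5, 4, 32
w_true = rng.normal(size=d)
w = np.zeros(d)

def minibatch_gradient(w, X, y):
    """Squared-loss gradient for one mini-batch."""
    return 2.0 / len(y) * X.T @ (X @ w - y)

lr = 0.1
for step in range(200):
    # Each of the n "workers" draws its own mini-batch and computes a
    # local gradient; on a cluster these would run as parallel tasks,
    # one per GPU or CPU task.
    grads = []
    for _ in range(n_workers):
        X = rng.normal(size=(batch_size, d))
        y = X @ w_true
        grads.append(minibatch_gradient(w, X, y))
    # Synchronous update: average the n local gradients, apply once.
    w -= lr * np.mean(grads, axis=0)

print(np.linalg.norm(w - w_true))  # distance to the true weights
```

The averaging step is what makes the scheme equivalent to one large mini-batch of {{n x batch_size}} examples per update; the systems work in the steps above is about running those {{n}} gradient computations in parallel across GPUs and nodes.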
----

Here is a list of past and present JIRA epics and issues that have blocked, or are currently blocking, progress on the breast cancer project.

Overall Deep Learning Epic
* https://issues.apache.org/jira/browse/SYSTEMML-540
** This is the overall "Deep Learning" JIRA epic, with all issues either within or related to the epic.

Past
* https://issues.apache.org/jira/browse/SYSTEMML-633
* https://issues.apache.org/jira/browse/SYSTEMML-951
** Issue that completely blocked mini-batch training approaches.
* https://issues.apache.org/jira/browse/SYSTEMML-914
** Epic containing issues related to input DataFrame conversions that blocked getting data into the system entirely. Most of the issues specifically refer to existing, internal converters. 993 was a particularly large issue, and triggered a large body of work related to internal memory estimates that were incorrect. Also see 919, 946, & 994.
* https://issues.apache.org/jira/browse/SYSTEMML-1076
* https://issues.apache.org/jira/browse/SYSTEMML-1077
* https://issues.apache.org/jira/browse/SYSTEMML-948

Present
* https://issues.apache.org/jira/browse/SYSTEMML-1160
** Current open blocker to efficiently using a stochastic gradient descent approach.
* https://issues.apache.org/jira/browse/SYSTEMML-1078
** Current open blocker to training even an initial deep learning model for the project. This is another example of an internal compiler bug.
* https://issues.apache.org/jira/browse/SYSTEMML-686
** We need distributed convolution and max pooling operators.
* https://issues.apache.org/jira/browse/SYSTEMML-1159
** This is the main issue discussing the need for the {{parfor}} construct to support efficient, parallel hyperparameter tuning on a cluster with large datasets. The broken remote parfor in 1129 blocked this issue, which in turn blocked any meaningful work on training a deep neural net for the project.
* https://issues.apache.org/jira/browse/SYSTEMML-1142
** This was one of the blockers to doing hyperparameter tuning.
* https://issues.apache.org/jira/browse/SYSTEMML-1129
** This is an epic for the issue in which the {{parfor}} construct was broken for remote Spark cases, and was one of the blockers for doing hyperparameter tuning.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)