Re: Regarding incubator systemml/breast_cancer project

2017-04-23 Thread Mike Dusenberry
Hi Aishwarya,

Glad to hear that the preprocessing stage was successful!  As for the
`MachineLearning.ipynb` notebook, here is a general guide:


   - The `MachineLearning.ipynb` notebook essentially (1) loads in the
   training and validation DataFrames from the preprocessing step, (2)
   converts them to normalized & one-hot encoded SystemML matrices for
   consumption by the ML algorithms, and (3) explores training a couple of
   models.
   - To run, you'll need to start Jupyter in the context of PySpark via
   `PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter
   PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --jars
   $SYSTEMML_HOME/target/SystemML.jar`.  Note that if you have installed
   SystemML with pip from PyPI (`pip3 install systemml`), this will install
   our 0.13 release, and the `--jars $SYSTEMML_HOME/target/SystemML.jar`
   part will not be necessary.  If you instead have installed a
   bleeding-edge version of SystemML locally (git clone locally, maven
   build, `pip3 install -e src/main/python` as listed in
   `projects/breast_cancer/README.md`), the `--jars
   $SYSTEMML_HOME/target/SystemML.jar` part *is* necessary.  We are about to
   release 0.14, and for this project, I *would* recommend using a
   bleeding-edge install.
   - Once Jupyter has been started in the context of PySpark, the `sc`
   SparkContext object should be available.  Please let me know if you
   continue to see this issue.
   - The "Read in train & val data" section simply reads in the training
   and validation data generated in the preprocessing stage.  Be sure that the
   `size` setting is the same as the preprocessing size.  The percentage `p`
   setting determines whether the full or sampled DataFrames are loaded.  If
   you set `p = 1`, the full DataFrames will be used.  If you instead would
   prefer to use the smaller sampled DataFrames while getting started, please
   set it to the same value as used in the preprocessing to generate the
   smaller sampled DataFrames.
   - The `Extract X & Y matrices` section splits each of the train and
   validation DataFrames into effectively X & Y matrices (still as DataFrame
   types), with X containing the images, and Y containing the labels.
   - The `Convert to SystemML Matrices` section passes the X & Y DataFrames
   into a SystemML script that performs some normalization of the images &
   one-hot encoding of the labels, and then returns SystemML `Matrix` types.
   These are now ready to be passed into the subsequent algorithms.
   - The "Trigger Caching" and "Save Matrices" sections are experimental
   features, and are not necessary to execute.
   - Next come the two algorithms being explored in this notebook.  The
   "Softmax Classifier" is just a multi-class logistic regression model, and
   is there to serve as a baseline comparison with the subsequent
   convolutional neural net model.  You may wish to skip this softmax model
   and move to the latter convnet model further down in the notebook.
   - The actual softmax model is located at
   https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/softmax_clf.dml,
   and the notebook calls functions from that file.
   - The softmax sanity check just ensures that the model is able to
   completely overfit when given a tiny sample size.  This should yield ~100%
   training accuracy if the sample size in this section is small enough.  This
   is just a check to ensure that nothing else is wrong with the math or the
   data.
   - The softmax "Train" section will train a softmax model and return the
   weights (`W`) and biases (`b`) of the model as SystemML `Matrix` objects.
   Please adjust the hyperparameters in this section to your problem.
   - The softmax "Eval" section takes the trained weights and biases and
   evaluates the training and validation performance.
   - The next model is a LeNet-like convnet model.  The actual model is
   located at
   https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/convnet.dml,
   and the notebook simply calls functions from that file.
   - Once again, there is an initial sanity check for the ability to
   overfit on a small amount of data.
   - The "Hyperparameter Search" section contains a script that samples
   different hyperparams for the convnet, and saves the hyperparams and
   validation accuracy of each set after a single epoch of training.  These
   strings will be saved as files to HDFS.  Please feel free to adjust the
   range of the hyperparameters for your problem.  Please also feel free to
   try using `parfor` (parallel for-loop) instead of the while loop to speed
   up this section.  Note that this is still a work in progress.  The
   hyperparameter tuning in this section makes use of random search (as
   opposed to grid search), which has been promoted by Bergstra & Bengio to
   speed up the search time.
   - The "Train" section trains the convnet and returns the weights and
   biases as SystemML `Matrix` types.  In this s
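The normalization and one-hot encoding mentioned in the "Convert to SystemML
Matrices" bullet can be illustrated with a tiny pure-Python sketch.  Note
that the actual transformation happens inside a SystemML DML script, and its
exact scaling may differ; the function names and the [0, 1] scaling here are
illustrative assumptions only:

```python
def one_hot(labels, num_classes):
    # Convert 1-based class labels into one-hot rows,
    # e.g. label 2 of 3 classes -> [0.0, 1.0, 0.0].
    encoded = []
    for y in labels:
        row = [0.0] * num_classes
        row[y - 1] = 1.0
        encoded.append(row)
    return encoded

def normalize(images, max_val=255.0):
    # Scale raw pixel intensities from [0, max_val] into [0, 1].
    return [[p / max_val for p in img] for img in images]
```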
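As a rough illustration of the random-search loop described in the
"Hyperparameter Search" bullet, here is a minimal, self-contained Python
sketch.  All names, ranges, and the `train_eval` callback are illustrative
assumptions; in the actual notebook the training itself runs in DML and the
hyperparams + accuracies are written to HDFS:

```python
import random

def sample_hyperparams(rng):
    # Illustrative ranges only -- not the notebook's actual values.
    # Learning rate and regularization are sampled log-uniformly,
    # which is typical for random search over scale parameters.
    return {
        "lr": 10 ** rng.uniform(-6, -1),
        "reg": 10 ** rng.uniform(-6, -1),
        "batch_size": rng.choice([32, 64, 128]),
    }

def random_search(train_eval, num_trials, seed=42):
    # train_eval: callable mapping a hyperparam dict to a validation
    # accuracy (in the notebook, one epoch of convnet training).
    rng = random.Random(seed)
    results = []
    for _ in range(num_trials):
        hp = sample_hyperparams(rng)
        results.append((train_eval(hp), hp))
    # Best validation accuracy first.
    results.sort(key=lambda t: t[0], reverse=True)
    return results
```

Each trial is independent, which is exactly why the `parfor` suggestion
above applies: the trials can run in parallel with no coordination beyond
collecting the results.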

Re: Please reply asap : Regarding incubator systemml/breast_cancer project

2017-04-23 Thread Aishwarya Chaurasia
Hey,

Thank you so much for your help, sir. We were finally able to run
preprocess.py without any errors, and the results obtained were
satisfactory, i.e. we got five sets of DataFrames as you said we would.

But alas! When we tried to run MachineLearning.ipynb, the same NameError
came: https://paste.fedoraproject.org/paste/l3LFJreg~vnYEDTSTQH73l5M1UNdIGYhyRLivL9gydE=

Could you guide us again on how to proceed now?
Also, could you please provide an overview of the process
MachineLearning.ipynb follows to train the samples?
We have also tried all possible solutions to resolve the `sc` NameError.
It would be really kind of you if you looked into the matter asap.

Thanks a lot!

On 22-Apr-2017 5:19 PM, "Aishwarya Chaurasia" 
wrote:

> Hey,
>
> Thank you so much for your help sir. We were finally able to run
> preprocess.py without any errors. And the results obtained were
> satisfactory i.e we got five set of data frames like you said we would.
>
> But alas! When we tried to run MachineLearning.ipynb, the same NameError
> came: https://paste.fedoraproject.org/paste/l3LFJreg~vnYEDTSTQH73l5M1UNdIGYhyRLivL9gydE=
>
> Could you guide us again as to how to proceed now?
> Also, could you please provide an overview of the process
> MachineLearning.ipynb is following to train the samples.
>
> Thanks a lot!
>
> On 20-Apr-2017 12:16 AM,  wrote:
>
>> Hi Aishwarya,
>>
>> Looks like you've just encountered an out-of-memory error on one of the
>> executors.  Therefore, you just need to adjust the `spark.executor.memory`
>> and `spark.driver.memory` settings with higher amounts of RAM.  What is
>> your current setup?  I.e. are you using a cluster of machines, or a single
>> machine?  We generally use a large driver on one machine, and then a single
>> large executor on each other machine.  I would give a sizable amount of
>> memory to the driver, and about half the possible memory on the executors
>> so that the Python processes have enough memory as well.  PySpark has JVM
>> and Python components, and the Spark memory settings only pertain to the
>> JVM side, thus the need to save about half the executor memory for the
>> Python side.
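As a concrete launch-command sketch of the memory advice above: the driver
memory can be set with `--driver-memory` and the executor memory with
`--conf spark.executor.memory=...` when starting PySpark.  The sizes below
are placeholder assumptions to adapt to your machines, not recommended
values:

```shell
# Illustrative values only -- size these to your own machines.
# Give the driver a sizable amount of RAM, and give each executor's
# JVM roughly half of the machine's memory, leaving the other half
# for the Python worker processes.
PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter \
PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark \
  --driver-memory 20g \
  --conf spark.executor.memory=32g \
  --jars $SYSTEMML_HOME/target/SystemML.jar
```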
>>
>> Thanks!
>>
>> - Mike
>>
>> --
>>
>> Mike Dusenberry
>> GitHub: github.com/dusenberrymw
>> LinkedIn: linkedin.com/in/mikedusenberry
>>
>> Sent from my iPhone.
>>
>>
>> > On Apr 19, 2017, at 5:53 AM, Aishwarya Chaurasia <
>> aishwarya2...@gmail.com> wrote:
>> >
>> > Hello sir,
>> >
>> > We also wanted to ensure that the spark-submit command we're using is
>> the
>> > correct one for running 'preprocess.py'.
>> > Command :  /home/new/sparks/bin/spark-submit preprocess.py
>> >
>> >
>> > Thank you.
>> > Aishwarya Chaurasia.
>> >
>> > On 19-Apr-2017 3:55 PM, "Aishwarya Chaurasia" 
>> > wrote:
>> >
>> > Hello sir,
>> > On running the file preprocess.py we are getting the following error :
>> >
>> > https://paste.fedoraproject.org/paste/IAvqiiyJChSC0V9eeETe2F5M1UNdIGYhyRLivL9gydE=
>> >
>> > Can you please help us by looking into the error and kindly tell us the
>> > solution for it.
>> > Thanks a lot.
>> > Aishwarya Chaurasia
>> >
>> >
>> >> On 19-Apr-2017 12:43 AM,  wrote:
>> >>
>> >> Hi Aishwarya,
>> >>
>> >> Certainly, here is some more detailed information about `preprocess.py`:
>> >>
>> >>  * The preprocessing Python script is located at
>> >> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/preprocess.py.
>> >> Note that this is different from the library module at
>> >> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/breastcancer/preprocessing.py.
>> >>  * This script is used to preprocess a set of histology slide images,
>> >> which are `.svs` files in our case, and `.tiff` files in your case.
>> >>  * Lines 63-79 contain "settings" such as the output image sizes,
>> folder
>> >> paths, etc.  Of particular interest, line 72 has the folder path for
>> the
>> >> original slide images that should be commonly accessible from all
>> machines
>> >> being used, and lines 74-79 contain the names of the output DataFrames
>> that
>> >> will be saved.
>> >>  * Line 82 performs the actual preprocessing and creates a Spark
>> >> DataFrame with the following columns: slide number, tumor score,
>> molecular
>> >> score, sample.  The "sample" in this case is the actual small,
>> chopped-up
>> >> section of the image that has been extracted and flattened into a row
>> >> Vector.  For test images without labels (`training=false`), only the
>> slide
>> >> number and sample will be contained in the DataFrame (i.e. no labels).
>> >> This calls the `preprocess(...)` function located on line 371 of
>> >> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/breastcancer/preprocessing.py,
>> >> which is a different file.
>> >>  * Line 87 simply saves the above DataFrame to HDFS with the name from
>> >> line 74.
>> >>  * Line 93 splits the above DataFrame row-wise into separate "train

Re: Vector of Matrix

2017-04-23 Thread arijit chakraborty
Thanks Matthias for your reply! It would be great to have this functionality
in SystemML.


Regards,

Arijit



From: Matthias Boehm 
Sent: Saturday, April 22, 2017 12:18 AM
To: dev@systemml.incubator.apache.org
Subject: Re: Vector of Matrix

no, right now, we don't support structs or complex objects.

Regards,
Matthias

On 4/21/2017 4:17 AM, arijit chakraborty wrote:
> Hi,
>
>
> In R (as well as in Python), we can store lists within lists. Say I have 2
> matrices with different dimensions,
>
> x <- matrix(1:10, ncol=2)
> y <- matrix(1:5, ncol=1)
>
>
> FinalList <- list(x, y)
>
>
> Is it possible to do this in SystemML? I'm not looking for cbind or
> rbind.
>
>
> Thank you!
>
> Arijit
>


Jenkins build is back to normal : SystemML-DailyTest #946

2017-04-23 Thread jenkins
See