Re: [Architecture] Adding RNN to WSO2 Machine Learner

Thamali Wijewardhana Thu, 21 Apr 2016 21:58:03 -0700

Hi Imesh,
Thanks a lot for the comments; they were really helpful.

On Fri, Apr 22, 2016 at 6:33 AM, Nirmal Fernando <nir...@wso2.com> wrote:


> [Removed architecture@]
> Will do.
>
> On Fri, Apr 22, 2016 at 12:05 AM, Yudhanjaya Wijeratne <
> yudhanj...@wso2.com> wrote:
>
>> Hi Nirmal, Thamali has briefed me on this article. Could you please
>> provide a technical review? I'll do the grammar once you've approved it.
>> Best, Yudha
>> On Apr 21, 2016 6:03 PM, "Thamali Wijewardhana" <tham...@wso2.com> wrote:
>>
>>> Hi,
>>>
>>> I have finished writing the article [1], which compares the
>>> Deeplearning4j and Keras libraries on the Recurrent Neural Network (RNN)
>>> algorithm.
>>> I have also identified the reasons for the low performance of the
>>> Deeplearning4j library using Java Flight Recorder (JFR) and flame graphs,
>>> and included them in the article.
>>>
>>> [1]
>>> https://docs.google.com/a/wso2.com/document/d/1CGq1y5QBzW6EaHyf-UqAiatxLumb6lo_mRLjYZWD18o/edit?usp=sharing
>>>
>>> Thanks
>>>
>>>
>>> On Fri, Apr 8, 2016 at 7:20 PM, Thamali Wijewardhana <tham...@wso2.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I used a dataset with 25,000 rows; its size is 80 MB.
>>>>
>>>> The link to the dataset is:
>>>>
>>>> http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Apr 8, 2016 at 3:07 PM, Srinath Perera <srin...@wso2.com>
>>>> wrote:
>>>>
>>>>> Thamali, how big is the data set you are using? (Please give me a link
>>>>> to the data set as well.)
>>>>>
>>>>> Nirmal, shall we compare the accuracy of RNN vs. Upul's rolling window
>>>>> method?
>>>>>
>>>>> --Srinath
>>>>>
>>>>> On Fri, Apr 8, 2016 at 9:23 AM, Thamali Wijewardhana <tham...@wso2.com
>>>>> > wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I ran the RNN algorithm using the Deeplearning4j library and the Keras
>>>>>> Python library. The dataset, hyperparameters, network architecture, and
>>>>>> hardware platform were the same in both runs. The time comparison is
>>>>>> given below:
>>>>>>
>>>>>> Deeplearning4j library: 40 minutes per epoch
>>>>>> Keras library: 4 minutes per epoch
>>>>>>
>>>>>> I also compared the accuracies [1]. The Deeplearning4j library gives
>>>>>> lower accuracy than the Keras library.
>>>>>>
>>>>>> [1]
>>>>>> https://docs.google.com/spreadsheets/d/1-EvC1P7N90k1S_Ly6xVcFlEEKprh7r41Yk8aI6DiSaw/edit#gid=1050346562
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 1, 2016 at 10:12 AM, Thamali Wijewardhana <
>>>>>> tham...@wso2.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> I have scheduled a review for Monday (4th of April).
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> On Thu, Mar 31, 2016 at 3:21 PM, Srinath Perera <srin...@wso2.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Please set up a review. Shall we do it on Monday?
>>>>>>>>
>>>>>>>> On Thu, Mar 31, 2016 at 2:15 PM, Thamali Wijewardhana <
>>>>>>>> tham...@wso2.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> We have created a Spark program to prove the feasibility of adding
>>>>>>>>> the RNN algorithm to Machine Learner.
>>>>>>>>> This program demonstrates all the steps in Machine Learner:
>>>>>>>>>
>>>>>>>>> Uploading a dataset
>>>>>>>>>
>>>>>>>>> Selecting the hyper parameters for the model
>>>>>>>>>
>>>>>>>>> Creating an RNN model from the data and training the model
>>>>>>>>>
>>>>>>>>> Calculating the accuracy of the model
>>>>>>>>>
>>>>>>>>> Saving the model (as a serialized object)
>>>>>>>>>
>>>>>>>>> Predicting using the model
>>>>>>>>>
>>>>>>>>> This program is based on Deeplearning4j and the Apache Spark pipeline.
>>>>>>>>> Deeplearning4j was used as the deep learning library for the recurrent
>>>>>>>>> neural network algorithm. Because the program must be based on the
>>>>>>>>> Spark pipeline, the main challenge was using the Deeplearning4j library
>>>>>>>>> with it. Every component used in a Spark pipeline must be compatible
>>>>>>>>> with the pipeline API. Components that are not compatible have to be
>>>>>>>>> wrapped in an org.apache.spark.ml.PredictionModel object.
>>>>>>>>>
>>>>>>>>> We have designed a pipeline with a sequence of stages (transformers
>>>>>>>>> and estimators):
>>>>>>>>>
>>>>>>>>> 1. Tokenizer (Transformer): Splits each sequence of data into tokens.
>>>>>>>>> (For example, in sentiment analysis, it splits text into words.)
>>>>>>>>>
>>>>>>>>> 2. Vectorizer (Transformer): Transforms features into vectors.
>>>>>>>>>
>>>>>>>>> 3. RNN algorithm (Estimator): Trains on a data frame and produces an
>>>>>>>>> RNN model.
>>>>>>>>>
>>>>>>>>> 4. RNN model (Transformer): Transforms a data frame with features into
>>>>>>>>> a data frame with predictions.
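
The four stages above can be sketched in plain Python to show the transformer/estimator contract: a transformer exposes transform(), an estimator exposes fit() and returns a model that is itself a transformer. This is only an illustration, not the Spark or Deeplearning4j API. Every class name here is hypothetical, rows are plain dicts rather than Spark data frames, and a trivial majority-class estimator stands in for the RNN.

```python
from collections import Counter


class Tokenizer:
    """Transformer: adds a 'tokens' field by lower-casing and splitting text."""

    def transform(self, rows):
        return [{**r, "tokens": r["text"].lower().split()} for r in rows]


class Vectorizer:
    """Transformer: maps tokens to a fixed-length bag-of-words count vector."""

    def __init__(self, vocabulary):
        self.index = {word: i for i, word in enumerate(vocabulary)}

    def transform(self, rows):
        out = []
        for r in rows:
            vec = [0] * len(self.index)
            for token in r["tokens"]:
                if token in self.index:
                    vec[self.index[token]] += 1
            out.append({**r, "features": vec})
        return out


class MajorityModel:
    """Transformer produced by the estimator: adds a 'prediction' field."""

    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        return [{**r, "prediction": self.label} for r in rows]


class MajorityEstimator:
    """Estimator (stand-in for the RNN): fit() returns a fitted model."""

    def fit(self, rows):
        label = Counter(r["label"] for r in rows).most_common(1)[0][0]
        return MajorityModel(label)


class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Runs transformers as-is and replaces each estimator with its model."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


train = [
    {"text": "Good good movie", "label": 1},
    {"text": "really good", "label": 1},
    {"text": "bad plot", "label": 0},
]
pipeline = Pipeline([Tokenizer(), Vectorizer(["good", "bad"]), MajorityEstimator()])
model = pipeline.fit(train)
predictions = model.transform([{"text": "Bad bad acting"}])
```

Training uses Pipeline.fit(), while testing and prediction reuse the returned PipelineModel, which mirrors the two usages described for the diagrams.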
>>>>>>>>>
>>>>>>>>> The diagrams below explain the stages of the pipeline. The first
>>>>>>>>> diagram illustrates the training usage of the pipeline, and the second
>>>>>>>>> illustrates the testing and prediction usage.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I have also tuned the hyperparameters of the RNN model [1] and found
>>>>>>>>> the values that optimize the accuracy of the model.
>>>>>>>>> Given below are the hyperparameters relevant to the RNN algorithm and
>>>>>>>>> their tuned values:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Number of epochs: 10
>>>>>>>>>
>>>>>>>>> Number of iterations: 1
>>>>>>>>>
>>>>>>>>> Learning rate: 0.02
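
As a rough illustration of how these three hyperparameters interact, the sketch below runs mini-batch gradient descent on a toy one-parameter model (y = w * x) instead of an RNN. It assumes that "iterations" means the number of parameter updates applied to each mini-batch, which is an interpretation and not something stated in this thread. The function name, the toy data, and the batch size are all hypothetical.

```python
def train(data, epochs=10, iterations=1, learning_rate=0.02, batch_size=1):
    """Fit y = w * x by mini-batch gradient descent; return (w, loss per epoch)."""
    w = 0.0
    history = []
    for _ in range(epochs):                  # one epoch = one full pass over the data
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            for _ in range(iterations):      # assumed: updates per mini-batch
                # Gradient of the mean squared error with respect to w.
                grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
                w -= learning_rate * grad    # step size set by the learning rate
        loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
        history.append(loss)
    return w, history


# Toy data generated from y = 2 * x, so w should approach 2.
w, history = train([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
```

The loss is recorded once per epoch, so history has one entry for each of the 10 epochs.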
>>>>>>>>>
>>>>>>>>> We used the aclImdb sentiment analysis data set for this program,
>>>>>>>>> and with the above hyperparameters we achieved 60% accuracy. We are
>>>>>>>>> now trying to improve the accuracy and efficiency of our algorithm.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://docs.google.com/spreadsheets/d/1Wcta6i2k4Je_5l16wCVlH6zBMNGIb-d7USaWdbrkrSw/edit?ts=56fcdc9b#gid=2118685173
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Mar 25, 2016 at 10:18 AM, Thamali Wijewardhana <
>>>>>>>>> tham...@wso2.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> One of the biggest obstacles in machine learning and deep learning
>>>>>>>>>> is getting data into a format that neural nets can understand.
>>>>>>>>>> Neural nets understand vectors. Therefore, vectorization is an
>>>>>>>>>> important part of building neural network algorithms.
>>>>>>>>>>
>>>>>>>>>> Canova is a vectorization library for machine learning that is
>>>>>>>>>> associated with the Deeplearning4j library. It is designed to support
>>>>>>>>>> all major types of input data, such as text, CSV, image, audio, and
>>>>>>>>>> video.
>>>>>>>>>>
>>>>>>>>>> In our project to add RNN to Machine Learner, we have to use a
>>>>>>>>>> vectorizing component to convert input data to vectors. I think
>>>>>>>>>> Canova is a better option for building a generic vectorizing
>>>>>>>>>> component, and I am researching how to use it for this purpose.
>>>>>>>>>>
>>>>>>>>>> Any suggestions on this are highly appreciated.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 2, 2016 at 2:25 PM, Thamali Wijewardhana <
>>>>>>>>>> tham...@wso2.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Srinath,
>>>>>>>>>>>
>>>>>>>>>>> We have decided to implement only classification first. Once we
>>>>>>>>>>> complete classification, we hope to do next-value prediction too.
>>>>>>>>>>> We are basically trying to implement a program to make sure that
>>>>>>>>>>> the Deeplearning4j library we are using is compatible with the
>>>>>>>>>>> Apache Spark pipeline. We are also trying to demonstrate all the
>>>>>>>>>>> machine learning steps with that program.
>>>>>>>>>>>
>>>>>>>>>>> We are now using the aclImdb sentiment analysis data set to verify
>>>>>>>>>>> the accuracy of the RNN model we create.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Thamali
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 2, 2016 at 10:38 AM, Srinath Perera <
>>>>>>>>>>> srin...@wso2.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Thamali,
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>    1. RNN can do both classification and next-value prediction.
>>>>>>>>>>>>    Are we trying to do both?
>>>>>>>>>>>>    2. When Upul played with it, he had trouble getting the
>>>>>>>>>>>>    deeplearning4j implementation to work with the next-value
>>>>>>>>>>>>    prediction scenario. Is it fixed?
>>>>>>>>>>>>    3. What are the data sets we will use to verify the
>>>>>>>>>>>>    accuracy of RNN after integration?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --Srinath
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Mar 1, 2016 at 3:44 PM, Thamali Wijewardhana <
>>>>>>>>>>>> tham...@wso2.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> We are currently working on a project to add the Recurrent Neural
>>>>>>>>>>>>> Network (RNN) algorithm to Machine Learner. RNN is a deep learning
>>>>>>>>>>>>> algorithm that has achieved record-breaking accuracy. For more
>>>>>>>>>>>>> information on RNN, please refer to link [1].
>>>>>>>>>>>>>
>>>>>>>>>>>>> We have decided to use Deeplearning4j, which is an open source
>>>>>>>>>>>>> deep learning library that scales on Spark and Hadoop.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Since there is a plan to add the Spark pipeline to Machine
>>>>>>>>>>>>> Learner, we have decided to apply the Spark pipeline concept to our
>>>>>>>>>>>>> project.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have designed an architecture for the RNN implementation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This architecture is designed to be compatible with the Spark
>>>>>>>>>>>>> pipeline.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The data set is taken in CSV format and then converted to a
>>>>>>>>>>>>> Spark data frame, since Apache Spark works mostly with data frames.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The next step is a transformer that tokenizes the sequential
>>>>>>>>>>>>> data. A tokenizer takes a sequence of data and breaks it into
>>>>>>>>>>>>> individual units. For example, it can be used to break a sentence
>>>>>>>>>>>>> into words.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The next step is another transformer, which converts tokens to
>>>>>>>>>>>>> vectors. This must be done because the features should be added to
>>>>>>>>>>>>> the Spark pipeline in org.apache.spark.mllib.linalg.VectorUDT format.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Next, the transformed data is fed to the data set iterator.
>>>>>>>>>>>>> This is an object of a class that implements
>>>>>>>>>>>>> org.deeplearning4j.datasets.iterator.DataSetIterator. The data set
>>>>>>>>>>>>> iterator traverses a data set and prepares data for neural
>>>>>>>>>>>>> networks.
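
To make the role of a data set iterator concrete, here is a minimal sketch in plain Python of what "traversing a data set and preparing data" can look like for variable-length sequences: the data is cut into mini-batches and each batch is padded to a common length. It does not reproduce the DataSetIterator interface. The function name, the padding scheme, and the returned tuple layout are assumptions made for illustration.

```python
def iterate_batches(sequences, labels, batch_size, pad_value=0):
    """Yield (padded_sequences, batch_labels, original_lengths) mini-batches."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        longest = max(len(seq) for seq in batch)
        # Pad every sequence in the batch to the length of the longest one.
        padded = [seq + [pad_value] * (longest - len(seq)) for seq in batch]
        lengths = [len(seq) for seq in batch]
        yield padded, labels[start:start + batch_size], lengths


# Three token-id sequences of different lengths, two per batch.
batches = list(iterate_batches([[1, 2, 3], [4], [5, 6]], [1, 0, 1], batch_size=2))
```

Keeping the original lengths alongside the padded batch lets a recurrent model ignore the padding positions.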
>>>>>>>>>>>>>
>>>>>>>>>>>>> The next component is the RNN algorithm, which is an
>>>>>>>>>>>>> estimator. The data from the data set iterator is fed to the RNN,
>>>>>>>>>>>>> and a model is generated. This model can then be used for
>>>>>>>>>>>>> predictions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We have decided to complete this project in two steps:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - First, create a Spark pipeline program containing the steps
>>>>>>>>>>>>>    in Machine Learner (uploading a dataset, generating a model,
>>>>>>>>>>>>>    calculating accuracy, and prediction) and check whether the
>>>>>>>>>>>>>    project is feasible.
>>>>>>>>>>>>>    - Next, add the algorithm to ML.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We have almost completed the first step, and we are now
>>>>>>>>>>>>> collecting more data and tuning the hyperparameters.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://docs.google.com/document/d/1edg1fdKCYR7-B1oOLy2kon179GSs6x2Zx9oSRDn_NEU/edit
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> 
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> ============================
>>>>>>>>>>>> Srinath Perera, Ph.D.
>>>>>>>>>>>>    http://people.apache.org/~hemapani/
>>>>>>>>>>>>    http://srinathsview.blogspot.com/
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> ============================
>>>>>>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>>>>>>> Site: http://home.apache.org/~hemapani/
>>>>>>>> Photos: http://www.flickr.com/photos/hemapani/
>>>>>>>> Phone: 0772360902
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>
>
> --
>
> Thanks & regards,
> Nirmal
>
> Team Lead - WSO2 Machine Learner
> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
> Mobile: +94715779733
> Blog: http://nirmalfdo.blogspot.com/
>
>
>

_______________________________________________
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
