On Mon, Oct 1, 2018 at 12:18 PM Girish Vasmatkar < girish.vasmat...@hotwaxsystems.com> wrote:
> Hi All,
>
> We are very early into our Spark days, so the following may sound like a
> novice question :) I will try to keep this as short as possible.
>
> We are trying to use Spark to introduce a recommendation engine that can
> be used to provide product recommendations, and we need help with some
> design decisions before moving forward. Ours is a web application running
> on Tomcat. So far, I have created a simple POC (a standalone Java program)
> that reads in a CSV file, feeds it to FPGrowth, fits the data, and runs
> transformations. I would like to be able to do the following:
>
> - A scheduler runs nightly in Tomcat (which it does currently) and reads
> everything from the DB to train/fit the model. This can grow into some
> really large data, and every day we will have new data. Should I just use
> a SparkContext here, within my scheduler, to fit the model? Is this the
> correct way to go about this? I am also planning to save the model on S3,
> which should be okay. We have also thought about using HDFS. The
> scheduler's job will be just to create the model, save it, and be done
> with it.
> - On the product page, we can then use the saved model to display
> product recommendations for a particular product.
> - My understanding is that I should be able to use a SparkContext here
> in my web application to just load the saved model and use it to derive
> the recommendations. Is this a good design? The problem I see with this
> approach is that the SparkContext takes time to initialize, and this may
> cost dearly. Or should we keep one SparkContext per web application, so
> that a single instance is reused? We could initialize a SparkContext
> during the application-context initialization phase.
>
> Since I am fairly new to using Spark properly, please help me decide
> whether the way I plan to use Spark is the recommended way. I have also
> seen use cases involving Kafka communicating with Spark, but can we not
> do it directly using a SparkContext?
>
> I am sure a lot of my understanding is wrong, so please feel free to
> correct me.
>
> Thanks and Regards,
> Girish Vasmatkar
> HotWax Systems