@Saikat - One thing I should say is that REST is slow: there is latency because of deserialization overhead. For very large datasets, REST is probably not a good choice.
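A rough way to see this cost, as a sketch only and not a benchmark of any particular REST stack: the Python snippet below times JSON encoding and decoding of a synthetic payload. The record fields and the `time_json_round_trip` helper are made up for illustration; real numbers depend on the serializer, the payload shape, and the network.

```python
import json
import time


def time_json_round_trip(num_records):
    """Encode and decode a synthetic payload; return (decoded, encode_s, decode_s)."""
    # Hypothetical records standing in for event data (not from the thread).
    payload = [
        {"user_id": i, "item_id": i % 100, "rating": (i % 5) + 1.0}
        for i in range(num_records)
    ]

    # Time serialization to a JSON string (what a REST server would send).
    start = time.perf_counter()
    encoded = json.dumps(payload)
    encode_s = time.perf_counter() - start

    # Time deserialization back into Python objects (what the client pays).
    start = time.perf_counter()
    decoded = json.loads(encoded)
    decode_s = time.perf_counter() - start

    return decoded, encode_s, decode_s


if __name__ == "__main__":
    for n in (1_000, 100_000):
        _, enc, dec = time_json_round_trip(n)
        print(f"{n} records: encode {enc:.4f}s, decode {dec:.4f}s")
```

Running it at a few payload sizes gives a feel for how the parsing cost changes with the number of records, which is the overhead I mean above.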
> On Apr 30, 2016, at 2:35 PM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
>
> Andrew et al, I wanted to ask about a few items while I'm researching my dev
> proposal. What I'm looking to build is a streaming analytics platform to
> do things like collaborative filtering and anomaly detection on large amounts
> of streaming data that are either generated from events (Kafka) or through a
> firehose like Amazon Kinesis. My initial thinking is that this pipe of
> events/data would be connected to a REST API that sits on top of Mahout; the
> backend underneath Mahout would use a hybrid of Spark and Spark
> Streaming. I'm wondering whether Samsara was designed from the ground up to
> deal with large amounts of streaming data or whether this is not a use case
> targeted yet. My goal is to build a platform with several data sources/sinks
> and produce intermediate checkpoints where transformations are applied to the
> data before once again sending to a set of sinks/sources. Therefore the
> potential fits into and out of Mahout include:
> 1) A REST API that leverages Spray and Akka and invokes one or more
>    algorithms in Mahout
> 2) A runtime environment with Scala actors that allows one to either
>    ingest data or perform transformations on data through the use of
>    various classification and clustering algorithms; the runtime
>    environment would ingest algorithms using Mahout as a library
> 3) A rich set of actors dealing with various NoSQL and graph-based
>    datastores (Cassandra/Neo4j/Titan/Mongo)
>
> Some insight into Samsara would be great as I'm trying to understand the
> entry points into Mahout.
> Thanks in advance.
>
>> From: ap....@outlook.com
>> To: dev@mahout.apache.org
>> Subject: Re: Mahout contributions
>> Date: Thu, 28 Apr 2016 21:43:19 +0000
>>
>> I don't think that this sort of integration work would be a good fit
>> directly to the Mahout project. Mahout is more about math, algorithms and
>> an environment to develop algorithms.
>> We stay away from direct platform
>> integration. In the past we did have some Elasticsearch/Mahout integration
>> work that is not in the code base for this exact reason. I would suggest
>> that better places to contribute something like this may be PIO
>> (https://prediction.io/), or even directly as a package for Spark:
>> http://spark-packages.org/ .
>>
>> Projects integrating Mahout have recently been added to PIO:
>> https://github.com/PredictionIO/template-scala-parallel-universal-recommendation
>>
>> I think that the project that you are proposing would be a better fit there.
>>
>> Thanks,
>>
>> Andy
>>
>>
>> ________________________________________
>> From: Saikat Kanjilal <sxk1...@hotmail.com>
>> Sent: Thursday, April 28, 2016 1:50 PM
>> To: dev@mahout.apache.org
>> Subject: Re: Mahout contributions
>>
>> I want to start with social data as an example, for example data returned
>> from the FB Graph API as well as user Twitter data. I will send some samples
>> later if you're interested.
>>
>> Sent from my iPhone
>>
>>> On Apr 28, 2016, at 10:41 AM, Khurrum Nasim <khurrum.na...@useitc.com>
>>> wrote:
>>>
>>>
>>> What type of JSON payload size are we talking about here?
>>>
>>>> On Apr 28, 2016, at 1:32 PM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
>>>>
>>>> Because EL gives you the visualization and non-Lucene type query
>>>> constructs as well, and also because it already has a REST API that I plan
>>>> on tying into Mahout. I plan on wrapping some of the clustering algorithms
>>>> that I implement using Mahout and Spark as a service which can then make
>>>> calls into other services (namely Elasticsearch and a Neo4j graph service).
>>>>
>>>> Sent from my iPhone
>>>>
>>>>> On Apr 28, 2016, at 10:22 AM, Khurrum Nasim <khurrum.na...@useitc.com>
>>>>> wrote:
>>>>>
>>>>> @Saikat - why use EL instead of Lucene directly?
>>>>>
>>>>>> On Apr 28, 2016, at 12:08 PM, Saikat Kanjilal <sxk1...@hotmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> This is great information, thank you. Based on this recommendation I
>>>>>> won't create a JIRA but will start work on my project, and when the code
>>>>>> approaches the percentages you are describing I will create the
>>>>>> appropriate JIRAs and put together a proposal to send to the list.
>>>>>> Sound ok? Based on your latest updates to the wiki I will work on a
>>>>>> handful of the clustering algorithms, since I see that the Spark
>>>>>> implementations for these are not yet complete.
>>>>>> Thank you again
>>>>>>
>>>>>>> From: ap....@outlook.com
>>>>>>> To: dev@mahout.apache.org
>>>>>>> Subject: Re: Mahout contributions
>>>>>>> Date: Thu, 28 Apr 2016 01:31:09 +0000
>>>>>>>
>>>>>>> Saikat,
>>>>>>>
>>>>>>> One other thing that I should say is that you do not need clearance or
>>>>>>> input from the committers to begin work on your project, and the
>>>>>>> interest can and should come from the community as a whole. You can
>>>>>>> write a proposal as you've done, and if you don't see any "+1"s or
>>>>>>> responses from the community as a whole within a few days, you may want
>>>>>>> to explain in more detail, give examples and use cases. If you are
>>>>>>> still not seeing +1s or any responses from others then I think you can
>>>>>>> assume that there may not be interest; this is usually how things work.
>>>>>>>
>>>>>>> However, if it's something that you're passionate about and you feel like
>>>>>>> you can deliver, this should not stop you. People do not always read
>>>>>>> the dev@ emails or have time to respond. You can still move forward
>>>>>>> with your proposed contribution by following the steps laid out in my
>>>>>>> previous email; follow the protocol at:
>>>>>>>
>>>>>>> http://mahout.apache.org/developers/how-to-contribute.html
>>>>>>>
>>>>>>> and create a JIRA.
>>>>>>> When you have reached a significant amount of
>>>>>>> completion (around 70-80%), open a PR for review; this way you can
>>>>>>> explain in more detail.
>>>>>>>
>>>>>>> But please realize that when you open a JIRA for a new issue there is
>>>>>>> some expectation of a commitment on your part to complete it.
>>>>>>>
>>>>>>> For example, I am currently investigating some new plotting features.
>>>>>>> I have spent a good deal of time this week and last already and am even
>>>>>>> mocking up code as a sketch of what may become an implementation before
>>>>>>> I open a "New Feature" JIRA for it.
>>>>>>>
>>>>>>> My point is absolutely not to discourage you or anybody else from
>>>>>>> opening JIRAs for new features, but rather to let you know that when you
>>>>>>> open a JIRA for a new issue, it tells others that you are working on
>>>>>>> it, and thus may discourage another person with a similar idea from
>>>>>>> contributing this feature. So it is best to open it once you've begun
>>>>>>> your work and are committed to it.
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: Saikat Kanjilal <sxk1...@hotmail.com>
>>>>>>> Sent: Wednesday, April 27, 2016 8:24 PM
>>>>>>> To: dev@mahout.apache.org
>>>>>>> Subject: RE: Mahout contributions
>>>>>>>
>>>>>>> Andrew, thank you very much for your input. I actually want to start a
>>>>>>> new set of JIRAs. Here's what I want to work on: I want to build a
>>>>>>> framework that ties together search/visualization capability with some
>>>>>>> machine learning algorithms, so essentially think of it as tying
>>>>>>> Elasticsearch and Kibana into Mahout. The user can search for their
>>>>>>> data with Elasticsearch, and for deeper analysis on that data they can
>>>>>>> feed that data into one or more Mahout backends for analysis.
>>>>>>> Another
>>>>>>> interesting tie-in might be to hack Kibana to render ggplot-like
>>>>>>> graphics based on the output of Mahout algorithms (assuming this can be
>>>>>>> a Kibana plugin).
>>>>>>> Before I go hog wild and create a bunch of JIRAs I'd like to know if
>>>>>>> there's interest in this initiative. The tool will bring together the
>>>>>>> ELK stack with dynamic machine learning algorithms. I can go into a
>>>>>>> lot more detail around use cases if there's enough interest.
>>>>>>> Looking forward to your and other committers' input. Thanks
>>>>>>>
>>>>>>>> From: ap....@outlook.com
>>>>>>>> To: dev@mahout.apache.org
>>>>>>>> Subject: Re: Mahout contributions
>>>>>>>> Date: Wed, 27 Apr 2016 20:16:38 +0000
>>>>>>>>
>>>>>>>> Hello Saikat,
>>>>>>>>
>>>>>>>> #1 and #2 above are already implemented. #4 is tricky so I would not
>>>>>>>> recommend it without a strong knowledge of the codebase, and #5 is now
>>>>>>>> deprecated. (I've just updated the algorithms grid to reflect this.)
>>>>>>>> The algorithms page includes both algorithms implemented in the
>>>>>>>> math-scala library and algorithms which have CLI drivers written for
>>>>>>>> them.
>>>>>>>>
>>>>>>>> Please see: http://mahout.apache.org/developers/how-to-contribute.html
>>>>>>>>
>>>>>>>> And please note that per that documentation, it is in everybody's best
>>>>>>>> interest to keep messages on list; contacting committers directly is
>>>>>>>> discouraged.
>>>>>>>>
>>>>>>>> The best way to contribute (if you have not found a new bug or issue)
>>>>>>>> would be for you to pick a single open issue in the Mahout JIRA which
>>>>>>>> is not already assigned, and start work on it. When your work is
>>>>>>>> ready for review, just open up a PR and the committers will review it.
>>>>>>>> Please note that if you do pick up an issue to work on, we do expect
>>>>>>>> some amount of responsibility and reliability and a tangible amount of
>>>>>>>> satisfactory work, since once you've marked a JIRA as something you're
>>>>>>>> working on, others will pass on it.
>>>>>>>>
>>>>>>>> Another good way to contribute would be to look for enhancements that
>>>>>>>> you could make to existing code, not necessarily open JIRAs that need
>>>>>>>> to be assigned to you. For example please see the recent contribution
>>>>>>>> and workflow on: https://issues.apache.org/jira/browse/MAHOUT-1833 .
>>>>>>>>
>>>>>>>> If you have something new that you'd like to implement, simply start a
>>>>>>>> new JIRA issue and begin work on it. In this case, when you have some
>>>>>>>> code that is ready for review, you can simply open up a PR for it and
>>>>>>>> committers will review it. For new implementations, we generally say
>>>>>>>> that you should do this when you are at least 70-80% finished with
>>>>>>>> your coding.
>>>>>>>>
>>>>>>>> Thank You,
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ________________________________________
>>>>>>>> From: Saikat Kanjilal <sxk1...@hotmail.com>
>>>>>>>> Sent: Tuesday, April 26, 2016 7:17 PM
>>>>>>>> To: dev@mahout.apache.org
>>>>>>>> Subject: RE: Mahout contributions
>>>>>>>>
>>>>>>>> Hello, following up on my last email with more specifics, I've looked
>>>>>>>> through the wiki
>>>>>>>> (https://mahout.apache.org/users/basics/algorithms.html) and I'm
>>>>>>>> interested in implementing one or more of the following algorithms
>>>>>>>> with Mahout using Spark:
>>>>>>>> 1) Matrix Factorization with ALS
>>>>>>>> 2) Naive Bayes
>>>>>>>> 3) Weighted Matrix Factorization, SVD++
>>>>>>>> 4) Sparse TF-IDF Vectors from Text
>>>>>>>> 5) Lucene integration.
>>>>>>>> Had a few questions:
>>>>>>>> 1) Which of these should I start with and where is there the greatest
>>>>>>>>    need?
>>>>>>>> 2) Should I fork the repo and create branches for each of the above
>>>>>>>>    implementations?
>>>>>>>> 3) Should I go ahead and create some JIRAs for these?
>>>>>>>> Would love to have some pointers to get started.
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> From: sxk1...@hotmail.com
>>>>>>>> To: dev@mahout.apache.org
>>>>>>>> Subject: Mahout contributions
>>>>>>>> Date: Wed, 30 Mar 2016 10:23:45 -0700
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hello Committers, I was looking through the current JIRA tickets and
>>>>>>>> was wondering if there's a particular area of Mahout that needs some
>>>>>>>> more help than others. Should I focus on contributing some algorithms
>>>>>>>> using the DSL or on Samsara-related efforts? I've finally got some
>>>>>>>> bandwidth to do some work and would love some guidance before assigning
>>>>>>>> myself some tickets.
>>>>>>>> Regards
>>>
>