Re: Mahout contributions

Khurrum Nasim Mon, 02 May 2016 06:56:49 -0700

@Saikat - One thing I should say is that REST can be slow.  There is latency 
because of deserialization overhead on every request.  For very large datasets 
REST is probably not a good choice.



> On Apr 30, 2016, at 2:35 PM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
> 
> Andrew et al, I wanted to ask about a few items while I'm researching my dev 
> proposal.  What I'm looking to build is a streaming analytics platform to 
> do things like collaborative filtering and anomaly detection on large amounts 
> of streaming data that is either generated from events (Kafka) or arrives 
> through a firehose like Amazon Kinesis.  My initial thinking is that this 
> pipe of events/data would be connected to a REST API that sits on top of 
> Mahout; the backend underneath Mahout would use a hybrid of Spark and Spark 
> Streaming.  I'm wondering whether Samsara was designed from the ground up to 
> deal with large amounts of streaming data or whether this is not a use case 
> targeted yet.  My goal is to build a platform with several data sources/sinks 
> and produce intermediate checkpoints where transformations are applied to the 
> data before once again sending it to a set of sinks/sources.  Therefore the 
> potential fits into and out of Mahout include:
> 1) A REST API that leverages spray and akka and invokes one or more 
>    algorithms in mahout
> 2) A runtime environment with scala actors that allows one to either ingest 
>    data or perform transformations on data through the use of various 
>    classification and clustering algorithms; the runtime environment would 
>    ingest algorithms using mahout as a library
> 3) A rich set of actors dealing with various NoSQL and graph-based 
>    datastores (cassandra/neo4j/titan/mongo)
> 
> Some insight into Samsara would be great as I'm trying to understand the 
> entry points into mahout.
> Thanks in advance.
> 
>> From: ap....@outlook.com
>> To: dev@mahout.apache.org
>> Subject: Re: Mahout contributions
>> Date: Thu, 28 Apr 2016 21:43:19 +0000
>> 
>> I don't think that this sort of integration work would be a good fit 
>> directly in the Mahout project.  Mahout is more about math, algorithms and 
>> an environment to develop algorithms.  We stay away from direct platform 
>> integration.  In the past we did have some elasticsearch/mahout integration 
>> work that is not in the code base for this exact reason.  I would suggest 
>> that better places to contribute something like this may be: PIO 
>> (https://prediction.io/), or even directly as a package for spark 
>> http://spark-packages.org/ .
>> 
>> Projects integrating Mahout have recently been added to PIO: 
>> https://github.com/PredictionIO/template-scala-parallel-universal-recommendation .
>>   
>> 
>> I think that the project that you are proposing would be a better fit there.
>> 
>> Thanks,
>> 
>> Andy
>> 
>> 
>> ________________________________________
>> From: Saikat Kanjilal <sxk1...@hotmail.com>
>> Sent: Thursday, April 28, 2016 1:50 PM
>> To: dev@mahout.apache.org
>> Subject: Re: Mahout contributions
>> 
>> I want to start with social data as an example, e.g. data returned from 
>> the FB graph API as well as user Twitter data.  I will send some samples 
>> later if you're interested.
>> 
>> Sent from my iPhone
>> 
>>> On Apr 28, 2016, at 10:41 AM, Khurrum Nasim <khurrum.na...@useitc.com> 
>>> wrote:
>>> 
>>> 
>>> What JSON payload size are we talking about here?
>>> 
>>>> On Apr 28, 2016, at 1:32 PM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
>>>> 
>>>> Because EL gives you visualization and non-Lucene query constructs as 
>>>> well, and it already has a REST API that I plan on tying into mahout.  
>>>> I plan on wrapping some of the clustering algorithms that I implement 
>>>> using Mahout and Spark as a service, which can then make calls into 
>>>> other services (namely elasticsearch and a neo4j graph service).
>>>> 
>>>> Sent from my iPhone
>>>> 
>>>>> On Apr 28, 2016, at 10:22 AM, Khurrum Nasim <khurrum.na...@useitc.com> 
>>>>> wrote:
>>>>> 
>>>>> @Saikat - why use EL instead of Lucene directly?
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Apr 28, 2016, at 12:08 PM, Saikat Kanjilal <sxk1...@hotmail.com> 
>>>>>> wrote:
>>>>>> 
>>>>>> This is great information, thank you.  Based on this recommendation I 
>>>>>> won't create a JIRA yet but will start work on my project, and when 
>>>>>> the code approaches the percentages you are describing I will create 
>>>>>> the appropriate JIRAs and put together a proposal to send to the list.  
>>>>>> Sound OK?  Based on your latest updates to the wiki I will work on a 
>>>>>> handful of the clustering algorithms, since I see that the Spark 
>>>>>> implementations for these are not yet complete.
>>>>>> Thank you again
>>>>>> 
>>>>>>> From: ap....@outlook.com
>>>>>>> To: dev@mahout.apache.org
>>>>>>> Subject: Re: Mahout contributions
>>>>>>> Date: Thu, 28 Apr 2016 01:31:09 +0000
>>>>>>> 
>>>>>>> Saikat,
>>>>>>> 
>>>>>>> One other thing that I should say is that you do not need clearance or 
>>>>>>> input from the committers to begin work on your project, and the 
>>>>>>> interest can and should come from the community as a whole. You can 
>>>>>>> write a proposal as you've done, and if you don't see any "+1"s or 
>>>>>>> responses from the community as a whole within a few days, you may want 
>>>>>>> to explain in more detail, give examples and use cases.  If you are 
>>>>>>> still not seeing +1s or any responses from others then I think you can 
>>>>>>> assume that there may not be interest; this is usually how things work.
>>>>>>> 
>>>>>>> However, if it's something that you're passionate about and you feel like 
>>>>>>> you can deliver, this should not stop you.  People do not always read 
>>>>>>> the dev@ emails or have time to respond.  You can still move forward 
>>>>>>> with your proposed contribution by following the steps laid out in my 
>>>>>>> previous email; follow the protocol at:
>>>>>>> 
>>>>>>> http://mahout.apache.org/developers/how-to-contribute.html
>>>>>>> 
>>>>>>> and create a JIRA.  When you have reached a significant amount of 
>>>>>>> completion (around 70-80%), open a PR for review; this way you can 
>>>>>>> explain in more detail.
>>>>>>> 
>>>>>>> But please realize that when you open a JIRA for a new issue there is 
>>>>>>> some expectation of a commitment on your part to complete it.
>>>>>>> 
>>>>>>> For example, I am currently investigating some new plotting features.  
>>>>>>> I have spent a good deal of time this week and last already and am even 
>>>>>>> mocking up code as a sketch of what may become an implementation before 
>>>>>>> I open a "New Feature" JIRA for it.
>>>>>>> 
>>>>>>> My point is absolutely not to discourage you or anybody else from 
>>>>>>> opening JIRAs for new features, rather to let you know that when you 
>>>>>>> open a JIRA for a new issue, it tells others that you are working on 
>>>>>>> it, and thus may discourage someone else with a similar idea from 
>>>>>>> contributing this feature.  So it is best to open it once you've begun 
>>>>>>> your work and are committed to it.
>>>>>>> 
>>>>>>> Andy
>>>>>>> 
>>>>>>> ________________________________________
>>>>>>> From: Saikat Kanjilal <sxk1...@hotmail.com>
>>>>>>> Sent: Wednesday, April 27, 2016 8:24 PM
>>>>>>> To: dev@mahout.apache.org
>>>>>>> Subject: RE: Mahout contributions
>>>>>>> 
>>>>>>> Andrew,
>>>>>>> Thank you very much for your input.  I actually want to start a new 
>>>>>>> set of JIRAs.  Here's what I want to work on: I want to build a 
>>>>>>> framework that ties together search/visualization capability with some 
>>>>>>> machine learning algorithms, so essentially think of it as tying 
>>>>>>> elasticsearch and kibana into mahout.  The user can search for their 
>>>>>>> data with elasticsearch, and for deeper analysis they can feed that 
>>>>>>> data into one or more mahout backends.  Another interesting tie-in 
>>>>>>> might be to hack kibana to render ggplot-like graphics based on the 
>>>>>>> output of mahout algorithms (assuming this can be a kibana plugin).
>>>>>>> Before I go hog wild creating a bunch of JIRAs I'd like to know if 
>>>>>>> there's interest in this initiative.  The tool will bring together the 
>>>>>>> ELK stack with dynamic machine learning algorithms.  I can go into a 
>>>>>>> lot more detail around use cases if there's enough interest.
>>>>>>> Looking forward to your and other committers' input.
>>>>>>> Thanks
>>>>>>> 
>>>>>>>> From: ap....@outlook.com
>>>>>>>> To: dev@mahout.apache.org
>>>>>>>> Subject: Re: Mahout contributions
>>>>>>>> Date: Wed, 27 Apr 2016 20:16:38 +0000
>>>>>>>> 
>>>>>>>> Hello Saikat,
>>>>>>>> 
>>>>>>>> #1 and #2 above are already implemented.  #4 is tricky so I would not 
>>>>>>>> recommend it without a strong knowledge of the codebase, and #5 is now 
>>>>>>>> deprecated.  (I've just updated the algorithms grid to reflect this).  
>>>>>>>> The algorithms page includes both algorithms implemented in the 
>>>>>>>> math-scala library and algorithms which have CLI drivers written for 
>>>>>>>> them.
>>>>>>>> 
>>>>>>>> Please see: http://mahout.apache.org/developers/how-to-contribute.html
>>>>>>>> 
>>>>>>>> And please note that per that documentation, it is in everybody's best 
>>>>>>>> interest to keep messages on list, contacting committers directly is 
>>>>>>>> discouraged.
>>>>>>>> 
>>>>>>>> The best way to contribute (if you have not found a new bug or issue) 
>>>>>>>> would be for you to pick a single open issue in the mahout JIRA which 
>>>>>>>> is not already assigned, and start work on it.  When your work is 
>>>>>>>> ready for review, just open up a PR and the committers will review it. 
>>>>>>>>  Please note that if you do pick up an issue to work on, we do expect 
>>>>>>>> some amount of responsibility and reliability and a tangible amount of 
>>>>>>>> satisfactory work, since once you've marked a JIRA as something you're 
>>>>>>>> working on, others will pass on it.
>>>>>>>> 
>>>>>>>> Another good way to contribute would be to look for enhancements that 
>>>>>>>> you could make to existing code, not necessarily open JIRAs that need 
>>>>>>>> to be assigned to you.  For example, please see the recent contribution and 
>>>>>>>> workflow on: https://issues.apache.org/jira/browse/MAHOUT-1833 .
>>>>>>>> 
>>>>>>>> If you have something new that you'd like to implement, simply start a 
>>>>>>>> new JIRA issue and begin work on it.  In this case, when you have some 
>>>>>>>> code that is ready for review,  you can simply open up a PR for it and 
>>>>>>>> committers will review it.  For new implementations, we generally say 
>>>>>>>> that you should do this when you are at least 70-80% finished with 
>>>>>>>> your coding.
>>>>>>>> 
>>>>>>>> Thank You,
>>>>>>>> 
>>>>>>>> Andy
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ________________________________________
>>>>>>>> From: Saikat Kanjilal <sxk1...@hotmail.com>
>>>>>>>> Sent: Tuesday, April 26, 2016 7:17 PM
>>>>>>>> To: dev@mahout.apache.org
>>>>>>>> Subject: RE: Mahout contributions
>>>>>>>> 
>>>>>>>> Hello,
>>>>>>>> Following up on my last email with more specifics: I've looked 
>>>>>>>> through the wiki 
>>>>>>>> (https://mahout.apache.org/users/basics/algorithms.html) and I'm 
>>>>>>>> interested in implementing one or more of the following algorithms 
>>>>>>>> with Mahout using spark:
>>>>>>>> 1) Matrix Factorization with ALS
>>>>>>>> 2) Naive Bayes
>>>>>>>> 3) Weighted Matrix Factorization, SVD++
>>>>>>>> 4) Sparse TF-IDF Vectors from Text
>>>>>>>> 5) Lucene integration
>>>>>>>> Had a few questions:
>>>>>>>> 1) Which of these should I start with and where is there the 
>>>>>>>>    greatest need?
>>>>>>>> 2) Should I fork the repo and create branches for each of the above 
>>>>>>>>    implementations?
>>>>>>>> 3) Should I go ahead and create some JIRAs for these?
>>>>>>>> Would love to have some pointers to get started.
>>>>>>>> Regards
>>>>>>>> 
>>>>>>>> From: sxk1...@hotmail.com
>>>>>>>> To: dev@mahout.apache.org
>>>>>>>> Subject: Mahout contributions
>>>>>>>> Date: Wed, 30 Mar 2016 10:23:45 -0700
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hello Committers,
>>>>>>>> I was looking through the current jira tickets and was wondering if 
>>>>>>>> there's a particular area of Mahout that needs more help than others.  
>>>>>>>> Should I focus on contributing some algorithms using the DSL or on 
>>>>>>>> Samsara-related efforts?  I've finally got some bandwidth to do some 
>>>>>>>> work and would love some guidance before assigning myself some 
>>>>>>>> tickets.
>>>>>>>> Regards
>>> 
>
