Re: Mahout 1.0 goals

Pat Ferrel Tue, 11 Mar 2014 09:55:25 -0700

Doing an example site for the solr-recommender, Ted and I were faced with the 
same choices you mention below. He and I chose quite different architectures, 
either of which is perfectly good.

I spent some time thinking about what the common integration points are for web 
apps. Solr supports a large community of web app integrators and works with 
just about any data format and database out there. So in this special case 
virtually any web app framework would have one or more methods for integrating 
with Solr.

Why not Mahout?

There are at least two ends to web app integration: the input pipeline and 
serving the results. Not to mention background, potentially periodic, model 
creation. The web app framework usually defines the way data is served (html, 
json, REST, the list of formats and protocols goes on) so let me put that aside 
for now. To me this points to getting data into Mahout and out again. Ideally 
it should come in through an extremely flexible mechanism, which may also serve 
to get the data out.

Input and output is primarily about translating formats, Ids, and communicating 
with storage services (local fs, HDFS, S3, DB, …). I chose Cascading to process 
input in a mostly scalable way. Cascading does not yet have Schemas to support 
all the DBs, so I built one for my DB (MongoDB), but it does support most file 
systems. There has been some talk in that community about adding Schemas for 
DBs, which is also possible to do yourself. It may be possible to create 
several of the more common pipelines all the way from reading data from a 
logfile, Cassandra, S3, etc. through model creation to output to the web app’s 
primary store. This leaves it somewhat independent of the web app framework. If 
defined correctly it could have pluggable sink and source types and flexible 
format definitions.
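
To make "pluggable sink and source types and flexible format definitions" a 
bit more concrete, here is a minimal sketch in plain Java. Every name in it 
(PipelineSketch, Source, Sink, IdDictionary, run) is hypothetical and made up 
for illustration; none of it is Mahout or Cascading API, and it ignores 
scalability entirely. It only shows the shape: read lines from some source, 
parse them with a swappable format, translate external Ids to the dense integer 
indices a model would want, and hand the result to some sink.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: these types are illustrative only and are NOT part of
// Mahout or Cascading.
public class PipelineSketch {

    /** One user-item interaction carrying the application's external string Ids. */
    public static final class Interaction {
        public final String userId;
        public final String itemId;
        public Interaction(String userId, String itemId) {
            this.userId = userId;
            this.itemId = itemId;
        }
    }

    /** Pluggable input (assumed examples: a logfile, Cassandra, S3, MongoDB). */
    public interface Source {
        Iterable<String> readLines();
    }

    /** Pluggable output (assumed example: the web app's primary store). */
    public interface Sink {
        void write(int userIndex, int itemIndex);
    }

    /** Translates external string Ids to dense integer indices and back. */
    public static final class IdDictionary {
        private final Map<String, Integer> toIndex = new HashMap<>();
        private final List<String> toId = new ArrayList<>();

        public int indexOf(String id) {
            Integer existing = toIndex.get(id);
            if (existing != null) {
                return existing;
            }
            toIndex.put(id, toId.size());
            toId.add(id);
            return toId.size() - 1;
        }

        public String idOf(int index) {
            return toId.get(index);
        }
    }

    /** Runs source -> format -> Id translation -> sink; returns records written. */
    public static int run(Source source,
                          Function<String, Interaction> format,
                          IdDictionary users,
                          IdDictionary items,
                          Sink sink) {
        int written = 0;
        for (String line : source.readLines()) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            Interaction in = format.apply(line);
            sink.write(users.indexOf(in.userId), items.indexOf(in.itemId));
            written++;
        }
        return written;
    }

    public static void main(String[] args) {
        // In-memory stand-ins for a real source and sink.
        Source source = () -> Arrays.asList("alice,song-1", "bob,song-1", "alice,song-2");
        Function<String, Interaction> csv = line -> {
            String[] parts = line.split(",");
            return new Interaction(parts[0].trim(), parts[1].trim());
        };
        Sink sink = (u, i) -> System.out.println(u + " " + i);
        run(source, csv, new IdDictionary(), new IdDictionary(), sink);
    }
}
```

Swapping the CSV lambda for a TSV or JSON parser, or the in-memory source for 
one backed by a DB, would not touch run() at all, which is the independence 
from the web app framework I have in mind. The IdDictionary is kept around so 
the indices coming out of a model can be mapped back to the app's own Ids on 
the way out.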

Maybe there are better data pipeline frameworks than Cascading, and making this 
work in 80% of use cases will be a fair amount of work, but as long as Mahout 
has enough users it remains an important missing piece.

I suspect that any reasonable attempt at this input-to-Mahout-to-datastore 
pipeline would be considered for inclusion or reference in Mahout-Examples.

On Mar 8, 2014, at 2:31 PM, Saikat Kanjilal <[email protected]> wrote:

Ok, so the idea here is to tie in with, and form some strategic partnerships 
with, some other open source products and provide Mahout as one component of a 
web application, so the use cases for Mahout will be partly driven by the use 
cases for the web application itself. In a nutshell, a web application 
requires: 1) search, 2) recommendations, 3) a primary data store. The 
recommendations may be driven by the higher-level use cases, but the key piece 
here will be pushing Mahout into delivering real-time recommendations that 
someone can then perform searches over. One example might be to search for 
music recommendations, like what Spotify already does, and perform term 
filters, term queries, or other Lucene-based searches to deliver results. 
Another might be to identify how recommendations fit into the REST endpoints 
or, in the case of service-izing Mahout, they can be REST endpoints themselves. 
I've been thinking about this for a while, since lately I've seen a lot of 
discussions around Mahout being hard to use or pick up and learn. If there's 
enough interest I can go into more detail when we meet to discuss 1.0.

> Date: Sat, 8 Mar 2014 11:44:53 +0100
> From: [email protected]
> To: [email protected]
> Subject: Re: Mahout 1.0 goals
> 
> Hm, can you elaborate more on what you mean? IMHO Mahout is a library only, 
> so we should not build a complete MVC application inside this project. I 
> think this is something that people should build on top, like 
> prediction.io .
> 
> --sebastian
> 
> 
> On 03/08/2014 12:16 AM, Saikat Kanjilal wrote:
>> I was also wondering if there'd be any interest in building a plugin to 
>> interface with Elasticsearch and Spring. What I am thinking of is an 
>> MVC-type service that performs Lucene-like searches on recommendation 
>> algorithm data stored inside a low-latency data store. I saw that there was 
>> a discussion on a Solr recommender on Mahout and would be glad to help 
>> lead/build an Elasticsearch version.
>> 
>>> From: [email protected]
>>> Date: Fri, 7 Mar 2014 15:04:42 -0800
>>> Subject: Re: Mahout 1.0 goals
>>> To: [email protected]
>>> 
>>> There was not yet a meeting.
>>> 
>>> I owe the list a summary of what people said and some suggested
>>> roadmapping.  I will get to that on the weekend and we should be good for a
>>> hangout meeting sometime next week.
>>> 
>>> 
>>> 
>>> On Fri, Mar 7, 2014 at 10:35 AM, Saikat Kanjilal <[email protected]> wrote:
>>> 
>>>> Hey guys, I've been trying to follow the 1.0 goals. Was there already a
>>>> meeting on what the initial plans are for development, and notes from that?
>>>> I am particularly interested in deep learning and service-izing Mahout,
>>>> let me know.
>>>> Thanks
>>>> 
>>>>> From: [email protected]
>>>>> Date: Tue, 4 Mar 2014 19:32:40 -0800
>>>>> Subject: Re: Mahout 1.0 goals
>>>>> To: [email protected]; [email protected]
>>>>> 
>>>>> On Tue, Mar 4, 2014 at 2:24 PM, Sebastian Schelter <[email protected]>
>>>> wrote:
>>>>> 
>>>>>> - AFAIK it's also a problem to ship it license-wise as the required
>>>>>> libraries would not be Apache licensed
>>>>>> 
>>>>>> See this discussion from the Spark community for details:
>>>>>> 
>>>>>> https://github.com/apache/incubator-spark/pull/575
>>>>>> 
>>>>> 
>>>>> This is a real issue and getting a lot of time over on legal as well.
>>>>> 
>>>>> A non-optional LGPL dependency doesn't fly at this time.
>>>> 
>>>> 
>>                                      
>> 
>
