Finally, the whole point of an ML environment is to enable pipeline customization. The major criticism of Mahout is essentially this: "we can't integrate and customize pipelines using Mahout's methods, because Mahout throws us into a bash-only environment to do that, and that's silly."
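(As a concrete instance of the customization in question -- a minimal sketch of a typical pre-step, picking the latest time-stamped input. The directory names are hypothetical; with HDFS one would feed in `hadoop fs -ls` output rather than a literal list. Since ISO-8601 dates sort lexicographically, a plain `sort` suffices:)

```shell
# Hypothetical time-stamped input directories; in practice this list
# would come from `hadoop fs -ls` on the input root.
dirs='/data/events/2014-04-13
/data/events/2014-04-14
/data/events/2014-04-15'
# ISO-8601 names sort lexicographically, so the last one is the newest.
latest=$(printf '%s\n' "$dirs" | sort | tail -n 1)
echo "$latest"
```

A fixed CLI flag can't express this kind of glue; a script shell makes it a one-liner before the method call.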
So the question is always about how we connect building blocks, how we do customized (cross-)validation rounds, and so on. I think we've heard that consistently. So the main winning argument here is that the programming environment is primary and everything else is secondary. Supporting notions are that the environment should be an existing, accepted one with a sufficient third-party following rather than a new one (i.e. Scala in our case), and that there should be no mix of environments (such as the Pig/Pig-UDF conundrum).

Sure, just to try things out, one wants to call a method with predefined input and output locations. But as soon as the "kicking the tires" stage ends, one wants to do tons of other things before and after the method (e.g. grabbing the latest time-stamped HDFS input rather than a predefined hard-coded constant), or even to combine a bunch of methods (e.g. an LSA pipeline).

Assuming we operate on a constrained resource schedule, I'd just go after the prime priorities first. I would not oppose somebody spending time building CLIs and a CLI-based tutorial, of course -- I just don't think we realistically have people willing to do that.

On Tue, Apr 15, 2014 at 11:14 AM, Dmitriy Lyubimov <[email protected]> wrote:

> On Tue, Apr 15, 2014 at 10:58 AM, Pat Ferrel <[email protected]> wrote:
>
>> Sorry you are sick. Thanks for the tip. Spark has a client launcher
>> method "spark-class …Client launch ..." but I’m not having much success
>> with that.
>
> This will not work because you need Mahout's classpath too. And Spark's.
> The complexity here is the damn jar dependencies. Anything the Spark (or
> Hadoop, for that matter) CLIs do assumes that the application is so simple
> it fits into a single jar and has zero external dependencies. I could do
> my own rant about that for ages.
>
> So.
> The task here is to collect all Spark jars and their dependencies, merge
> them with Mahout's own -- perhaps filtering in only what is really needed
> in Spark-based pipelines -- and then run. That is what the specialized
> mahoutContext() API does; there's a crapload of Scala code devoted just to
> this single issue of deducing and grabbing dependencies and making sure
> Spark takes them.
>
> Hope this clarifies why the Spark helpers' ways of starting standalone
> Spark applications are just not helpful for us (or anyone, to be frank --
> I've participated in a healthy dozen Spark-based projects, and none of
> them could use helpers like Client or spark-class.sh, for the same reason:
> they all had to do their own bootstrap routine).
>
> So... we will have to have our own helpers for that. I wonder if there's
> similar syntax for Mahout already, something like "mahout run-class
> <class-name>". Since I never used it, I don't know for sure, but Hadoop
> subordinate projects usually all have one (e.g. there's "hbase
> <class-name>" to run any class in the HBase code base with the proper
> classpath dependencies taken care of).
>
>> As to the statement "There is not, nor do I think there will be a way to
>> run this stuff with CLI" -- that seems unduly misleading. Really, does
>> anyone second this?
>>
>> There will be Scala scripts to drive this stuff, and yes, even from the
>> CLI. Do you imagine that every Mahout USER will be a Scala + Mahout DSL
>> programmer? That may be fine for committers, but users will be PHP devs,
>> Ruby devs, Python or Java devs, maybe even a few C# devs. I think you are
>> confusing Mahout DEVS with USERS. Few users are R devs moving into
>> production work; they are production engineers moving into ML who want a
>> black box. They will need a language-agnostic way to drive Mahout. Making
>> statements like this only confuses potential users and drives them away
>> to no purpose.
>> I’m happy for the nascent Mahout-Scala shell, but it’s not in the
>> typical user’s world view.
>>
>> Sorry, end of rant.
>>
>> On Apr 15, 2014, at 10:14 AM, Dmitriy Lyubimov (JIRA) <[email protected]>
>> wrote:
>>
>> [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969763#comment-13969763 ]
>>
>> Dmitriy Lyubimov commented on MAHOUT-1464:
>> ------------------------------------------
>>
>> [My] silence indicates I've been pretty sick :)
>>
>> I thought I explained in my email that we are not planning a CLI; we are
>> planning a script shell instead. There is not, nor do I think there will
>> be, a way to run this stuff with a CLI, just like there's no way to
>> invoke a particular method in R without writing a short script.
>>
>> That said, yes, you can try to run it as a Java application, i.e.
>>
>> [java|scala] -cp <cp> <class name>
>>
>> where <cp> is what `mahout classpath` returns.
>>
>> > Cooccurrence Analysis on Spark
>> > ------------------------------
>> >
>> > Key: MAHOUT-1464
>> > URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>> > Project: Mahout
>> > Issue Type: Improvement
>> > Components: Collaborative Filtering
>> > Environment: hadoop, spark
>> > Reporter: Pat Ferrel
>> > Assignee: Sebastian Schelter
>> > Fix For: 1.0
>> >
>> > Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
>> > MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
>> > run-spark-xrsj.sh
>> >
>> > Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR)
>> > that runs on Spark. This should be compatible with the Mahout Spark
>> > DRM DSL so a DRM can be used as input.
>> > Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence
>> > has several applications including cross-action recommendations.
>>
>> --
>> This message was sent by Atlassian JIRA
>> (v6.2#6252)
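(To illustrate the classpath bootstrap discussed in the thread above -- a minimal sketch of merging Spark's and Mahout's jar lists into a single `-cp` value. The jar names here are hypothetical stand-ins; in practice the lists would come from globbing the distributions' lib/ directories, or from what `mahout classpath` prints:)

```shell
# Hypothetical jar lists standing in for what the Spark and Mahout
# distributions would provide; real code would glob their lib/ dirs.
spark_jars='spark-core_2.10.jar
guava.jar'
mahout_jars='mahout-math.jar
mahout-spark.jar
guava.jar'
# Union the two lists, dropping duplicates, and join with ':' for -cp.
cp=$(printf '%s\n%s\n' "$spark_jars" "$mahout_jars" | sort -u | paste -sd: -)
echo "$cp"
```

One would then launch a driver with `java -cp "$cp" <class name>`, as described in the quoted JIRA comment; the deduplication step is the interesting part, since Spark's and Mahout's dependency trees overlap.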
