No, we probably don't want to create them unless we have someone to assign them to. You are more than welcome to create one if you want to take a stab at any of those.
-d

On Wed, Mar 26, 2014 at 10:28 AM, Saikat Kanjilal <[email protected]> wrote:

> @Dmitry Are there JIRA items created for the wanted pieces? I'd like to
> volunteer to take on the shell and the R bindings, should I create JIRA
> items for these?
>
> > Date: Wed, 26 Mar 2014 10:12:01 -0700
> > Subject: Re: Mahout on Spark
> > From: [email protected]
> > To: [email protected]
> > CC: [email protected]
> >
> > Sure.
> >
> > @Saikat et al:
> >
> > Check out the http://mahout.apache.org/users/sparkbindings/home.html
> > "Wanted" section.
> >
> > Of course, data frames and vectorization(feature prep) standardization is
> > very high priority there.
> > Another high priority is interactive shell /scripting (just like spark
> > shell). Something very similar in R interactive/script runner mode in
> > spirit. It is very important.
> >
> > Re: data frames. Anyone familiar with R, knows what it is. Basically a set
> > of named columnar vectors (with rows named or enumerated as well). A set of
> > filtering/modifying DSL expressions similar to R (I haven't really thought
> > about it at depth). The tricky part here is in-core data frame support of
> > course, since data frames are based on vectors that go beyond just a real
> > (double) values we have right now. in R, vector values could be integral,
> > boolean and character(i.e.string) types as well. If we had an in-core
> > support for that (or borrowed it from somewhere), the rest would have been
> > easy -- it is just a matter of semantic elegance. Really, i suggest to look
> > at R paradigms there, it is a pretty elegant way to work with closures
> > there.
> >
> > Of course we could use off-the-shelf stuff such as Map's to support
> > something named, with string values. I don't know at this point. Scala
> > itself comes a long way to help out here.
> >
> > As for slides, they are of little interest themselves since they mostly
> > re-interpret and summarize the working notes pdf in a bit more palatable
> > way. It is just an opportunity to deliver some content to folks who shy
> > away from reading docs for some reason *wink wink*. I will put them on the
> > site after meetup if it is ok.
> >
> > On Wed, Mar 26, 2014 at 9:09 AM, Saikat Kanjilal <[email protected]> wrote:
> >
> > > +1, in fact I would be very much indebted if someone (namely Dmitry :) )
> > > could do a google hangout focused on spark where folks can ask questions
> > > and learn more, to this end I want to bring up something else, it'd be
> > > great if mahout itself either through the apache project foundation or
> > > through committer means have a hadoop cluster to test algorithms, it seems
> > > like folks have their own cluster to test on but I think it'd be a benefit
> > > to the community to have a cluster that everyone can leverage.
> > >
> > > > Subject: Mahout on Spark
> > > > From: [email protected]
> > > > Date: Wed, 26 Mar 2014 09:05:02 -0700
> > > > To: [email protected]; [email protected]
> > > >
> > > > New name for a new thread.
> > > >
> > > > A lot of the discussion on MAHOUT-1464 has been around integrating that
> > > > feature with the Scala DSL. As Saikat says this is of general interest
> > > > since people seem to agree that this is a good place to integrate efforts.
> > > >
> > > > I'm interested in what I think Dmitriy called data frames. Being a
> > > > complete noob on Spark I may have gotten this wrong but let me take a shot
> > > > so he can correct me.
> > > >
> > > > There are a lot of problems that require a pipeline. The text input
> > > > pipeline is an example, but almost any input to Mahout requires at least an
> > > > id translation step. What I though Dmitriy was suggesting was that by
> > > > avoiding the disk write + read between steps we might get significant
> > > > speedups. This has many implications, I'm sure.
> > > >
> > > > For one I think it means the non-serialized objects are being used by
> > > > multiple parts of the pipeline and so are not subject to "translation".
> > > >
> > > > Dmitriy can you explain more? You mentioned a talk you have given, do
> > > > you have slides somewhere or a PDF?
> > > >
> > > > On Mar 26, 2014, at 7:15 AM, Ted Dunning <[email protected]> wrote:
> > > >
> > > > It would be great to have you.
> > > >
> > > > (go ahead and start new threads when appropriate ... better than
> > > > hijacking)
> > > >
> > > > On Wed, Mar 26, 2014 at 6:00 AM, Hardik Pandya <[email protected]> wrote:
> > > >
> > > > > Sorry to hijack the thread,
> > > > >
> > > > > this seems like first steps of mahout geeting it to work on spark
> > > > >
> > > > > there are similar efforts going on with R+Spark aka Spark R
> > > > >
> > > > > not sure if this helpos, played with spark ec2 scripts and it brings up
> > > > > multinode cluster using mesos and its configurable - willing to contribute
> > > > > donations for mahout-dev
> > > > >
> > > > > On Sun, Mar 23, 2014 at 11:22 PM, Saikat Kanjilal (JIRA) <[email protected]> wrote:
> > > > >
> > > > >> [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944710#comment-13944710 ]
> > > > >>
> > > > >> Saikat Kanjilal commented on MAHOUT-1464:
> > > > >> -----------------------------------------
> > > > >>
> > > > >> +1 on Andrew's suggestion on using AWS to do this. Andrew is it possible
> > > > >> to have a shared account so mahout contributors can use this, I'd even be
> > > > >> willing to chip in donations :) to have a shared AWS account
> > > > >>
> > > > >>> RowSimilarityJob on Spark
> > > > >>> -------------------------
> > > > >>>
> > > > >>> Key: MAHOUT-1464
> > > > >>> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> > > > >>> Project: Mahout
> > > > >>> Issue Type: Improvement
> > > > >>> Components: Collaborative Filtering
> > > > >>> Affects Versions: 0.9
> > > > >>> Environment: hadoop, spark
> > > > >>> Reporter: Pat Ferrel
> > > > >>> Labels: performance
> > > > >>> Fix For: 1.0
> > > > >>>
> > > > >>> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch
> > > > >>>
> > > > >>> Create a version of RowSimilarityJob that runs on Spark. Ssc has a
> > > > >>> prototype here: https://gist.github.com/sscdotopen/8314254. This should
> > > > >>> be compatible with Mahout Spark DRM DSL so a DRM can be used as input.
> > > > >>> Ideally this would extend to cover MAHOUT-1422 which is a feature
> > > > >>> request for RSJ on two inputs to calculate the similarity of rows of one
> > > > >>> DRM with those of another. This cross-similarity has several applications
> > > > >>> including cross-action recommendations.
> > > > >>
> > > > >> --
> > > > >> This message was sent by Atlassian JIRA
> > > > >> (v6.2#6252)
