On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <[email protected]> wrote:

> I think i need functionality to add more jars (or external hadoop-jar)
> to drive that from an R package. Just setting job jar by class is not
> enough. I can push overall job-jar as an additional jar to R package;
> however, i cannot really run hadoop command line on it, i need to set
> up classpath thru RJava.
>
> Traditional single hadoop job jar is unlikely to work here since we
> cannot hardcode pipelines in java code but rather have to construct
> them on the fly. (well, we could serialize pipeline definitions from R
> and then replay them in a driver -- but that's too cumbersome and more
> work than it has to be.) There's no reason why i shouldn't be able to
> do pig-like "register jar" or "setJobJar" (mahout-like) when kicking
> off a pipeline.
>

o.a.c.util.DistCache.addJarToDistributedCache?

>
> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <[email protected]> wrote:
> > Ok, sounds very promising...
> >
> > i'll try to start digging on the driver part this week then (Pipeline
> > wrapper in R5).
> >
> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <[email protected]> wrote:
> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >>> Ok, cool.
> >>>
> >>> So what state is Crunch in? I take it it is in a fairly advanced state.
> >>> So every api mentioned in the FlumeJava paper is working, right? Or
> >>> is there something specific that is not working?
> >>
> >> I think the only thing in the paper that we don't have in a working
> >> state is MSCR fusion. It's mostly just a question of prioritizing it
> >> and getting the work done.
> >>
> >>>
> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <[email protected]> wrote:
> >>>> Hey Dmitriy,
> >>>>
> >>>> Got a fork going and looking forward to playing with crunchR this
> >>>> weekend-- thanks!
> >>>>
> >>>> J
> >>>>
> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >>>>
> >>>>> Project template https://github.com/dlyubimov/crunchR
> >>>>>
> >>>>> Default profile does not compile R artifact. R profile compiles R
> >>>>> artifact. For convenience, it is enabled by supplying -DR to mvn
> >>>>> command line, e.g.
> >>>>>
> >>>>> mvn install -DR
> >>>>>
> >>>>> there's also a helper that installs the snapshot version of the
> >>>>> package in the crunchR module.
> >>>>>
> >>>>> There are RJava and JRI java dependencies which i did not find anywhere
> >>>>> in public maven repos; so they are installed into my github maven repo
> >>>>> so far. Should compile for 3rd party.
> >>>>>
> >>>>> -DR compilation requires R, RJava and optionally, RProtoBuf. R Doc
> >>>>> compilation requires roxygen2 (i think).
> >>>>>
> >>>>> For some reason RProtoBuf fails to import into another package, got a
> >>>>> weird exception when i put @import RProtoBuf into crunchR, so
> >>>>> RProtoBuf is now in "Suggests" category. Down the road that may be a
> >>>>> problem though...
> >>>>>
> >>>>> other than the template, not much else has been done so far... finding
> >>>>> hadoop libraries and adding them to the package path on initialization
> >>>>> via "hadoop classpath"... adding Crunch jars and its non-"provided"
> >>>>> transitives to the crunchR's java part...
> >>>>>
> >>>>> No legal stuff...
> >>>>>
> >>>>> No readmes... complete stealth at this point.
> >>>>>
> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >>>>> > Ok, cool. I will try to roll project template by some time next week.
> >>>>> > we can start with prototyping and benchmarking something really
> >>>>> > simple, such as parallelDo().
> >>>>> >
> >>>>> > My interim goal is to perhaps take some more or less simple algorithm
> >>>>> > from Mahout and demonstrate it can be solved with Rcrunch (or whatever
> >>>>> > name it has to be) in a comparable time (performance) but with much
> >>>>> > fewer lines of code. (say one of factorization or clustering things)
> >>>>> >
> >>>>> >
> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <[email protected]> wrote:
> >>>>> >> I am not much of an R user but I am interested to see how well we can
> >>>>> >> integrate the two. I would be happy to help.
> >>>>> >>
> >>>>> >> regards,
> >>>>> >> Rahul
> >>>>> >>
> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
> >>>>> >>>
> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >>>>> >>>>
> >>>>> >>>> Yep, ok.
> >>>>> >>>>
> >>>>> >>>> I imagine it has to be an R module so I can set up a maven project
> >>>>> >>>> with java/R code tree (I have been doing that a lot lately). Or if you
> >>>>> >>>> have a template to look at, it would be useful i guess too.
> >>>>> >>>
> >>>>> >>> No, please go right ahead.
> >>>>> >>>
> >>>>> >>>>
> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <[email protected]> wrote:
> >>>>> >>>>>
> >>>>> >>>>> I'd like it to be separate at first, but I am happy to help. Github
> >>>>> >>>>> repo?
> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <[email protected]> wrote:
> >>>>> >>>>>
> >>>>> >>>>>> Ok maybe there's a benefit to try a JRI/RJava prototype on top of
> >>>>> >>>>>> Crunch for something simple. This should both save time and prove or
> >>>>> >>>>>> disprove if Crunch via RJava integration is viable.
> >>>>> >>>>>>
> >>>>> >>>>>> On my part i can try to do it within Crunch framework or we can keep
> >>>>> >>>>>> it completely separate.
> >>>>> >>>>>>
> >>>>> >>>>>> -d
> >>>>> >>>>>>
> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <[email protected]> wrote:
> >>>>> >>>>>>>
> >>>>> >>>>>>> I am an avid R user and would be into it-- who gave the talk? Was it
> >>>>> >>>>>>> Murray Stokely?
> >>>>> >>>>>>>
> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >>>>> >>>>>>>>
> >>>>> >>>>>>>> Hello,
> >>>>> >>>>>>>>
> >>>>> >>>>>>>> I was pretty excited to learn of Google's experience of R mapping of
> >>>>> >>>>>>>> flume java at one of the recent BARUGs. I think a lot of applications
> >>>>> >>>>>>>> similar to what we do in Mahout could be prototyped using flume R.
> >>>>> >>>>>>>>
> >>>>> >>>>>>>> I did not quite get the details of Google's implementation of R
> >>>>> >>>>>>>> mapping, but i am not sure if just a direct mapping from R to Crunch
> >>>>> >>>>>>>> would be sufficient (and, for the most part, efficient). RJava/JRI and
> >>>>> >>>>>>>> jni seem to be a pretty terrible performer to do that directly.
> >>>>> >>>>>>>>
> >>>>> >>>>>>>>
> >>>>> >>>>>>>> on top of it, I am thinking if this project could have a contributed
> >>>>> >>>>>>>> adapter to Mahout's distributed matrices, that would be just a very
> >>>>> >>>>>>>> good synergy.
> >>>>> >>>>>>>>
> >>>>> >>>>>>>> Is there anyone interested in contributing/advising for open source
> >>>>> >>>>>>>> version of flume R support? Just gauging interest, Crunch list seems
> >>>>> >>>>>>>> like a natural place to poke.
> >>>>> >>>>>>>>
> >>>>> >>>>>>>> Thanks.
> >>>>> >>>>>>>>
> >>>>> >>>>>>>> -Dmitriy
> >>>>> >>>>>>>
> >>>>> >>>>>>> --
> >>>>> >>>>>>> Director of Data Science
> >>>>> >>>>>>> Cloudera
> >>>>> >>>>>>> Twitter: @josh_wills
> >>>>> >>>
> >>>>> >>
> >>>>>
> >>>>
> >>>> --
> >>>> Director of Data Science
> >>>> Cloudera <http://www.cloudera.com>
> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
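
For reference, a minimal sketch of the classpath step discussed in the thread (finding the Hadoop libraries via "hadoop classpath" before adding them to the JVM that rJava starts). It assumes the command's output can contain wildcard entries such as /usr/lib/hadoop/lib/* and that those need to be expanded into individual jar paths before being added to an already running JVM. The class name ClasspathExpander and the method expand are hypothetical names for illustration only; they are not part of crunchR, Crunch, Hadoop or rJava.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of crunchR, Crunch, Hadoop or rJava. */
public class ClasspathExpander {

    /**
     * Splits a classpath string on the given separator and replaces each
     * entry ending in "*" with the ".jar" files found in that directory.
     * Other entries are kept unchanged and the original order is preserved.
     */
    public static List<String> expand(String classpath, String separator) {
        List<String> entries = new ArrayList<>();
        for (String raw : classpath.split(Pattern.quote(separator))) {
            String entry = raw.trim();
            if (entry.isEmpty()) {
                continue; // skip empty segments such as "a::b"
            }
            if (entry.endsWith("*")) {
                // Assumption: a trailing "*" means "all jars in this directory".
                File dir = new File(entry.substring(0, entry.length() - 1));
                File[] jars = dir.listFiles((d, name) -> name.endsWith(".jar"));
                if (jars != null) { // null when the directory does not exist
                    Arrays.sort(jars); // deterministic order
                    for (File jar : jars) {
                        entries.add(jar.getPath());
                    }
                }
            } else {
                entries.add(entry);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // Pass the output of "hadoop classpath" as the first argument;
        // falls back to this JVM's own classpath when none is given.
        String cp = args.length > 0 ? args[0] : System.getProperty("java.class.path");
        for (String entry : expand(cp, File.pathSeparator)) {
            System.out.println(entry);
        }
    }
}
```

The resulting list could then be handed to whatever adds entries to the JVM classpath on the R side; jars that also have to reach the cluster would still need something like the DistCache call suggested above.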
