Yes... I think it worked for me before. Just adding all the jars from the R package distribution would be a somewhat more appropriate approach, but it creates a problem with jars in dependent R packages. I think it would be much easier to just compile a hadoop-job file and stick it in, rather than cherry-picking individual jars from who knows how many locations.
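For concreteness, the hadoop-job layout can be inspected with a few lines of plain Java. This is only a rough sketch, assuming the job jar keeps its dependencies under lib/ inside the archive; the names JobJarLibs and listLibJars are made up for illustration and are not part of Crunch or Hadoop.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper: not a Crunch or Hadoop class.
public class JobJarLibs {

    /**
     * Returns the sorted entry names of the jars bundled under lib/
     * in a hadoop-job style jar (assumed layout: lib/*.jar).
     */
    public static List<String> listLibJars(Path jobJar) throws IOException {
        List<String> libs = new ArrayList<>();
        try (JarFile jar = new JarFile(jobJar.toFile())) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                String name = entry.getName();
                // Keep only nested jar files directly or indirectly under lib/.
                if (!entry.isDirectory() && name.startsWith("lib/") && name.endsWith(".jar")) {
                    libs.add(name);
                }
            }
        }
        Collections.sort(libs);
        return libs;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: JobJarLibs <job.jar>");
            return;
        }
        for (String lib : listLibJars(Paths.get(args[0]))) {
            System.out.println(lib);
        }
    }
}
```

If the target API only accepts plain jars, the listed entries could in principle be extracted and registered one by one, though whether that is less work than cherry-picking is an open question.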
I think I used the hadoop-job format with the distributed cache before and it worked... at least with Pig's "register jar" functionality. OK, I guess I will just try it and see if it works.

On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <[email protected]> wrote:

> On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> Great! so it is in Crunch.
>>
>> does it support hadoop-job jar format or only pure java jars?
>>
>
> I think just pure jars-- you're referring to hadoop-job format as having
> all the dependencies in a lib/ directory within the jar?
>
>
>>
>> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <[email protected]> wrote:
>> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> >
>> >> I think i need functionality to add more jars (or external hadoop-jar)
>> >> to drive that from an R package. Just setting job jar by class is not
>> >> enough. I can push overall job-jar as an additional jar to R package;
>> >> however, i cannot really run hadoop command line on it, i need to set
>> >> up classpath thru RJava.
>> >>
>> >> Traditional single hadoop job jar will unlikely work here since we
>> >> cannot hardcode pipelines in java code but rather have to construct
>> >> them on the fly. (well, we could serialize pipeline definitions from R
>> >> and then replay them in a driver -- but that's too cumbersome and more
>> >> work than it has to be.) There's no reason why i shouldn't be able to
>> >> do pig-like "register jar" or "setJobJar" (mahout-like) when kicking
>> >> off a pipeline.
>> >>
>> >
>> > o.a.c.util.DistCache.addJarToDistributedCache?
>> >
>> >
>> >>
>> >>
>> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <[email protected]> wrote:
>> >> > Ok, sounds very promising...
>> >> >
>> >> > i'll try to start digging on the driver part this week then (Pipeline
>> >> > wrapper in R5).
>> >> >
>> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <[email protected]> wrote:
>> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> >> >>> Ok, cool.
>> >> >>>
>> >> >>> So what state is Crunch in? I take it is in a fairly advanced state.
>> >> >>> So every api mentioned in the FlumeJava paper is working, right? Or
>> >> >>> there's something that is not working specifically?
>> >> >>
>> >> >> I think the only thing in the paper that we don't have in a working
>> >> >> state is MSCR fusion. It's mostly just a question of prioritizing it
>> >> >> and getting the work done.
>> >> >>
>> >> >>>
>> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <[email protected]> wrote:
>> >> >>>> Hey Dmitriy,
>> >> >>>>
>> >> >>>> Got a fork going and looking forward to playing with crunchR this weekend--
>> >> >>>> thanks!
>> >> >>>>
>> >> >>>> J
>> >> >>>>
>> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> >> >>>>
>> >> >>>>> Project template https://github.com/dlyubimov/crunchR
>> >> >>>>>
>> >> >>>>> Default profile does not compile R artifact. R profile compiles R
>> >> >>>>> artifact. for convenience, it is enabled by supplying -DR to mvn
>> >> >>>>> command line, e.g.
>> >> >>>>>
>> >> >>>>> mvn install -DR
>> >> >>>>>
>> >> >>>>> there's also a helper that installs the snapshot version of the
>> >> >>>>> package in the crunchR module.
>> >> >>>>>
>> >> >>>>> There's RJava and JRI java dependencies which i did not find anywhere
>> >> >>>>> in public maven repos; so it is installed into my github maven repo so
>> >> >>>>> far. Should compile for 3rd party.
>> >> >>>>>
>> >> >>>>> -DR compilation requires R, RJava and optionally, RProtoBuf. R Doc
>> >> >>>>> compilation requires roxygen2 (i think).
>> >> >>>>>
>> >> >>>>> For some reason RProtoBuf fails to import into another package, got a
>> >> >>>>> weird exception when i put @import RProtoBuf into crunchR, so
>> >> >>>>> RProtoBuf is now in "Suggests" category. Down the road that may be a
>> >> >>>>> problem though...
>> >> >>>>>
>> >> >>>>> other than the template, not much else has been done so far... finding
>> >> >>>>> hadoop libraries and adding it to the package path on initialization
>> >> >>>>> via "hadoop classpath"... adding Crunch jars and its non-"provided"
>> >> >>>>> transitives to the crunchR's java part...
>> >> >>>>>
>> >> >>>>> No legal stuff...
>> >> >>>>>
>> >> >>>>> No readmes... complete stealth at this point.
>> >> >>>>>
>> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> >> >>>>> > Ok, cool. I will try to roll project template by some time next week.
>> >> >>>>> > we can start with prototyping and benchmarking something really
>> >> >>>>> > simple, such as parallelDo().
>> >> >>>>> >
>> >> >>>>> > My interim goal is to perhaps take some more or less simple algorithm
>> >> >>>>> > from Mahout and demonstrate it can be solved with Rcrunch (or whatever
>> >> >>>>> > name it has to be) in a comparable time (performance) but with much
>> >> >>>>> > fewer lines of code. (say one of factorization or clustering things)
>> >> >>>>> >
>> >> >>>>> >
>> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <[email protected]> wrote:
>> >> >>>>> >> I am not much of R user but I am interested to see how well we can
>> >> >>>>> >> integrate the two. I would be happy to help.
>> >> >>>>> >>
>> >> >>>>> >> regards,
>> >> >>>>> >> Rahul
>> >> >>>>> >>
>> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
>> >> >>>>> >>>
>> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> >> >>>>> >>>>
>> >> >>>>> >>>> Yep, ok.
>> >> >>>>> >>>>
>> >> >>>>> >>>> I imagine it has to be an R module so I can set up a maven project
>> >> >>>>> >>>> with java/R code tree (I have been doing that a lot lately). Or if you
>> >> >>>>> >>>> have a template to look at, it would be useful i guess too.
>> >> >>>>> >>>
>> >> >>>>> >>> No, please go right ahead.
>> >> >>>>> >>>
>> >> >>>>> >>>>
>> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <[email protected]> wrote:
>> >> >>>>> >>>>>
>> >> >>>>> >>>>> I'd like it to be separate at first, but I am happy to help. Github
>> >> >>>>> >>>>> repo?
>> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <[email protected]> wrote:
>> >> >>>>> >>>>>
>> >> >>>>> >>>>>> Ok maybe there's a benefit to try a JRI/RJava prototype on top of
>> >> >>>>> >>>>>> Crunch for something simple. This should both save time and prove or
>> >> >>>>> >>>>>> disprove if Crunch via RJava integration is viable.
>> >> >>>>> >>>>>>
>> >> >>>>> >>>>>> On my part i can try to do it within Crunch framework or we can keep
>> >> >>>>> >>>>>> it completely separate.
>> >> >>>>> >>>>>>
>> >> >>>>> >>>>>> -d
>> >> >>>>> >>>>>>
>> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <[email protected]> wrote:
>> >> >>>>> >>>>>>>
>> >> >>>>> >>>>>>> I am an avid R user and would be into it-- who gave the talk? Was it
>> >> >>>>> >>>>>>> Murray Stokely?
>> >> >>>>> >>>>>>>
>> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> >> >>>>> >>>>>>>>
>> >> >>>>> >>>>>>>> Hello,
>> >> >>>>> >>>>>>>>
>> >> >>>>> >>>>>>>> I was pretty excited to learn of Google's experience of R mapping of
>> >> >>>>> >>>>>>>> flume java on one of recent BARUGs. I think a lot of applications
>> >> >>>>> >>>>>>>> similar to what we do in Mahout could be prototyped using flume R.
>> >> >>>>> >>>>>>>>
>> >> >>>>> >>>>>>>> I did not quite get the details of Google implementation of R
>> >> >>>>> >>>>>>>> mapping,
>> >> >>>>> >>>>>>>> but i am not sure if just a direct mapping from R to Crunch would be
>> >> >>>>> >>>>>>>> sufficient (and, for most part, efficient). RJava/JRI and jni seem to
>> >> >>>>> >>>>>>>> be a pretty terrible performer to do that directly.
>> >> >>>>> >>>>>>>>
>> >> >>>>> >>>>>>>>
>> >> >>>>> >>>>>>>> on top of it, I am thinking if this project could have a contributed
>> >> >>>>> >>>>>>>> adapter to Mahout's distributed matrices, that would be just a very
>> >> >>>>> >>>>>>>> good synergy.
>> >> >>>>> >>>>>>>>
>> >> >>>>> >>>>>>>> Is there anyone interested in contributing/advising for open source
>> >> >>>>> >>>>>>>> version of flume R support? Just gauging interest, Crunch list seems
>> >> >>>>> >>>>>>>> like a natural place to poke.
>> >> >>>>> >>>>>>>>
>> >> >>>>> >>>>>>>> Thanks.
>> >> >>>>> >>>>>>>>
>> >> >>>>> >>>>>>>> -Dmitriy
>> >> >>>>> >>>>>>>
>> >> >>>>> >>>>>>>
>> >> >>>>> >>>>>>>
>> >> >>>>> >>>>>>> --
>> >> >>>>> >>>>>>> Director of Data Science
>> >> >>>>> >>>>>>> Cloudera
>> >> >>>>> >>>>>>> Twitter: @josh_wills
>> >> >>>>> >>>
>> >> >>>>> >>>
>> >> >>>>> >>>
>> >> >>>>> >>
>> >> >>>>>
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>> --
>> >> >>>> Director of Data Science
>> >> >>>> Cloudera <http://www.cloudera.com>
>> >> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>> >>
>> >
>> >
>> >
>> > --
>> > Director of Data Science
>> > Cloudera <http://www.cloudera.com>
>> > Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
