So Java tasks need to be able to load libjri.so from
whatever system.file("jri", package="rJava") says.

Traditionally, these issues were handled with -Djava.library.path. Apparently
there's nothing a Java task can do to make loadLibrary() see the damn library
once the JVM has started. But -Djava.library.path requires the nodes to
configure the JVM command line and lock it against modification by the client,
which is fine. I also discovered that LD_LIBRARY_PATH actually works with
JRE 1.6 (again).

But... any other suggestions about best practice for configuring Crunch to run
a user's .so's?

Thanks.

On Sun, Nov 11, 2012 at 1:41 PM, Josh Wills <[email protected]> wrote:

> I believe that is a safe assumption, at least right now.
>
>
> On Sun, Nov 11, 2012 at 1:38 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> > Question.
> >
> > So in Crunch api, initialize() doesn't get an emitter. and the process gets
> > emitter every time.
> >
> > However, my guess any single reincarnation of a DoFn object in the backend
> > will always be getting the same emitter thru its lifecycle. Is it an
> > admissible assumption or there's currently a counter example to that?
> >
> > The problem is that as i implement the two way pipeline of input and
> > emitter data between R and Java, I am bulking these calls together for
> > performance reasons. Each individual datum in these chunks of data will not
> > have attached emitter function information to them in any way. (well it
> > could but it would be a performance killer and i bet emitter never
> > changes).
> >
> > So, thoughts? can i assume emitter never changes between first and last
> > call to DoFn instance?
> >
> > thanks.
> >
> >
> > On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> > > yes...
> > >
> > > i think it worked for me before, although just adding all jars from R
> > > package distribution would be a little bit more appropriate approach
> > > -- but it creates a problem with jars in dependent R packages.
> > > I think it would be much easier to just compile a hadoop-job file and
> > > stick it in rather than doing cherry-picking of individual jars from
> > > who knows how many locations.
> > >
> > > i think i used the hadoop job format with distributed cache before and
> > > it worked... at least with Pig "register jar" functionality.
> > >
> > > ok i guess i will just try if it works.
> > >
> > > On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <[email protected]> wrote:
> > > > On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > >
> > > >> Great! so it is in Crunch.
> > > >>
> > > >> does it support hadoop-job jar format or only pure java jars?
> > > >>
> > > >
> > > > I think just pure jars-- you're referring to hadoop-job format as having
> > > > all the dependencies in a lib/ directory within the jar?
> > > >
> > > >
> > > >>
> > > >> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <[email protected]> wrote:
> > > >> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > >> >
> > > >> >> I think i need functionality to add more jars (or external hadoop-jar)
> > > >> >> to drive that from an R package. Just setting job jar by class is not
> > > >> >> enough. I can push overall job-jar as an additional jar to R package;
> > > >> >> however, i cannot really run hadoop command line on it, i need to set
> > > >> >> up classpath thru RJava.
> > > >> >>
> > > >> >> Traditional single hadoop job jar will unlikely work here since we
> > > >> >> cannot hardcode pipelines in java code but rather have to construct
> > > >> >> them on the fly. (well, we could serialize pipeline definitions from R
> > > >> >> and then replay them in a driver -- but that's too cumbersome and more
> > > >> >> work than it has to be.)
> > > >> >> There's no reason why i shouldn't be able to
> > > >> >> do pig-like "register jar" or "setJobJar" (mahout-like) when kicking
> > > >> >> off a pipeline.
> > > >> >>
> > > >> >
> > > >> > o.a.c.util.DistCache.addJarToDistributedCache?
> > > >> >
> > > >> >
> > > >> >>
> > > >> >>
> > > >> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <[email protected]> wrote:
> > > >> >> > Ok, sounds very promising...
> > > >> >> >
> > > >> >> > i'll try to start digging on the driver part this week then (Pipeline
> > > >> >> > wrapper in R5).
> > > >> >> >
> > > >> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <[email protected]> wrote:
> > > >> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > >> >> >>> Ok, cool.
> > > >> >> >>>
> > > >> >> >>> So what state is Crunch in? I take it is in a fairly advanced state.
> > > >> >> >>> So every api mentioned in the FlumeJava paper is working, right? Or
> > > >> >> >>> there's something that is not working specifically?
> > > >> >> >>
> > > >> >> >> I think the only thing in the paper that we don't have in a working
> > > >> >> >> state is MSCR fusion. It's mostly just a question of prioritizing it
> > > >> >> >> and getting the work done.
> > > >> >> >>
> > > >> >> >>>
> > > >> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <[email protected]> wrote:
> > > >> >> >>>> Hey Dmitriy,
> > > >> >> >>>>
> > > >> >> >>>> Got a fork going and looking forward to playing with crunchR this
> > > >> >> >>>> weekend-- thanks!
> > > >> >> >>>>
> > > >> >> >>>> J
> > > >> >> >>>>
> > > >> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > >> >> >>>>
> > > >> >> >>>>> Project template https://github.com/dlyubimov/crunchR
> > > >> >> >>>>>
> > > >> >> >>>>> Default profile does not compile R artifact. R profile compiles R
> > > >> >> >>>>> artifact. for convenience, it is enabled by supplying -DR to mvn
> > > >> >> >>>>> command line, e.g.
> > > >> >> >>>>>
> > > >> >> >>>>> mvn install -DR
> > > >> >> >>>>>
> > > >> >> >>>>> there's also a helper that installs the snapshot version of the
> > > >> >> >>>>> package in the crunchR module.
> > > >> >> >>>>>
> > > >> >> >>>>> There's RJava and JRI java dependencies which i did not find anywhere
> > > >> >> >>>>> in public maven repos; so it is installed into my github maven repo so
> > > >> >> >>>>> far. Should compile for 3rd party.
> > > >> >> >>>>>
> > > >> >> >>>>> -DR compilation requires R, RJava and optionally, RProtoBuf. R Doc
> > > >> >> >>>>> compilation requires roxygen2 (i think).
> > > >> >> >>>>>
> > > >> >> >>>>> For some reason RProtoBuf fails to import into another package, got a
> > > >> >> >>>>> weird exception when i put @import RProtoBuf into crunchR, so
> > > >> >> >>>>> RProtoBuf is now in "Suggests" category. Down the road that may be a
> > > >> >> >>>>> problem though...
> > > >> >> >>>>>
> > > >> >> >>>>> other than the template, not much else has been done so far... finding
> > > >> >> >>>>> hadoop libraries and adding it to the package path on initialization
> > > >> >> >>>>> via "hadoop classpath"... adding Crunch jars and its non-"provided"
> > > >> >> >>>>> transitives to the crunchR's java part...
> > > >> >> >>>>>
> > > >> >> >>>>> No legal stuff...
> > > >> >> >>>>>
> > > >> >> >>>>> No readmes...
> > > >> >> >>>>> complete stealth at this point.
> > > >> >> >>>>>
> > > >> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > >> >> >>>>> > Ok, cool. I will try to roll project template by some time next week.
> > > >> >> >>>>> > we can start with prototyping and benchmarking something really
> > > >> >> >>>>> > simple, such as parallelDo().
> > > >> >> >>>>> >
> > > >> >> >>>>> > My interim goal is to perhaps take some more or less simple algorithm
> > > >> >> >>>>> > from Mahout and demonstrate it can be solved with Rcrunch (or whatever
> > > >> >> >>>>> > name it has to be) in a comparable time (performance) but with much
> > > >> >> >>>>> > fewer lines of code. (say one of factorization or clustering things)
> > > >> >> >>>>> >
> > > >> >> >>>>> >
> > > >> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <[email protected]> wrote:
> > > >> >> >>>>> >> I am not much of R user but I am interested to see how well we can
> > > >> >> >>>>> >> integrate the two. I would be happy to help.
> > > >> >> >>>>> >>
> > > >> >> >>>>> >> regards,
> > > >> >> >>>>> >> Rahul
> > > >> >> >>>>> >>
> > > >> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
> > > >> >> >>>>> >>>
> > > >> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > >> >> >>>>> >>>>
> > > >> >> >>>>> >>>> Yep, ok.
> > > >> >> >>>>> >>>>
> > > >> >> >>>>> >>>> I imagine it has to be an R module so I can set up a maven project
> > > >> >> >>>>> >>>> with java/R code tree (I have been doing that a lot lately). Or if you
> > > >> >> >>>>> >>>> have a template to look at, it would be useful i guess too.
> > > >> >> >>>>> >>>
> > > >> >> >>>>> >>> No, please go right ahead.
> > > >> >> >>>>> >>>
> > > >> >> >>>>> >>>>
> > > >> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <[email protected]> wrote:
> > > >> >> >>>>> >>>>>
> > > >> >> >>>>> >>>>> I'd like it to be separate at first, but I am happy to help. Github
> > > >> >> >>>>> >>>>> repo?
> > > >> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <[email protected]> wrote:
> > > >> >> >>>>> >>>>>
> > > >> >> >>>>> >>>>>> Ok maybe there's a benefit to try a JRI/RJava prototype on top of
> > > >> >> >>>>> >>>>>> Crunch for something simple. This should both save time and prove or
> > > >> >> >>>>> >>>>>> disprove if Crunch via RJava integration is viable.
> > > >> >> >>>>> >>>>>>
> > > >> >> >>>>> >>>>>> On my part i can try to do it within Crunch framework or we can keep
> > > >> >> >>>>> >>>>>> it completely separate.
> > > >> >> >>>>> >>>>>>
> > > >> >> >>>>> >>>>>> -d
> > > >> >> >>>>> >>>>>>
> > > >> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <[email protected]> wrote:
> > > >> >> >>>>> >>>>>>>
> > > >> >> >>>>> >>>>>>> I am an avid R user and would be into it-- who gave the talk? Was it
> > > >> >> >>>>> >>>>>>> Murray Stokely?
> > > >> >> >>>>> >>>>>>>
> > > >> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > >> >> >>>>> >>>>>>>>
> > > >> >> >>>>> >>>>>>>> Hello,
> > > >> >> >>>>> >>>>>>>>
> > > >> >> >>>>> >>>>>>>> I was pretty excited to learn of Google's experience of R mapping of
> > > >> >> >>>>> >>>>>>>> flume java on one of recent BARUGs.
> > > >> >> >>>>> >>>>>>>> I think a lot of applications
> > > >> >> >>>>> >>>>>>>> similar to what we do in Mahout could be prototyped using flume R.
> > > >> >> >>>>> >>>>>>>>
> > > >> >> >>>>> >>>>>>>> I did not quite get the details of Google implementation of R mapping,
> > > >> >> >>>>> >>>>>>>> but i am not sure if just a direct mapping from R to Crunch would be
> > > >> >> >>>>> >>>>>>>> sufficient (and, for most part, efficient). RJava/JRI and jni seem to
> > > >> >> >>>>> >>>>>>>> be a pretty terrible performer to do that directly.
> > > >> >> >>>>> >>>>>>>>
> > > >> >> >>>>> >>>>>>>>
> > > >> >> >>>>> >>>>>>>> on top of it, I am thinking if this project could have a contributed
> > > >> >> >>>>> >>>>>>>> adapter to Mahout's distributed matrices, that would be just a very
> > > >> >> >>>>> >>>>>>>> good synergy.
> > > >> >> >>>>> >>>>>>>>
> > > >> >> >>>>> >>>>>>>> Is there anyone interested in contributing/advising for open source
> > > >> >> >>>>> >>>>>>>> version of flume R support? Just gauging interest, Crunch list seems
> > > >> >> >>>>> >>>>>>>> like a natural place to poke.
> > > >> >> >>>>> >>>>>>>>
> > > >> >> >>>>> >>>>>>>> Thanks.
> > > >> >> >>>>> >>>>>>>>
> > > >> >> >>>>> >>>>>>>> -Dmitriy
> > > >> >> >>>>> >>>>>>>
> > > >> >> >>>>> >>>>>>>
> > > >> >> >>>>> >>>>>>>
> > > >> >> >>>>> >>>>>>> --
> > > >> >> >>>>> >>>>>>> Director of Data Science
> > > >> >> >>>>> >>>>>>> Cloudera
> > > >> >> >>>>> >>>>>>> Twitter: @josh_wills
> > > >> >> >>>>> >>>
> > > >> >> >>>>> >>
> > > >> >> >>>>>
> > > >> >> >>>>
> > > >> >> >>>> --
> > > >> >> >>>> Director of Data Science
> > > >> >> >>>> Cloudera <http://www.cloudera.com>
> > > >> >> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
> > > >> >>
> > > >> >
> > > >> > --
> > > >> > Director of Data Science
> > > >> > Cloudera <http://www.cloudera.com>
> > > >> > Twitter: @josh_wills <http://twitter.com/josh_wills>
> > > >>
> > > >
> > > > --
> > > > Director of Data Science
> > > > Cloudera <http://www.cloudera.com>
> > > > Twitter: @josh_wills <http://twitter.com/josh_wills>
