Curious-- did you figure out a hack to make this work, or is this still an open issue?
On Fri, Nov 16, 2012 at 3:08 PM, Dmitriy Lyubimov <[email protected]> wrote:

> Or RTNode? I guess i am not sure what the difference is.
>
> Bottom line, i need to do some task startup routines (e.g. establish
> exchange queues between task and R) and also last-thing cleanup before MR
> tasks exit and _before all outputs are closed_ (a kind of "flush all"
> thing).
>
> Thanks.
> -d

On Fri, Nov 16, 2012 at 3:04 PM, Dmitriy Lyubimov <[email protected]> wrote:

> How do I hook into CrunchTaskContext to do a task cleanup (as opposed to
> a DoFn etc.)?

On Fri, Nov 16, 2012 at 2:52 PM, Dmitriy Lyubimov <[email protected]> wrote:

> no, it is fully distributed testing.
>
> It is ok, StatET handles log4j logging for me so i see the logs. I was
> wondering if any end-to-end diagnostics are already embedded in Crunch,
> but reporting backend errors to the front end is notoriously hard (and
> sometimes impossible) with hadoop, so I assume it doesn't make sense to
> report client-only stuff thru an exception while the other stuff still
> requires checking isSucceeded().

On Fri, Nov 16, 2012 at 11:07 AM, Josh Wills <[email protected]> wrote:

> Are you running this using LocalJobRunner? Does calling
> Pipeline.enableDebug() before run() help? If it doesn't, it'll help
> settle a debate I'm having w/Matthias. ;-)

On Fri, Nov 16, 2012 at 10:22 AM, Dmitriy Lyubimov <[email protected]> wrote:

> I see the error in the logs, but Pipeline.run() has never thrown
> anything; isSucceeded() subsequently returns false. Is there any way to
> extract the client-side problem rather than just being able to state
> that the job failed? Or is that ok and the only diagnostics by design?
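The run()-then-check-isSucceeded() pattern described above can be wrapped so that a backend failure at least surfaces as a client-side exception. A minimal, self-contained sketch; `MiniPipeline` is a hypothetical stand-in for the real Crunch Pipeline, and only the two calls named in the thread are assumed:

```java
// Hypothetical stand-in for the Crunch Pipeline discussed in the thread:
// only run() and isSucceeded() are assumed to exist.
interface MiniPipeline {
    void run();
    boolean isSucceeded();
}

public class RunOrThrow {
    // run() never throws on a backend failure, so poll isSucceeded()
    // afterwards and convert a false result into an exception.
    static void runOrThrow(MiniPipeline p) {
        p.run();
        if (!p.isSucceeded()) {
            throw new IllegalStateException(
                "pipeline failed; see task logs for the root cause");
        }
    }

    public static void main(String[] args) {
        // Simulate a job whose submission failed (e.g. a missing input path).
        MiniPipeline failing = new MiniPipeline() {
            public void run() { /* swallows the backend error */ }
            public boolean isSucceeded() { return false; }
        };
        try {
            runOrThrow(failing);
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

This doesn't recover the root cause (that still lives in the task logs, as the thread notes), it only makes the failure impossible to miss on the client.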
> ============
> 68124 [Thread-8] INFO org.apache.crunch.impl.mr.exec.CrunchJob -
> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:11010/crunchr-example/input
>         at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
>         at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
>         at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:944)
>         at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:961)
>         at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:476)
>         at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:331)
>         at org.apache.crunch.impl.mr.exec.CrunchJob.submit(CrunchJob.java:135)
>         at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:251)
>         at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.run(CrunchJobControl.java:279)
>         at java.lang.Thread.run(Thread.java:662)

On Mon, Nov 12, 2012 at 5:41 PM, Dmitriy Lyubimov <[email protected]> wrote:

> for hadoop nodes, i guess yet another option is to soft-link the .so into
> hadoop's native lib folder.

On Mon, Nov 12, 2012 at 5:37 PM, Dmitriy Lyubimov <[email protected]> wrote:

> I actually want to defer this to hadoop admins; we just need to create a
> procedure for setting up nodes. Ideally as simple as possible, something
> like:
>
> 1) set up R
> 2) install.packages(c("rJava", "RProtoBuf", "crunchR"))
> 3) R CMD javareconf
> 4) add the result of R --vanilla <<< 'system.file("jri", package="rJava")'
>    to either mapred command lines or LD_LIBRARY_PATH...
>
> but it will depend on their versions of hadoop, jre etc. I hoped crunch
> might have something to hide a lot of that complexity (since it is about
> hiding complexities, for the most part :) ). Besides, hadoop has a way to
> ship .so's to the backend, so if crunch had an api to do something
> similar, it is conceivable that the driver might yank and ship it too, to
> hide that complexity as well. But then there's a host of issues around
> how to handle potentially different rJava versions installed on different
> nodes... So, it increasingly looks like something we might want to defer
> to sysops to do, with an approximate set of requirements.

On Mon, Nov 12, 2012 at 5:29 PM, Josh Wills <[email protected]> wrote:

> On Mon, Nov 12, 2012 at 5:17 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> so java tasks need to be able to load libjri.so from wherever
>> system.file("jri", package="rJava") says.
>>
>> Traditionally, these issues were handled with -Djava.library.path.
>> Apparently there's nothing a java task can do to enable the
>> loadLibrary() call to see the damn library once started.
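The loadLibrary() situation above is easy to reproduce: once the JVM is up, System.loadLibrary() only consults java.library.path, which is fixed at startup. A small self-contained probe (no rJava assumed; "jri" will simply fail to load on any machine without libjri.so on the library path):

```java
// Probe whether a native library (here "jri", as in rJava/JRI) is visible
// to the running JVM. System.loadLibrary() searches java.library.path,
// which cannot be usefully changed after JVM startup -- hence the
// -Djava.library.path / LD_LIBRARY_PATH discussion above.
public class JriProbe {
    public static void main(String[] args) {
        String lib = args.length > 0 ? args[0] : "jri";
        try {
            System.loadLibrary(lib);
            System.out.println(lib + ": loaded");
        } catch (UnsatisfiedLinkError e) {
            System.out.println(lib + ": not found on java.library.path = "
                    + System.getProperty("java.library.path"));
        }
    }
}
```

Running this inside a map task would show exactly which directories the task JVM searches, which is the quickest way to tell whether LD_LIBRARY_PATH or a mapred child-opts setting actually reached the backend.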
>> But -Djava.library.path requires the nodes to configure and lock the
>> jvm command line from modifications by the client, which is fine.
>>
>> I also discovered that LD_LIBRARY_PATH actually works with jre 1.6
>> (again).
>>
>> but... any other suggestions about best practices for configuring
>> crunch to run users' .so's?
>
> Not off the top of my head. I suspect that whatever you come up with
> will become the "best practice." :)
>
>> thanks.

On Sun, Nov 11, 2012 at 1:41 PM, Josh Wills <[email protected]> wrote:

> I believe that is a safe assumption, at least right now.

On Sun, Nov 11, 2012 at 1:38 PM, Dmitriy Lyubimov <[email protected]> wrote:

> Question.
>
> So in the Crunch api, initialize() doesn't get an emitter, and process()
> gets an emitter every time.
>
> However, my guess is that any single reincarnation of a DoFn object in
> the backend will always be getting the same emitter thru its lifecycle.
> Is that an admissible assumption, or is there currently a counterexample
> to it?
>
> The problem is that as i implement the two-way pipeline of input and
> emitter data between R and Java, I am bulking these calls together for
> performance reasons.
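That batching idea, together with the same-emitter assumption, can be sketched without any Crunch or R machinery. `Emitter` and `BufferingFn` below are hypothetical stand-ins for the real DoFn/Emitter types; the point is just that caching the emitter on the first process() call is safe only if one instance always sees the same emitter:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Crunch's Emitter.
interface Emitter<T> { void emit(T t); }

// DoFn-like sketch: cache the emitter on the first process() call and
// batch output through it, flushing in cleanup() before the task exits.
class BufferingFn {
    private Emitter<String> cached;           // captured on first call
    private final List<String> buffer = new ArrayList<>();

    void process(String input, Emitter<String> emitter) {
        if (cached == null) cached = emitter; // assume it never changes
        buffer.add(input.toUpperCase());
        if (buffer.size() >= 2) flush();      // tiny batch for the demo
    }

    void cleanup() { flush(); }               // final "flush all" step

    private void flush() {
        for (String s : buffer) cached.emit(s);
        buffer.clear();
    }
}

public class EmitterDemo {
    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        Emitter<String> e = out::add;
        BufferingFn fn = new BufferingFn();
        fn.process("a", e);
        fn.process("b", e);
        fn.process("c", e);
        fn.cleanup();
        System.out.println(out);  // [A, B, C]
    }
}
```

If the emitter could change between calls, the cached reference would route a batch to a stale emitter, which is why the question above matters.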
> Each individual datum in these chunks of data will not have emitter
> function information attached to it in any way. (well, it could, but
> that would be a performance killer, and i bet the emitter never
> changes).
>
> So, thoughts? can i assume the emitter never changes between the first
> and last call to a DoFn instance?
>
> thanks.

On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <[email protected]> wrote:

> yes...
>
> i think it worked for me before, although just adding all jars from the
> R package distribution would be a little more appropriate approach --
> but it creates a problem with jars in dependent R packages. I think it
> would be much easier to just compile a hadoop-job file and stick it in,
> rather than cherry-picking individual jars from who knows how many
> locations.
>
> i think i used the hadoop job format with the distributed cache before
> and it worked... at least with Pig's "register jar" functionality.
>
> ok, i guess i will just try and see if it works.

On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <[email protected]> wrote:

> On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> Great! so it is in Crunch.
> >>> >>>> > > > > >> > >>> >>>> > > > > >> does it support hadoop-job jar format or only pure java > >>> jars? > >>> >>>> > > > > >> > >>> >>>> > > > > > > >>> >>>> > > > > > I think just pure jars-- you're referring to hadoop-job > >>> format > >>> >>>> as > >>> >>>> > > > having > >>> >>>> > > > > > all the dependencies in a lib/ directory within the jar? > >>> >>>> > > > > > > >>> >>>> > > > > > > >>> >>>> > > > > >> > >>> >>>> > > > > >> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills < > >>> >>>> [email protected]> > >>> >>>> > > > > wrote: > >>> >>>> > > > > >> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov < > >>> >>>> > > > [email protected]> > >>> >>>> > > > > >> wrote: > >>> >>>> > > > > >> > > >>> >>>> > > > > >> >> I think i need functionality to add more jars (or > >>> external > >>> >>>> > > > > hadoop-jar) > >>> >>>> > > > > >> >> to drive that from an R package. Just setting job > jar > >>> by > >>> >>>> class > >>> >>>> > is > >>> >>>> > > > not > >>> >>>> > > > > >> >> enough. I can push overall job-jar as an addiitonal > >>> jar to > >>> >>>> R > >>> >>>> > > > package; > >>> >>>> > > > > >> >> however, i cannot really run hadoop command line on > >>> it, i > >>> >>>> need > >>> >>>> > to > >>> >>>> > > > set > >>> >>>> > > > > >> >> up classpath thru RJava. > >>> >>>> > > > > >> >> > >>> >>>> > > > > >> >> Traditional single hadoop job jar will unlikely work > >>> here > >>> >>>> since > >>> >>>> > > we > >>> >>>> > > > > >> >> cannot hardcode pipelines in java code but rather > >>> have to > >>> >>>> > > construct > >>> >>>> > > > > >> >> them on the fly. (well, we could serialize pipeline > >>> >>>> definitions > >>> >>>> > > > from > >>> >>>> > > > > R > >>> >>>> > > > > >> >> and then replay them in a driver -- but that's too > >>> >>>> cumbersome > >>> >>>> > and > >>> >>>> > > > > more > >>> >>>> > > > > >> >> work than it has to be.) 
>> There's no reason why i shouldn't be able to do a pig-like "register
>> jar" or a mahout-like "setJobJar" when kicking off a pipeline.
>
> o.a.c.util.DistCache.addJarToDistributedCache?

On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <[email protected]> wrote:

> Ok, sounds very promising...
>
> i'll try to start digging on the driver part this week then (the
> Pipeline wrapper in R5).

On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <[email protected]> wrote:

> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> Ok, cool.
>>
>> So what state is Crunch in? I take it it is in a fairly advanced state.
>> So every api mentioned in the FlumeJava paper is working, right? Or is
>> there something that is not working specifically?
>
> I think the only thing in the paper that we don't have in a working
> state is MSCR fusion.
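The "register jar" idea above -- handing a pile of jars to something like the DistCache.addJarToDistributedCache call Josh mentions, one at a time -- mostly reduces to finding them. A self-contained sketch that collects every jar under a directory; the layout (an R package's java/ directory) is an assumption, and only java.io/java.nio are used:

```java
import java.io.File;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

public class JarScan {
    // Recursively collect *.jar files under dir, e.g. so a driver could
    // register each one with the distributed cache before kicking off a
    // pipeline.
    public static List<String> findJars(File dir) {
        List<String> jars = new ArrayList<>();
        File[] entries = dir.listFiles();
        if (entries == null) return jars;        // not a readable directory
        for (File f : entries) {
            if (f.isDirectory()) jars.addAll(findJars(f));
            else if (f.getName().endsWith(".jar")) jars.add(f.getPath());
        }
        return jars;
    }

    public static void main(String[] args) throws Exception {
        // Tiny demo layout: one jar, one non-jar, one nested jar.
        File root = Files.createTempDirectory("jarscan").toFile();
        new File(root, "a.jar").createNewFile();
        new File(root, "notes.txt").createNewFile();
        File lib = new File(root, "lib");
        lib.mkdir();
        new File(lib, "b.jar").createNewFile();
        System.out.println(findJars(root).size());  // 2
    }
}
```

Scanning a directory rather than hardcoding jar names is what makes the scheme survive dependent R packages dropping their own jars into the tree.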
> It's mostly just a question of prioritizing it and getting the work
> done.

On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <[email protected]> wrote:

> Hey Dmitriy,
>
> Got a fork going and looking forward to playing with crunchR this
> weekend -- thanks!
>
> J

On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <[email protected]> wrote:

> Project template: https://github.com/dlyubimov/crunchR
>
> The default profile does not compile the R artifact; the R profile
> compiles the R artifact. For convenience, it is enabled by supplying -DR
> on the mvn command line, e.g.
>
>   mvn install -DR
>
> there's also a helper that installs the snapshot version of the package
> in the crunchR module.
> >>> >>>> > > > > >> >> >>>>> > >>> >>>> > > > > >> >> >>>>> There's RJava and JRI java dependencies which > i > >>> did > >>> >>>> not > >>> >>>> > > find > >>> >>>> > > > > >> anywhere > >>> >>>> > > > > >> >> >>>>> in public maven repos; so it is installed into > >>> my > >>> >>>> github > >>> >>>> > > > maven > >>> >>>> > > > > >> repo > >>> >>>> > > > > >> >> so > >>> >>>> > > > > >> >> >>>>> far. Should compile for 3rd party. > >>> >>>> > > > > >> >> >>>>> > >>> >>>> > > > > >> >> >>>>> -DR compilation requires R, RJava and > >>> optionally, > >>> >>>> > > RProtoBuf. > >>> >>>> > > > R > >>> >>>> > > > > Doc > >>> >>>> > > > > >> >> >>>>> compilation requires roxygen2 (i think). > >>> >>>> > > > > >> >> >>>>> > >>> >>>> > > > > >> >> >>>>> For some reason RProtoBuf fails to import into > >>> >>>> another > >>> >>>> > > > package, > >>> >>>> > > > > >> got a > >>> >>>> > > > > >> >> >>>>> weird exception when i put @import RProtoBuf > >>> into > >>> >>>> > crunchR, > >>> >>>> > > so > >>> >>>> > > > > >> >> >>>>> RProtoBuf is now in "Suggests" category. Down > >>> the > >>> >>>> road > >>> >>>> > that > >>> >>>> > > > may > >>> >>>> > > > > >> be a > >>> >>>> > > > > >> >> >>>>> problem though... > >>> >>>> > > > > >> >> >>>>> > >>> >>>> > > > > >> >> >>>>> other than the template, not much else has > been > >>> done > >>> >>>> so > >>> >>>> > > > far... > >>> >>>> > > > > >> >> finding > >>> >>>> > > > > >> >> >>>>> hadoop libraries and adding it to the package > >>> path on > >>> >>>> > > > > >> initialization > >>> >>>> > > > > >> >> >>>>> via "hadoop classpath"... adding Crunch jars > >>> and its > >>> >>>> > > > > >> non-"provided" > >>> >>>> > > > > >> >> >>>>> transitives to the crunchR's java part... > >>> >>>> > > > > >> >> >>>>> > >>> >>>> > > > > >> >> >>>>> No legal stuff... > >>> >>>> > > > > >> >> >>>>> > >>> >>>> > > > > >> >> >>>>> No readmes... complete stealth at this point. 
> >>> >>>> > > > > >> >> >>>>> > >>> >>>> > > > > >> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy > >>> Lyubimov < > >>> >>>> > > > > >> >> [email protected]> > >>> >>>> > > > > >> >> >>>>> wrote: > >>> >>>> > > > > >> >> >>>>> > Ok, cool. I will try to roll project > template > >>> by > >>> >>>> some > >>> >>>> > > time > >>> >>>> > > > > next > >>> >>>> > > > > >> >> week. > >>> >>>> > > > > >> >> >>>>> > we can start with prototyping and > benchmarking > >>> >>>> > something > >>> >>>> > > > > really > >>> >>>> > > > > >> >> >>>>> > simple, such as parallelDo(). > >>> >>>> > > > > >> >> >>>>> > > >>> >>>> > > > > >> >> >>>>> > My interim goal is to perhaps take some more > >>> or > >>> >>>> less > >>> >>>> > > simple > >>> >>>> > > > > >> >> algorithm > >>> >>>> > > > > >> >> >>>>> > from Mahout and demonstrate it can be solved > >>> with > >>> >>>> > Rcrunch > >>> >>>> > > > (or > >>> >>>> > > > > >> >> whatever > >>> >>>> > > > > >> >> >>>>> > name it has to be) in a comparable time > >>> >>>> (performance) > >>> >>>> > but > >>> >>>> > > > > with > >>> >>>> > > > > >> much > >>> >>>> > > > > >> >> >>>>> > fewer lines of code. (say one of > >>> factorization or > >>> >>>> > > > clustering > >>> >>>> > > > > >> >> things) > >>> >>>> > > > > >> >> >>>>> > > >>> >>>> > > > > >> >> >>>>> > > >>> >>>> > > > > >> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul < > >>> >>>> > > [email protected] > >>> >>>> > > > > > >>> >>>> > > > > >> wrote: > >>> >>>> > > > > >> >> >>>>> >> I am not much of R user but I am interested > >>> to > >>> >>>> see how > >>> >>>> > > > well > >>> >>>> > > > > we > >>> >>>> > > > > >> can > >>> >>>> > > > > >> >> >>>>> integrate > >>> >>>> > > > > >> >> >>>>> >> the two. I would be happy to help. 
> >>> >>>> > > > > >> >> >>>>> >> > >>> >>>> > > > > >> >> >>>>> >> regards, > >>> >>>> > > > > >> >> >>>>> >> Rahul > >>> >>>> > > > > >> >> >>>>> >> > >>> >>>> > > > > >> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote: > >>> >>>> > > > > >> >> >>>>> >>> > >>> >>>> > > > > >> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy > >>> >>>> Lyubimov < > >>> >>>> > > > > >> >> [email protected]> > >>> >>>> > > > > >> >> >>>>> >>> wrote: > >>> >>>> > > > > >> >> >>>>> >>>> > >>> >>>> > > > > >> >> >>>>> >>>> Yep, ok. > >>> >>>> > > > > >> >> >>>>> >>>> > >>> >>>> > > > > >> >> >>>>> >>>> I imagine it has to be an R module so I > >>> can set > >>> >>>> up a > >>> >>>> > > > maven > >>> >>>> > > > > >> >> project > >>> >>>> > > > > >> >> >>>>> >>>> with java/R code tree (I have been doing > >>> that a > >>> >>>> lot > >>> >>>> > > > > lately). > >>> >>>> > > > > >> Or > >>> >>>> > > > > >> >> if you > >>> >>>> > > > > >> >> >>>>> >>>> have a template to look at, it would be > >>> useful i > >>> >>>> > guess > >>> >>>> > > > > too. > >>> >>>> > > > > >> >> >>>>> >>> > >>> >>>> > > > > >> >> >>>>> >>> No, please go right ahead. > >>> >>>> > > > > >> >> >>>>> >>> > >>> >>>> > > > > >> >> >>>>> >>>> > >>> >>>> > > > > >> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh > >>> Wills < > >>> >>>> > > > > >> >> [email protected]> > >>> >>>> > > > > >> >> >>>>> wrote: > >>> >>>> > > > > >> >> >>>>> >>>>> > >>> >>>> > > > > >> >> >>>>> >>>>> I'd like it to be separate at first, but > >>> I am > >>> >>>> happy > >>> >>>> > > to > >>> >>>> > > > > help. > >>> >>>> > > > > >> >> Github > >>> >>>> > > > > >> >> >>>>> >>>>> repo? 
> >>> >>>> > > > > >> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy > >>> Lyubimov" < > >>> >>>> > > > > >> [email protected] > >>> >>>> > > > > >> >> > > >>> >>>> > > > > >> >> >>>>> wrote: > >>> >>>> > > > > >> >> >>>>> >>>>> > >>> >>>> > > > > >> >> >>>>> >>>>>> Ok maybe there's a benefit to try a > >>> JRI/RJava > >>> >>>> > > > prototype > >>> >>>> > > > > on > >>> >>>> > > > > >> >> top of > >>> >>>> > > > > >> >> >>>>> >>>>>> Crunch for something simple. This > should > >>> both > >>> >>>> save > >>> >>>> > > > time > >>> >>>> > > > > and > >>> >>>> > > > > >> >> prove or > >>> >>>> > > > > >> >> >>>>> >>>>>> disprove if Crunch via RJava > integration > >>> is > >>> >>>> > viable. > >>> >>>> > > > > >> >> >>>>> >>>>>> > >>> >>>> > > > > >> >> >>>>> >>>>>> On my part i can try to do it within > >>> Crunch > >>> >>>> > > framework > >>> >>>> > > > > or we > >>> >>>> > > > > >> >> can keep > >>> >>>> > > > > >> >> >>>>> >>>>>> it completely separate. > >>> >>>> > > > > >> >> >>>>> >>>>>> > >>> >>>> > > > > >> >> >>>>> >>>>>> -d > >>> >>>> > > > > >> >> >>>>> >>>>>> > >>> >>>> > > > > >> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh > >>> Wills < > >>> >>>> > > > > >> >> [email protected]> > >>> >>>> > > > > >> >> >>>>> >>>>>> wrote: > >>> >>>> > > > > >> >> >>>>> >>>>>>> > >>> >>>> > > > > >> >> >>>>> >>>>>>> I am an avid R user and would be into > >>> it-- > >>> >>>> who > >>> >>>> > gave > >>> >>>> > > > the > >>> >>>> > > > > >> >> talk? Was > >>> >>>> > > > > >> >> >>>>> it > >>> >>>> > > > > >> >> >>>>> >>>>>>> Murray Stokely? 
> >>> >>>> > > > > >> >> >>>>> >>>>>>> > >>> >>>> > > > > >> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, > Dmitriy > >>> >>>> > Lyubimov < > >>> >>>> > > > > >> >> >>>>> [email protected]> > >>> >>>> > > > > >> >> >>>>> >>>>>> > >>> >>>> > > > > >> >> >>>>> >>>>>> wrote: > >>> >>>> > > > > >> >> >>>>> >>>>>>>> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> Hello, > >>> >>>> > > > > >> >> >>>>> >>>>>>>> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> I was pretty excited to learn of > >>> Google's > >>> >>>> > > experience > >>> >>>> > > > > of R > >>> >>>> > > > > >> >> mapping > >>> >>>> > > > > >> >> >>>>> of > >>> >>>> > > > > >> >> >>>>> >>>>>>>> flume java on one of recent BARUGs. I > >>> think > >>> >>>> a > >>> >>>> > lot > >>> >>>> > > of > >>> >>>> > > > > >> >> applications > >>> >>>> > > > > >> >> >>>>> >>>>>>>> similar to what we do in Mahout could > >>> be > >>> >>>> > > prototyped > >>> >>>> > > > > using > >>> >>>> > > > > >> >> flume R. > >>> >>>> > > > > >> >> >>>>> >>>>>>>> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> I did not quite get the details of > >>> Google > >>> >>>> > > > > implementation > >>> >>>> > > > > >> of > >>> >>>> > > > > >> >> R > >>> >>>> > > > > >> >> >>>>> >>>>>>>> mapping, > >>> >>>> > > > > >> >> >>>>> >>>>>>>> but i am not sure if just a direct > >>> mapping > >>> >>>> from > >>> >>>> > R > >>> >>>> > > to > >>> >>>> > > > > >> Crunch > >>> >>>> > > > > >> >> would > >>> >>>> > > > > >> >> >>>>> be > >>> >>>> > > > > >> >> >>>>> >>>>>>>> sufficient (and, for most part, > >>> efficient). > >>> >>>> > > > RJava/JRI > >>> >>>> > > > > and > >>> >>>> > > > > >> >> jni > >>> >>>> > > > > >> >> >>>>> seem to > >>> >>>> > > > > >> >> >>>>> >>>>>>>> be a pretty terrible performer to do > >>> that > >>> >>>> > > directly. 
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> on top of it, I am thinknig if this > >>> project > >>> >>>> > could > >>> >>>> > > > > have a > >>> >>>> > > > > >> >> >>>>> contributed > >>> >>>> > > > > >> >> >>>>> >>>>>>>> adapter to Mahout's distributed > >>> matrices, > >>> >>>> that > >>> >>>> > > would > >>> >>>> > > > > be > >>> >>>> > > > > >> >> just a > >>> >>>> > > > > >> >> >>>>> very > >>> >>>> > > > > >> >> >>>>> >>>>>>>> good synergy. > >>> >>>> > > > > >> >> >>>>> >>>>>>>> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> Is there anyone interested in > >>> >>>> > > contributing/advising > >>> >>>> > > > > for > >>> >>>> > > > > >> open > >>> >>>> > > > > >> >> >>>>> source > >>> >>>> > > > > >> >> >>>>> >>>>>>>> version of flume R support? Just > >>> gauging > >>> >>>> > interest, > >>> >>>> > > > > Crunch > >>> >>>> > > > > >> >> list > >>> >>>> > > > > >> >> >>>>> seems > >>> >>>> > > > > >> >> >>>>> >>>>>>>> like a natural place to poke. > >>> >>>> > > > > >> >> >>>>> >>>>>>>> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> Thanks . 
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> -Dmitriy > >>> >>>> > > > > >> >> >>>>> >>>>>>> > >>> >>>> > > > > >> >> >>>>> >>>>>>> > >>> >>>> > > > > >> >> >>>>> >>>>>>> > >>> >>>> > > > > >> >> >>>>> >>>>>>> -- > >>> >>>> > > > > >> >> >>>>> >>>>>>> Director of Data Science > >>> >>>> > > > > >> >> >>>>> >>>>>>> Cloudera > >>> >>>> > > > > >> >> >>>>> >>>>>>> Twitter: @josh_wills > >>> >>>> > > > > >> >> >>>>> >>> > >>> >>>> > > > > >> >> >>>>> >>> > >>> >>>> > > > > >> >> >>>>> >>> > >>> >>>> > > > > >> >> >>>>> >> > >>> >>>> > > > > >> >> >>>>> > >>> >>>> > > > > >> >> >>>> > >>> >>>> > > > > >> >> >>>> > >>> >>>> > > > > >> >> >>>> > >>> >>>> > > > > >> >> >>>> -- > >>> >>>> > > > > >> >> >>>> Director of Data Science > >>> >>>> > > > > >> >> >>>> Cloudera <http://www.cloudera.com> > >>> >>>> > > > > >> >> >>>> Twitter: @josh_wills < > >>> http://twitter.com/josh_wills> > >>> >>>> > > > > >> >> > >>> >>>> > > > > >> > > >>> >>>> > > > > >> > > >>> >>>> > > > > >> > > >>> >>>> > > > > >> > -- > >>> >>>> > > > > >> > Director of Data Science > >>> >>>> > > > > >> > Cloudera <http://www.cloudera.com> > >>> >>>> > > > > >> > Twitter: @josh_wills <http://twitter.com/josh_wills> > >>> >>>> > > > > >> > >>> >>>> > > > > > > >>> >>>> > > > > > > >>> >>>> > > > > > > >>> >>>> > > > > > -- > >>> >>>> > > > > > Director of Data Science > >>> >>>> > > > > > Cloudera <http://www.cloudera.com> > >>> >>>> > > > > > Twitter: @josh_wills <http://twitter.com/josh_wills> > >>> >>>> > > > > > >>> >>>> > > > > >>> >>>> > > > >>> >>>> > > >>> >>>> > >>> >>>> > >>> >>>> > >>> >>>> -- > >>> >>>> Director of Data Science > >>> >>>> Cloudera <http://www.cloudera.com> > >>> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills> > >>> >>>> > >>> >>> > >>> >>> > >>> >> > >>> > >>> > >>> > >>> -- > >>> Director of Data Science > >>> Cloudera > >>> Twitter: @josh_wills > >>> > >> > >> > > >
