I see the error in the logs, but Pipeline.run() never throws anything; isSucceeded() subsequently returns false. Is there any way to extract the problem on the client side, rather than only being able to state that the job failed? Or is that the only diagnostics available, by design?
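For what it's worth: in Crunch (at least in recent versions), run()/done() hand back a PipelineResult with an overall succeeded() flag plus per-stage results, so the client can at least report which stage died. A submission-time exception like the InvalidInputException below is only logged by CrunchJob, though. A minimal sketch of that reporting pattern, using stand-in classes rather than the real Crunch API (the real StageResult exposes stage names and counters; the per-stage boolean here is a simplification):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Stand-in for Crunch's PipelineResult.StageResult; the per-stage
// success flag is a simplification for this sketch.
class StageResult {
    final String stageName;
    final boolean succeeded;
    StageResult(String stageName, boolean succeeded) {
        this.stageName = stageName;
        this.succeeded = succeeded;
    }
}

class PipelineDiagnostics {
    // Build a client-side failure summary instead of just "job failed".
    static String summarize(StageResult... stages) {
        List<String> failed = Arrays.stream(stages)
            .filter(s -> !s.succeeded)
            .map(s -> s.stageName)
            .collect(Collectors.toList());
        return failed.isEmpty()
            ? "pipeline succeeded"
            : "failed stages: " + String.join(", ", failed);
    }
}
```

With the real API, the equivalent loop would walk result.getStageResults() after result.succeeded() returns false.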
============
68124 [Thread-8] INFO org.apache.crunch.impl.mr.exec.CrunchJob -
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:11010/crunchr-example/input
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
	at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:944)
	at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:961)
	at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
	at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
	at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:476)
	at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:331)
	at org.apache.crunch.impl.mr.exec.CrunchJob.submit(CrunchJob.java:135)
	at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:251)
	at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.run(CrunchJobControl.java:279)
	at java.lang.Thread.run(Thread.java:662)

On Mon, Nov 12, 2012 at 5:41 PM, Dmitriy Lyubimov <[email protected]> wrote:

> for hadoop nodes i guess yet another option is to soft-link the .so into
> hadoop's native lib folder
>
> On Mon, Nov 12, 2012 at 5:37 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> I actually want to defer this to hadoop admins; we just need to create a
>> procedure for setting up nodes, ideally as simple as possible.
>> Something like:
>>
>> 1) setup R
>> 2) install.packages("rJava","RProtoBuf","crunchR")
>> 3) R CMD javareconf
>> 4) add the result of R --vanilla <<< 'system.file("jri", package="rJava")' to
>>    either the mapred command lines or LD_LIBRARY_PATH...
>>
>> but it will depend on their versions of hadoop, jre, etc. I hoped crunch
>> might have something to hide a lot of that complexity (since it is about
>> hiding complexities, for the most part :) ). Besides, hadoop has a way to
>> ship .so's to the backend, so if crunch had an api to do something similar,
>> it is conceivable that the driver might yank and ship it too, to hide that
>> complexity as well. But then there's a host of issues around how to handle
>> potentially different rJava versions installed on different nodes... So it
>> increasingly looks like something we might want to defer to sysops, with an
>> approximate set of requirements.
>>
>> On Mon, Nov 12, 2012 at 5:29 PM, Josh Wills <[email protected]> wrote:
>>
>>> On Mon, Nov 12, 2012 at 5:17 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>
>>> > so java tasks need to be able to load libjri.so from
>>> > whatever system.file("jri", package="rJava") says.
>>> >
>>> > Traditionally, these issues were handled with -Djava.library.path.
>>> > Apparently there's nothing a java task can do to enable the loadLibrary()
>>> > call to see the library once started. But -Djava.library.path requires
>>> > nodes to configure and lock the jvm command line from modifications by
>>> > the client, which is fine.
>>> >
>>> > I also discovered that LD_LIBRARY_PATH actually works with jre 1.6 (again).
>>> >
>>> > but... any other suggestions about best practice for configuring crunch
>>> > to run a user's .so's?
>>>
>>> Not off the top of my head. I suspect that whatever you come up with will
>>> become the "best practice." :)
>>>
>>> > thanks.
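To make the two options under discussion concrete, here is a tiny sketch of the -Djava.library.path flag construction. The jri directory is a hypothetical example; in practice it is whatever system.file("jri", package="rJava") prints on the node:

```java
class NativeLibCheck {
    // Build the task-JVM flag discussed above; jriDir is supplied by the caller.
    static String childJavaOpts(String jriDir) {
        return "-Djava.library.path=" + jriDir;
    }

    public static void main(String[] args) {
        // Hypothetical location; in practice, use whatever
        //   R --vanilla <<< 'system.file("jri", package="rJava")'
        // prints on the node.
        String jriDir = "/usr/lib64/R/library/rJava/jri";
        System.out.println(childJavaOpts(jriDir));
        // The jre 1.6 alternative from the thread: export LD_LIBRARY_PATH=<jriDir>
        // before the JVM starts, and System.loadLibrary("jri") then resolves
        // without any java.library.path setting.
    }
}
```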
>>> > On Sun, Nov 11, 2012 at 1:41 PM, Josh Wills <[email protected]> wrote:
>>> >
>>> > > I believe that is a safe assumption, at least right now.
>>> > >
>>> > > On Sun, Nov 11, 2012 at 1:38 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> > >
>>> > > > Question.
>>> > > >
>>> > > > So in the Crunch api, initialize() doesn't get an emitter, and process()
>>> > > > gets an emitter every time.
>>> > > >
>>> > > > However, my guess is that any single incarnation of a DoFn object in
>>> > > > the backend will always get the same emitter through its lifecycle.
>>> > > > Is that an admissible assumption, or is there currently a
>>> > > > counterexample to it?
>>> > > >
>>> > > > The problem is that as i implement the two-way pipeline of input and
>>> > > > emitter data between R and Java, I am bulking these calls together
>>> > > > for performance reasons. Each individual datum in these chunks of
>>> > > > data will not have emitter function information attached to it in
>>> > > > any way. (well, it could, but it would be a performance killer, and
>>> > > > i bet the emitter never changes).
>>> > > >
>>> > > > So, thoughts? can i assume the emitter never changes between the
>>> > > > first and last call to a DoFn instance?
>>> > > >
>>> > > > thanks.
>>> > > >
>>> > > > On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> > > >
>>> > > > > yes...
>>> > > > >
>>> > > > > i think it worked for me before, although just adding all jars from
>>> > > > > the R package distribution would be a little more appropriate
>>> > > > > approach -- but it creates a problem with jars in dependent R
>>> > > > > packages. I think it would be much easier to just compile a
>>> > > > > hadoop-job file and stick it in, rather than cherry-picking
>>> > > > > individual jars from who knows how many locations.
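Re: the emitter question above, the batching pattern under discussion (capture the emitter on the first process() call, flush buffered results through it later) can be sketched like this. Emitter here is a stand-in for Crunch's interface, batchSize is a made-up knob, and the uppercasing stands in for the R round-trip:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for Crunch's Emitter; same single emit() method.
interface Emitter<T> {
    void emit(T value);
}

class BatchingDoFn {
    private final List<String> buffer = new ArrayList<>();
    private final int batchSize;
    private Emitter<String> cachedEmitter; // assumed stable over the DoFn's lifetime

    BatchingDoFn(int batchSize) {
        this.batchSize = batchSize;
    }

    public void process(String input, Emitter<String> emitter) {
        if (cachedEmitter == null) {
            cachedEmitter = emitter;     // capture once, per the thread's assumption
        }
        buffer.add(input.toUpperCase()); // stand-in for the bulked R round-trip
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Mirrors Crunch's end-of-task cleanup(Emitter) hook: drain the remainder.
    public void cleanup() {
        flush();
    }

    private void flush() {
        for (String s : buffer) {
            cachedEmitter.emit(s);
        }
        buffer.clear();
    }
}
```

If the assumption ever broke (a DoFn handed a different emitter mid-lifecycle), replacing cachedEmitter on every call instead of only the first would make the batching robust at no extra cost.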
>>> > > > > i think i used the hadoop job format with distributed cache
>>> > > > > before, and it worked... at least with Pig's "register jar"
>>> > > > > functionality.
>>> > > > >
>>> > > > > ok, i guess i will just try it and see if it works.
>>> > > > >
>>> > > > > On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <[email protected]> wrote:
>>> > > > >
>>> > > > > > On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> > > > > >
>>> > > > > >> Great! so it is in Crunch.
>>> > > > > >>
>>> > > > > >> does it support the hadoop-job jar format or only pure java jars?
>>> > > > > >
>>> > > > > > I think just pure jars -- you're referring to the hadoop-job
>>> > > > > > format as having all the dependencies in a lib/ directory within
>>> > > > > > the jar?
>>> > > > > >
>>> > > > > >> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <[email protected]> wrote:
>>> > > > > >> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> > > > > >> >
>>> > > > > >> >> I think i need functionality to add more jars (or an external
>>> > > > > >> >> hadoop-jar) to drive that from an R package. Just setting the
>>> > > > > >> >> job jar by class is not enough. I can push the overall job-jar
>>> > > > > >> >> as an additional jar to the R package; however, i cannot
>>> > > > > >> >> really run the hadoop command line on it, i need to set up the
>>> > > > > >> >> classpath thru RJava.
>>> > > > > >> >>
>>> > > > > >> >> A traditional single hadoop job jar will unlikely work here,
>>> > > > > >> >> since we cannot hardcode pipelines in java code but rather
>>> > > > > >> >> have to construct them on the fly.
>>> > > > > >> >> (well, we could serialize pipeline definitions from R and
>>> > > > > >> >> then replay them in a driver -- but that's too cumbersome and
>>> > > > > >> >> more work than it has to be.) There's no reason why i
>>> > > > > >> >> shouldn't be able to do a pig-like "register jar" or
>>> > > > > >> >> "setJobJar" (mahout-like) when kicking off a pipeline.
>>> > > > > >> >
>>> > > > > >> > o.a.c.util.DistCache.addJarToDistributedCache?
>>> > > > > >> >
>>> > > > > >> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>> > > > > >> >> > Ok, sounds very promising...
>>> > > > > >> >> >
>>> > > > > >> >> > i'll try to start digging on the driver part this week then
>>> > > > > >> >> > (Pipeline wrapper in R5).
>>> > > > > >> >> >
>>> > > > > >> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <[email protected]> wrote:
>>> > > > > >> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> > > > > >> >> >>> Ok, cool.
>>> > > > > >> >> >>>
>>> > > > > >> >> >>> So what state is Crunch in? I take it is in a fairly
>>> > > > > >> >> >>> advanced state. So every api mentioned in the FlumeJava
>>> > > > > >> >> >>> paper is working, right? Or is there something that is not
>>> > > > > >> >> >>> working, specifically?
>>> > > > > >> >> >>
>>> > > > > >> >> >> I think the only thing in the paper that we don't have in a
>>> > > > > >> >> >> working state is MSCR fusion. It's mostly just a question
>>> > > > > >> >> >> of prioritizing it and getting the work done.
>>> > > > > >> >> >>
>>> > > > > >> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <[email protected]> wrote:
>>> > > > > >> >> >>>> Hey Dmitriy,
>>> > > > > >> >> >>>>
>>> > > > > >> >> >>>> Got a fork going and looking forward to playing with
>>> > > > > >> >> >>>> crunchR this weekend -- thanks!
>>> > > > > >> >> >>>>
>>> > > > > >> >> >>>> J
>>> > > > > >> >> >>>>
>>> > > > > >> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> > > > > >> >> >>>>
>>> > > > > >> >> >>>>> Project template: https://github.com/dlyubimov/crunchR
>>> > > > > >> >> >>>>>
>>> > > > > >> >> >>>>> The default profile does not compile the R artifact; the
>>> > > > > >> >> >>>>> R profile does. For convenience, it is enabled by
>>> > > > > >> >> >>>>> supplying -DR to the mvn command line, e.g.
>>> > > > > >> >> >>>>>
>>> > > > > >> >> >>>>> mvn install -DR
>>> > > > > >> >> >>>>>
>>> > > > > >> >> >>>>> there's also a helper that installs the snapshot version
>>> > > > > >> >> >>>>> of the package in the crunchR module.
>>> > > > > >> >> >>>>>
>>> > > > > >> >> >>>>> There are RJava and JRI java dependencies which i did
>>> > > > > >> >> >>>>> not find anywhere in public maven repos, so they are
>>> > > > > >> >> >>>>> installed into my github maven repo so far. Should
>>> > > > > >> >> >>>>> compile for 3rd parties.
>>> > > > > >> >> >>>>>
>>> > > > > >> >> >>>>> -DR compilation requires R, RJava and, optionally,
>>> > > > > >> >> >>>>> RProtoBuf. R doc compilation requires roxygen2 (i think).
>>> > > > > >> >> >>>>>
>>> > > > > >> >> >>>>> For some reason RProtoBuf fails to import into another
>>> > > > > >> >> >>>>> package; i got a weird exception when i put @import
>>> > > > > >> >> >>>>> RProtoBuf into crunchR, so RProtoBuf is now in the
>>> > > > > >> >> >>>>> "Suggests" category. Down the road that may be a problem
>>> > > > > >> >> >>>>> though...
>>> > > > > >> >> >>>>>
>>> > > > > >> >> >>>>> other than the template, not much else has been done so
>>> > > > > >> >> >>>>> far... finding hadoop libraries and adding them to the
>>> > > > > >> >> >>>>> package path on initialization via "hadoop classpath"...
>>> > > > > >> >> >>>>> adding Crunch jars and their non-"provided" transitives
>>> > > > > >> >> >>>>> to crunchR's java part...
>>> > > > > >> >> >>>>>
>>> > > > > >> >> >>>>> No legal stuff...
>>> > > > > >> >> >>>>>
>>> > > > > >> >> >>>>> No readmes... complete stealth at this point.
>>> > > > > >> >> >>>>>
>>> > > > > >> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> > > > > >> >> >>>>> > Ok, cool. I will try to roll a project template by
>>> > > > > >> >> >>>>> > some time next week. We can start with prototyping and
>>> > > > > >> >> >>>>> > benchmarking something really simple, such as
>>> > > > > >> >> >>>>> > parallelDo().
>>> > > > > >> >> >>>>> >
>>> > > > > >> >> >>>>> > My interim goal is to perhaps take some more or less
>>> > > > > >> >> >>>>> > simple algorithm from Mahout and demonstrate that it
>>> > > > > >> >> >>>>> > can be solved with Rcrunch (or whatever name it has to
>>> > > > > >> >> >>>>> > be) in comparable time (performance) but with much
>>> > > > > >> >> >>>>> > fewer lines of code.
>>> > > > > >> >> >>>>> > (say, one of the factorization or clustering things)
>>> > > > > >> >> >>>>> >
>>> > > > > >> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <[email protected]> wrote:
>>> > > > > >> >> >>>>> >> I am not much of an R user, but I am interested to
>>> > > > > >> >> >>>>> >> see how well we can integrate the two. I would be
>>> > > > > >> >> >>>>> >> happy to help.
>>> > > > > >> >> >>>>> >>
>>> > > > > >> >> >>>>> >> regards,
>>> > > > > >> >> >>>>> >> Rahul
>>> > > > > >> >> >>>>> >>
>>> > > > > >> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
>>> > > > > >> >> >>>>> >>>
>>> > > > > >> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> > > > > >> >> >>>>> >>>>
>>> > > > > >> >> >>>>> >>>> Yep, ok.
>>> > > > > >> >> >>>>> >>>>
>>> > > > > >> >> >>>>> >>>> I imagine it has to be an R module, so I can set up
>>> > > > > >> >> >>>>> >>>> a maven project with a java/R code tree (I have
>>> > > > > >> >> >>>>> >>>> been doing that a lot lately). Or if you have a
>>> > > > > >> >> >>>>> >>>> template to look at, it would be useful i guess too.
>>> > > > > >> >> >>>>> >>>
>>> > > > > >> >> >>>>> >>> No, please go right ahead.
>>> > > > > >> >> >>>>> >>>
>>> > > > > >> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <[email protected]> wrote:
>>> > > > > >> >> >>>>> >>>>>
>>> > > > > >> >> >>>>> >>>>> I'd like it to be separate at first, but I am
>>> > > > > >> >> >>>>> >>>>> happy to help. Github repo?
>>> > > > > >> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <[email protected]> wrote:
>>> > > > > >> >> >>>>> >>>>>
>>> > > > > >> >> >>>>> >>>>>> Ok, maybe there's a benefit to trying a JRI/RJava
>>> > > > > >> >> >>>>> >>>>>> prototype on top of Crunch for something simple.
>>> > > > > >> >> >>>>> >>>>>> This should both save time and prove or disprove
>>> > > > > >> >> >>>>> >>>>>> whether Crunch-via-RJava integration is viable.
>>> > > > > >> >> >>>>> >>>>>>
>>> > > > > >> >> >>>>> >>>>>> On my part i can try to do it within the Crunch
>>> > > > > >> >> >>>>> >>>>>> framework, or we can keep it completely separate.
>>> > > > > >> >> >>>>> >>>>>>
>>> > > > > >> >> >>>>> >>>>>> -d
>>> > > > > >> >> >>>>> >>>>>>
>>> > > > > >> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <[email protected]> wrote:
>>> > > > > >> >> >>>>> >>>>>>>
>>> > > > > >> >> >>>>> >>>>>>> I am an avid R user and would be into it -- who
>>> > > > > >> >> >>>>> >>>>>>> gave the talk? Was it Murray Stokely?
>>> > > > > >> >> >>>>> >>>>>>>
>>> > > > > >> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> > > > > >> >> >>>>> >>>>>>>>
>>> > > > > >> >> >>>>> >>>>>>>> Hello,
>>> > > > > >> >> >>>>> >>>>>>>>
>>> > > > > >> >> >>>>> >>>>>>>> I was pretty excited to learn of Google's
>>> > > > > >> >> >>>>> >>>>>>>> experience with an R mapping of flume java at
>>> > > > > >> >> >>>>> >>>>>>>> one of the recent BARUGs.
>>> > > > > >> >> >>>>> >>>>>>>> I think a lot of applications similar to what
>>> > > > > >> >> >>>>> >>>>>>>> we do in Mahout could be prototyped using
>>> > > > > >> >> >>>>> >>>>>>>> flume R.
>>> > > > > >> >> >>>>> >>>>>>>>
>>> > > > > >> >> >>>>> >>>>>>>> I did not quite get the details of Google's
>>> > > > > >> >> >>>>> >>>>>>>> implementation of the R mapping, but i am not
>>> > > > > >> >> >>>>> >>>>>>>> sure a direct mapping from R to Crunch would be
>>> > > > > >> >> >>>>> >>>>>>>> sufficient (and, for the most part, efficient).
>>> > > > > >> >> >>>>> >>>>>>>> RJava/JRI and jni seem to be pretty terrible
>>> > > > > >> >> >>>>> >>>>>>>> performers for doing that directly.
>>> > > > > >> >> >>>>> >>>>>>>>
>>> > > > > >> >> >>>>> >>>>>>>> On top of it, I am thinking that if this
>>> > > > > >> >> >>>>> >>>>>>>> project could have a contributed adapter to
>>> > > > > >> >> >>>>> >>>>>>>> Mahout's distributed matrices, that would be a
>>> > > > > >> >> >>>>> >>>>>>>> very good synergy.
>>> > > > > >> >> >>>>> >>>>>>>>
>>> > > > > >> >> >>>>> >>>>>>>> Is there anyone interested in
>>> > > > > >> >> >>>>> >>>>>>>> contributing/advising for an open source
>>> > > > > >> >> >>>>> >>>>>>>> version of flume R support? Just gauging
>>> > > > > >> >> >>>>> >>>>>>>> interest; the Crunch list seems like a natural
>>> > > > > >> >> >>>>> >>>>>>>> place to poke.
>>> > > > > >> >> >>>>> >>>>>>>>
>>> > > > > >> >> >>>>> >>>>>>>> Thanks.
>>> > > > > >> >> >>>>> >>>>>>>>
>>> > > > > >> >> >>>>> >>>>>>>> -Dmitriy
>>> > > > > >> >> >>>>> >>>>>>>
>>> > > > > >> >> >>>>> >>>>>>> --
>>> > > > > >> >> >>>>> >>>>>>> Director of Data Science
>>> > > > > >> >> >>>>> >>>>>>> Cloudera
>>> > > > > >> >> >>>>> >>>>>>> Twitter: @josh_wills
