Dmitriy,

Just sent you a pull request based on playing with the code on OS X. It contains a README about my experience getting things working.
Unfortunately, I haven't succeeded in getting crunchR loaded; I'm running into some issues with RProtoBuf on OS X. I'll give it another go this week on my Linux machine at work.

J

On Sat, Nov 17, 2012 at 12:49 PM, Dmitriy Lyubimov <[email protected]> wrote:

> Josh,
>
> OK, the following commit
>
> ==============
> commit 67605360838f810fa5ddf99abb3ef2962d3f05e3
> Author: Dmitriy Lyubimov <[email protected]>
> Date:   Sat Nov 17 12:29:27 2012 -0800
>
>     example1 succeeds
>
> ====================
>
> runs example 1 for me successfully in a fully distributed way, which is the
> first step (the map-only part) of the word count.
>
> (I think there's a hiccup somewhere here, because in the output I also seem
> to see some empty lines, so the strsplit() part is perhaps set up somewhat
> incorrectly, but that's not the point right now):
>
> ====Example1.R===========
>
> library(crunchR)
>
> pipeline <- crunchR.MRPipeline$new("test-pipeline")
>
> inputPCol <- pipeline$readTextFile("/crunchr-examples/input")
>
> outputPCol <- inputPCol$parallelDo(
>   function(line) emit( strsplit(tolower(line),"[^[:alnum:]]")[[1]] )
> )
>
> outputPCol$writeTextFile("/crunchr-examples/output")
>
> result <- pipeline$run()
>
> if ( !result$succeeded() ) stop ("pipeline failed.")
>
> ========================================
>
> I think the R-Java communication should now support multiple DoFns: they
> will be properly executed, synchronized, and shut down even if they emit
> in the cleanup phase.
>
> This example assumes a lot of defaults (such as the RTypes, which for a DoFn
> default to a character vector singleton in and a character vector out). It
> also obviously uses only text in, text out at this point.
>
> To run, install the package and upload the test input (test-prep.sh).
> Assuming you have compiled the Maven part, the R package snapshot can be
> installed by running "install-snapshot-rpkg.sh".
>
> You also need to make sure your backend tasks see the JRI library. There are
> multiple ways to do it, I guess, but for the purposes of testing the
> following just works for me in my mapred-site:
>
> <property>
>   <name>mapred.child.java.opts</name>
>   <value>-Djava.library.path=/home/dmitriy/R/x86_64-pc-linux-gnu-library/2/rJava/jri</value>
>   <final>false</final>
> </property>
>
> I think at this point you guys might help me by reviewing that stuff,
> asking questions, and making suggestions on how to go about incorporating
> other types of DoFn, and perhaps a way to complete the word count example,
> perhaps running comparative benchmarks against a Java-only word count to see
> how much overhead we seem to be suffering here.
>
> I use StatET in Eclipse. Although it is a huge step forward, the process is
> still extremely tedious, since I don't know the unit testing frameworks in R
> well (so I just scribble some stuff on the side to unit-test this and that)
> and the integration test cycle is long enough to hurt.
>
> Which is why any help and suggestions are very welcome!
>
> I will definitely add support for reading/writing sequence files and
> Protobufs, as well as Mahout DRMs.
>
> Thanks.
> -Dmitriy
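On the empty lines mentioned in the quoted message: one possible cause, offered as a guess rather than a diagnosis, is that splitting on a single non-alphanumeric character produces an empty string between two adjacent separators (for example ", "). The sketch below uses only base R and runs outside crunchR; the helper name tokenize is hypothetical, and whether this accounts for the blank lines in the pipeline output is an assumption.

```r
# Hypothetical helper, not part of crunchR: split a line into lowercase words.
tokenize <- function(line) {
  # "+" treats a run of separators as one split point, so ", " no longer
  # leaves an empty string between the comma and the space.
  words <- strsplit(tolower(line), "[^[:alnum:]]+")[[1]]
  # A separator at the very start of the line still yields a leading "",
  # so drop any empty strings that remain.
  words[nzchar(words)]
}

# The pattern from Example1.R, for comparison: it splits on each separator
# character individually and keeps the empty strings in between.
original_split <- function(line) {
  strsplit(tolower(line), "[^[:alnum:]]")[[1]]
}

print(original_split("Hello, world! Foo"))
print(tokenize("Hello, world! Foo"))
```

If the guess holds, the body of the DoFn in Example1.R could emit the filtered vector instead of the raw strsplit() result; that change is untested against a running pipeline.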
