On Sun, Nov 18, 2012 at 10:13 AM, Dmitriy Lyubimov <[email protected]> wrote:

> Thank you, Josh. Your insights are greatly appreciated.
>
> RProtoBuf has a bug with <<- operator. I already contacted the authors and
> they confirmed it however it is not clear when they are going to fix it.
>
> (code to reproduce:
> > library(RProtoBuf)
> > a <<- "A"
> causes an error)
>
> Actually RProtoBuf is not used right now. I will move it into "recommended"
> realm again if it makes things easier.
>
> For me, the hardest part was to make jvm +hadoop to see JRI library
> actually. I am still not sure about the best course of action here but
> there is definitely more than one way
>
> Also my apologies for code styling, it is probably the ugliest code i've
> ever written, but i will tidy it up once past the proof of concept stage.

No judgements, man. You should have seen the first rev of Crunch. ;-)

> -d
>
> On Sun, Nov 18, 2012 at 9:37 AM, Josh Wills <[email protected]> wrote:
>
>> Dmitrity,
>>
>> Just sent you a pull request based on playing with the code on OS X. It
>> contains a README about my experience getting things working.
>>
>> Unfortunately, I haven't succeeded in getting crunchR loaded, I'm running
>> into some issues w/RProtoBuf on OS X. I'll give it another go this week on
>> my Linux machine at work.
>>
>> J
>>
>> On Sat, Nov 17, 2012 at 12:49 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>> > Josh,
>> >
>> > ok the following commit
>> >
>> > ==============
>> > commit 67605360838f810fa5ddf99abb3ef2962d3f05e3
>> > Author: Dmitriy Lyubimov <[email protected]>
>> > Date: Sat Nov 17 12:29:27 2012 -0800
>> >
>> > example1 succeeds
>> >
>> > ====================
>> >
>> > runs example 1 for me successfully in a fully distributed way which is
>> > first step (map-only thing) for the word count.
>> >
>> > (I think there's a hickup somewhere here because in the output i also seem
>> > to see some empty lines, so the strsplit() part is perhaps set up somewhat
>> > incorrectly here, but it's not the point right now):
>> >
>> > ====Example1.R===========
>> >
>> > library(crunchR)
>> >
>> > pipeline <- crunchR.MRPipeline$new("test-pipeline")
>> >
>> > inputPCol <- pipeline$readTextFile("/crunchr-examples/input")
>> >
>> > outputPCol <- inputPCol$parallelDo(
>> > function(line) emit( strsplit(tolower(line),"[^[:alnum:]]")[[1]] )
>> > )
>> >
>> > outputPCol$writeTextFile("/crunchr-examples/output")
>> >
>> > result <- pipeline$run()
>> >
>> > if ( !result$succeeded() ) stop ("pipeline failed.")
>> >
>> > ========================================
>> >
>> > I think R-java communication now should support multiple doFn ok and they
>> > will be properly shut down and executed and synchronized even if they emit
>> > in the cleanup phase.
>> >
>> > This example assumes a lot of defaults (such as RTypes which are by default
>> > character vector singleton in and character vector out for a DoFn). Also
>> > obviously uses text in-text out at this point only.
>> >
>> > To run, install the package and upload the test input (test-prep.sh)
>> > Assuming you have compiled the maven part, the R package snapshot could be
>> > installed by running "install-snapshot-rpkg.sh".
>> >
>> > You also need to make sure your backend tasks see JRI library. there are
>> > multiple ways to do it i guess but for the purposes of testing the
>> > following just works for me in my mapred-site:
>> >
>> > <property>
>> > <name>mapred.child.java.opts</name>
>> > <value>-Djava.library.path=/home/dmitriy/R/x86_64-pc-linux-gnu-library/2/rJava/jri</value>
>> > <final>false</final>
>> > </property>
>> >
>> > I think at this point you guys might help me by doing review of that stuff,
>> > asking questions and making suggestions how to go by incorporating other
>> > types of doFn and perhaps a way to complete the word count example, perhaps
>> > running comparative benchmarks with a java-only word count, how much
>> > overhead we seem to be suffering here.
>> >
>> > I use StatEt in eclipse. Although it is a huge way forward, the process is
>> > still extremely tedious since I don't know unit testing framework in R well
>> > (so i just scribble some stuff on the side to unit-test this and that) and
>> > the integration test running cycle is significant enough.
>> >
>> > Which is why any help and suggestions are very welcome!
>> >
>> > I will definitely add support for reading/writing sequence files and
>> > Protobufs, as well as Mahout DRM's .
>> >
>> > Thanks.
>> > -Dmitrity
>> >
>>

--
Director of Data Science
Cloudera
Twitter: @josh_wills
