OK, I guess I am going to work on the next milestone, which is PTableType serialization support between the R and Java sides.
Once I am done with that, I guess I will be able to add the other APIs and complete the word count example fairly easily. Example1.R works in its current state.

On Wed, Nov 21, 2012 at 12:11 PM, Josh Wills <[email protected]> wrote:

> I'm going to play with this again over the break-- BTW, did you see Renjin?
> I somehow missed this, but it looks interesting.
>
> http://code.google.com/p/renjin/
>
>
> On Sun, Nov 18, 2012 at 11:44 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
> > On Sun, Nov 18, 2012 at 9:37 AM, Josh Wills <[email protected]> wrote:
> >
> > > Dmitrity,
> > >
> > > Just sent you a pull request based on playing with the code on OS X. It
> > > contains a README about my experience getting things working.
> > >
> >
> > Are you sure it is doxygen package? I thought it was roxygen2 package?
> >
> > Actually there seems currently no best practice in existence for R5 classes
> > + roxygen2 (and the guy ignores @import order of files, too). Hence the
> > hacks with file names.
> >
> >
> > > Unfortunately, I haven't succeeded in getting crunchR loaded, I'm running
> > > into some issues w/RProtoBuf on OS X. I'll give it another go this week on
> > > my Linux machine at work.
> > >
> >
> > ok i removed @import RProtoBuf, you should be able to install w/o it. Maven
> > still compiles protoc stuff though.
> >
> >
> > >
> > > J
> > >
> > >
> > > On Sat, Nov 17, 2012 at 12:49 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > >
> > > > Josh,
> > > >
> > > > ok the following commit
> > > >
> > > > ==============
> > > > commit 67605360838f810fa5ddf99abb3ef2962d3f05e3
> > > > Author: Dmitriy Lyubimov <[email protected]>
> > > > Date: Sat Nov 17 12:29:27 2012 -0800
> > > >
> > > >     example1 succeeds
> > > >
> > > > ====================
> > > >
> > > > runs example 1 for me successfully in a fully distributed way which is
> > > > first step (map-only thing) for the word count.
> > > >
> > > > (I think there's a hiccup somewhere here because in the output I also
> > > > seem to see some empty lines, so the strsplit() part is perhaps set up
> > > > somewhat incorrectly here, but it's not the point right now):
> > > >
> > > > ====Example1.R===========
> > > >
> > > > library(crunchR)
> > > >
> > > > pipeline <- crunchR.MRPipeline$new("test-pipeline")
> > > >
> > > > inputPCol <- pipeline$readTextFile("/crunchr-examples/input")
> > > >
> > > > outputPCol <- inputPCol$parallelDo(
> > > >   function(line) emit( strsplit(tolower(line),"[^[:alnum:]]")[[1]] )
> > > > )
> > > >
> > > > outputPCol$writeTextFile("/crunchr-examples/output")
> > > >
> > > > result <- pipeline$run()
> > > >
> > > > if ( !result$succeeded() ) stop ("pipeline failed.")
> > > >
> > > > ========================================
> > > >
> > > > I think R-java communication now should support multiple doFn ok and
> > > > they will be properly shut down and executed and synchronized even if
> > > > they emit in the cleanup phase.
> > > >
> > > > This example assumes a lot of defaults (such as RTypes which are by
> > > > default character vector singleton in and character vector out for a
> > > > DoFn). Also obviously uses text in-text out at this point only.
> > > >
> > > >
> > > > To run, install the package and upload the test input (test-prep.sh).
> > > > Assuming you have compiled the maven part, the R package snapshot could
> > > > be installed by running "install-snapshot-rpkg.sh".
> > > >
> > > > You also need to make sure your backend tasks see JRI library.
> > > > There are multiple ways to do it, I guess, but for the purposes of
> > > > testing the following just works for me in my mapred-site:
> > > >
> > > > <property>
> > > >   <name>mapred.child.java.opts</name>
> > > >   <value>-Djava.library.path=/home/dmitriy/R/x86_64-pc-linux-gnu-library/2/rJava/jri</value>
> > > >   <final>false</final>
> > > > </property>
> > > >
> > > >
> > > > I think at this point you guys might help me by doing review of that
> > > > stuff, asking questions and making suggestions how to go by
> > > > incorporating other types of doFn and perhaps a way to complete the
> > > > word count example, perhaps running comparative benchmarks with a
> > > > java-only word count, how much overhead we seem to be suffering here.
> > > >
> > > > I use StatEt in eclipse. Although it is a huge way forward, the process
> > > > is still extremely tedious since I don't know unit testing framework in
> > > > R well (so I just scribble some stuff on the side to unit-test this and
> > > > that) and the integration test running cycle is significant enough.
> > > >
> > > > Which is why any help and suggestions are very welcome!
> > > >
> > > > I will definitely add support for reading/writing sequence files and
> > > > Protobufs, as well as Mahout DRMs.
> > > >
> > > >
> > > > Thanks.
> > > > -Dmitrity
> > > >
> > >
> >
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
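
A side note on the empty lines mentioned in the quoted Example1.R discussion: one possible cause (not confirmed against the actual job output) is that the pattern "[^[:alnum:]]" splits on every single separator character, so adjacent separators such as ", " leave an empty string between them. The sketch below reproduces that in plain R, without crunchR, and shows one way to avoid it. The helper name tokenize is hypothetical and not part of crunchR.

```r
# Hypothetical helper (not part of crunchR): split a line into lowercase
# words and drop any empty strings that strsplit() produces.
tokenize <- function(line) {
  # "+" treats a run of separators such as ", " as a single split point
  words <- strsplit(tolower(line), "[^[:alnum:]]+")[[1]]
  # a separator at the very start still yields one leading "", so filter it
  words[nzchar(words)]
}

# Pattern as used in Example1.R: each separator character splits on its own,
# so the "," followed by " " leaves an empty string between the two words.
original <- strsplit(tolower("Hello, world!"), "[^[:alnum:]]")[[1]]
print(original)

# Same input through the helper: no empty strings remain.
print(tokenize("Hello, world!"))
print(tokenize("  leading spaces"))
```

If this turns out to be the cause, the function passed to parallelDo in Example1.R could presumably emit the filtered vector instead, assuming emit() accepts a character vector the same way it does there.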
