Re: Crunch R first milestone

Josh Wills Sat, 24 Nov 2012 10:30:06 -0800

Hey Dmitriy,

I'm up and running w/Example1.R on my Linux machine-- very cool! My Mac is
having some sort of issue w/creating /tmp/crunch* directories that I need
to sort out.


In the example you sent of the broken chaining of DoFns, why didn't the
first line (quoted below) require a PType? Is there a shortcut for the case
when the PType of the child is the same as the PType of the parent?

# wordsPCol <- inputPCol$parallelDo(
# function(line) emit( strsplit(tolower(line),"[^[:alnum:]]+")[[1]] )
# )

Josh



On Fri, Nov 23, 2012 at 1:59 PM, Dmitriy Lyubimov <[email protected]> wrote:

> OK, support for PTable emission of (key, value) pairs works in the latest commit.
>
> My current problem is that composition of DoFns doesn't work, probably
> because of the sequence of cleanup() calls. I have to figure it out:
>
> =============
> this composition of 2 functions (PCollection, PTable) is a problem
>
> # wordsPCol <- inputPCol$parallelDo(
> # function(line) emit( strsplit(tolower(line),"[^[:alnum:]]+")[[1]] )
> # )
> #
> # wordsPTab <- wordsPCol$parallelDo(function(word) emit(word,1),
> # keyType = crunchR.RString$new(),
> # valueType = crunchR.RUint32$new())
>
> but this equivalent works:
> wordsPTab <- inputPCol$parallelDo(
>   function(line) {
>     words <- strsplit(tolower(line), "[^[:alnum:]]+")[[1]]
>     sapply(words, function(x) emit(x, 1))
>   },
>   keyType = crunchR.RString$new(),
>   valueType = crunchR.RUint32$new()
> )
>
>
>
> On Thu, Nov 22, 2012 at 2:13 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> > OK, I guess I am going to work on the next milestone, which is PTableType
> > serialization support between the R and Java sides.
> >
> > Once I am done with that, I guess I will be able to add other APIs and
> > complete the word count example fairly easily.
> >
> > Example1.R in its current state works.
> >
> >
> > On Wed, Nov 21, 2012 at 12:11 PM, Josh Wills <[email protected]> wrote:
> >
> >> I'm going to play with this again over the break-- BTW, did you see Renjin?
> >> I somehow missed this, but it looks interesting.
> >>
> >> http://code.google.com/p/renjin/
> >>
> >>
> >> On Sun, Nov 18, 2012 at 11:44 AM, Dmitriy Lyubimov <[email protected]> wrote:
> >>
> >> > On Sun, Nov 18, 2012 at 9:37 AM, Josh Wills <[email protected]> wrote:
> >> >
> >> > > Dmitriy,
> >> > >
> >> > > Just sent you a pull request based on playing with the code on OS X. It
> >> > > contains a README about my experience getting things working.
> >> > >
> >> >
> >> > Are you sure it is the doxygen package? I thought it was the roxygen2 package.
> >> >
> >> > Actually, there currently seems to be no established best practice for R5
> >> > classes + roxygen2 (and it ignores the @import order of files, too). Hence
> >> > the hacks with file names.
> >> >
> >> >
> >> > > Unfortunately, I haven't succeeded in getting crunchR loaded, I'm running
> >> > > into some issues w/RProtoBuf on OS X. I'll give it another go this week on
> >> > > my Linux machine at work.
> >> > >
> >> > OK, I removed @import RProtoBuf, so you should be able to install w/o it.
> >> > Maven still compiles the protoc stuff, though.
> >> >
> >> > >
> >> > > J
> >> > >
> >> > >
> >> > > On Sat, Nov 17, 2012 at 12:49 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >> > >
> >> > > > Josh,
> >> > > >
> >> > > > OK, the following commit
> >> > > >
> >> > > > ==============
> >> > > > commit 67605360838f810fa5ddf99abb3ef2962d3f05e3
> >> > > > Author: Dmitriy Lyubimov <[email protected]>
> >> > > > Date:   Sat Nov 17 12:29:27 2012 -0800
> >> > > >
> >> > > >     example1 succeeds
> >> > > >
> >> > > > ====================
> >> > > >
> >> > > > runs example 1 for me successfully in a fully distributed way, which is
> >> > > > the first step (the map-only part) of the word count.
> >> > > >
> >> > > > (I think there's a hiccup somewhere here, because in the output I also seem
> >> > > > to see some empty lines, so the strsplit() part is perhaps set up somewhat
> >> > > > incorrectly, but that's not the point right now):
> >> > > >
> >> > > > ====Example1.R===========
> >> > > >
> >> > > > library(crunchR)
> >> > > >
> >> > > > pipeline <- crunchR.MRPipeline$new("test-pipeline")
> >> > > >
> >> > > > inputPCol <- pipeline$readTextFile("/crunchr-examples/input")
> >> > > >
> >> > > > outputPCol <- inputPCol$parallelDo(
> >> > > >   function(line) emit( strsplit(tolower(line),"[^[:alnum:]]")[[1]] )
> >> > > > )
> >> > > >
> >> > > > outputPCol$writeTextFile("/crunchr-examples/output")
> >> > > >
> >> > > > result <- pipeline$run()
> >> > > >
> >> > > > if ( !result$succeeded() ) stop ("pipeline failed.")
> >> > > >
> >> > > > ========================================
> >> > > >
> >> > > > I think R-Java communication should now support multiple DoFns, and they
> >> > > > will be properly shut down, executed, and synchronized even if they emit
> >> > > > in the cleanup phase.
> >> > > >
> >> > > > This example assumes a lot of defaults (such as RTypes, which by default
> >> > > > are a character vector singleton in and a character vector out for a DoFn).
> >> > > > Also, it obviously uses only text in, text out at this point.
> >> > > >
> >> > > >
> >> > > > To run, install the package and upload the test input (test-prep.sh).
> >> > > > Assuming you have compiled the Maven part, the R package snapshot can be
> >> > > > installed by running "install-snapshot-rpkg.sh".
> >> > > >
> >> > > > You also need to make sure your backend tasks see the JRI library. There
> >> > > > are multiple ways to do it, I guess, but for the purposes of testing the
> >> > > > following just works for me in my mapred-site:
> >> > > >
> >> > > > <property>
> >> > > >    <name>mapred.child.java.opts</name>
> >> > > >    <value>-Djava.library.path=/home/dmitriy/R/x86_64-pc-linux-gnu-library/2/rJava/jri</value>
> >> > > >    <final>false</final>
> >> > > > </property>
> >> > > >
> >> > > >
> >> > > > I think at this point you guys might help me by reviewing that stuff,
> >> > > > asking questions, and making suggestions on how to go about incorporating
> >> > > > other types of DoFn and perhaps a way to complete the word count example,
> >> > > > perhaps running comparative benchmarks against a Java-only word count to
> >> > > > see how much overhead we seem to be suffering here.
> >> > > >
> >> > > > I use StatET in Eclipse. Although it is a huge step forward, the process is
> >> > > > still extremely tedious since I don't know the unit testing frameworks in R
> >> > > > well (so I just scribble some stuff on the side to unit-test this and that),
> >> > > > and the integration test cycle takes significant time.
> >> > > >
> >> > > > Which is why any help and suggestions are very welcome!
> >> > > >
> >> > > > I will definitely add support for reading/writing sequence files and
> >> > > > Protobufs, as well as Mahout DRMs.
> >> > > >
> >> > > >
> >> > > > Thanks.
> >> > > > -Dmitriy
> >> > > >
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> Director of Data Science
> >> Cloudera <http://www.cloudera.com>
> >> Twitter: @josh_wills <http://twitter.com/josh_wills>
> >>
> >
> >
>
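
On the empty lines mentioned in the Nov 17 message: a minimal base-R sketch, run outside crunchR, of how the two split patterns that appear in this thread behave. It is only an illustration. Whether the empty output lines really came from this is an assumption, and the helper name tokenize is hypothetical (it is not part of crunchR).

```r
# Base R only; nothing here touches crunchR or Hadoop.
line <- "Hello, World"

# Single-character class, as in Example1.R: ", " is two separators,
# so an empty token appears between them.
single <- strsplit(tolower(line), "[^[:alnum:]]")[[1]]

# With "+", as in the later PTable examples: a run of separators
# counts as one split point.
run <- strsplit(tolower(line), "[^[:alnum:]]+")[[1]]

# Hypothetical helper (not a crunchR function): also drops the empty
# token that a leading separator leaves behind.
tokenize <- function(line) {
  words <- strsplit(tolower(line), "[^[:alnum:]]+")[[1]]
  words[nzchar(words)]
}

print(single)
print(run)
print(tokenize("  Leading spaces"))
```

Even with the "+" pattern, a line that starts with a separator still yields one leading empty token, which is why the sketch filters with nzchar() before anything would be emitted.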
