Re: Crunch R first milestone

Josh Wills Tue, 27 Nov 2012 21:10:25 -0800

On Sat, Nov 24, 2012 at 10:17 PM, Dmitriy Lyubimov <[email protected]>wrote:


> it looks like easy and naive solution might be to detect whenever
> IntermediateEmitter is used in a function and serialize one to R side
> instead of actually using it on Java side.
>
> thoughts?
>

I can think of a few ways to do this, but none that I'm happy with yet. The
overhead of the R-Java bridge is certainly something we would like to do
away with when we can; the question is whether determining when to avoid it
should live on the Crunch side via the optimizer/runner, or whether we
should have the planner expose a data structure that explains the plan it
is going to use and allow the R side to use that plan to do the function
composition step itself before calling in to Crunch. That step could also
be useful in other environments-- I think Gabriel went at least partway
down this path already w/the pipeline visualization JIRA he did a few weeks
back.

The latter approach would mean that we would need to have RCrunch be it's
own (more R-like) wrapper around the underlying R-Java bridge, which also
has some appeal. But we'd need to play around with it a little and see what
it looked like.


>
> On Sat, Nov 24, 2012 at 5:05 PM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
> > Aha. I did a number of bug fixes so both examples (with 1 doFn and 2
> > doFn's ) are working now. Good.
> >
> > Josh, please read the comment to the second example. The intermediate
> > output of doFn #1 runs to java/Crunch and back just to be fed into doFn
> #2.
> > I would very much like to short-circuit those things on R side. Otherwise
> > it will be very hard to optimize multi-tenant applications (multiple
> > decoupled models encapsulated into bunch of doFn's distributed and
> > optimized by Crunch). Which is actually my pattern for production.
> >
> > I'd be eternally grateful if you could give it a thought. It may require
> > some exposure of Crunch optimizer internals IMO.
> >
> > thank you, sir.
> >
> >
> > On Sat, Nov 24, 2012 at 11:53 AM, Dmitriy Lyubimov <[email protected]
> >wrote:
> >
> >> also one obvious optimization is that if we somehow could perform
> >> extraction of DoFn's DAG for a particular task, we could re-connect that
> >> DAG on the R side instead of piping that data back and forth from R to
> java
> >> DAG of doFn's. But i would need a help from somebody with deep inner
> >> knowledge of Crunch optimizer to extract and materialize such DAGs of
> >> functions on the R side.
> >>
> >>
> >> On Sat, Nov 24, 2012 at 11:13 AM, Dmitriy Lyubimov <[email protected]
> >wrote:
> >>
> >>> Another perhaps useful piece of information is that process, initialize
> >>> and cleanup R closures may share the same environment and this is
> handled
> >>> corerctly at the backend, e.g.
> >>>
> >>> createClosures <- function () {
> >>>    x <- 0, y<- 0
> >>>   startup <- function () x <<- 1
> >>>   process <- function(value) y <<- ifelse(x==1,2,0)
> >>>   cleanup <- function() emit(x + y)
> >>>
> >>>   list(process,startup,cleanup)
> >>> }
> >>>
> >>> this function will produce 3 closures that share same environment
> >>> containing x and y and each task at backend should emit value 3.
> >>>
> >>>
> >>> On Sat, Nov 24, 2012 at 10:54 AM, Dmitriy Lyubimov <[email protected]
> >wrote:
> >>>
> >>>>
> >>>>
> >>>>
> >>>> On Sat, Nov 24, 2012 at 10:29 AM, Josh Wills <[email protected]
> >wrote:
> >>>>
> >>>>> Hey Dmitriy,
> >>>>>
> >>>>> I'm up and running w/Example1.R on my Linux machine-- very cool! My
> >>>>> Mac is
> >>>>> having some sort of issue w/creating /tmp/crunch* directories that I
> >>>>> need
> >>>>> to sort out.
> >>>>>
> >>>>> In the example you sent of the broken chaining of DoFns, why didn't
> the
> >>>>> first line (quoted below) require a PType?
> >>>>
> >>>>
> >>>> Because the implementation assumes a default  type which is character
> >>>> vector as below. Also, if it detects
> >>>> that key type was specified explicitly, it returns PTable
> automatically
> >>>> instead of PCollection.
> >>>>
> >>>> Further on, PTable's emits automatically assume emit(key,value)
> >>>> invocation for concise of notation (instead of java's
> Pair.of(key,value) )
> >>>> and PCollections assume just emit(value).
> >>>>
> >>>>  parallelDo = function ( FUN_PROCESS,
> >>>> FUN_INITIALIZE=NULL,FUN_CLEANUP=NULL,
> >>>> valueType=crunchR.RStrings$new(), keyType) {
> >>>>  if (missing(keyType)) {
> >>>>
> >>>>
> .parallelDo.PCollection(FUN_PROCESS,FUN_INITIALIZE,FUN_CLEANUP,valueType)
> >>>>  } else {
> >>>>
> >>>>
> .parallelDo.PTable(FUN_PROCESS,FUN_INITIALIZE,FUN_CLEANUP,keyType,valueType)
> >>>> }
> >>>>  },
> >>>>
> >>>>
> >>>>
> >>>> Is there a shortcut for the case
> >>>>> when the PType of the child is the same as the PType of the parent?
> >>>>>
> >>>>
> >>>> er... no. it kind of always assume RStrings (which assumes
> >>>> PType<String> but corresponding R type is multi-emit, i.e. you can
> emit a
> >>>> vector once and internally it will translate into bunch of calls of
> >>>> emit(String). This is a notion that i made specifically for R since R
> >>>> operates with vectors and lists, so i can emit just one vector type
> and
> >>>> declare it a multi-emit. It is not clear to me if this notion will
> have a
> >>>> benefit. Obviously, you still can emit R character vector as a single
> >>>> value, too, but you would have to select different RType thing there
> to
> >>>> imply your intent.
> >>>>
> >>>> Word count is a good example where multi-emit RType serves you well:
> >>>> you output result of split[[1]] which is a character vector as one R
> call
> >>>> emit(split...) but it translates into bunch of individual emits (the
> >>>> variant i had before this last one with PTable, or the one commented
> one
> >>>> here :
> >>>>
> >>>> # wordsPCol <- inputPCol$parallelDo(
> >>>> > # function(line) emit( strsplit(tolower(line),"[^[:alnum:]]+")[[1]]
> )
> >>>> > # )
> >>>>
> >>>>
> >>>>
> >>>>>
> >>>>> # wordsPCol <- inputPCol$parallelDo(
> >>>>> # function(line) emit( strsplit(tolower(line),"[^[:alnum:]]+")[[1]] )
> >>>>> # )
> >>>>>
> >>>>> Josh
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Fri, Nov 23, 2012 at 1:59 PM, Dmitriy Lyubimov <[email protected]
> >
> >>>>> wrote:
> >>>>>
> >>>>> > ok support for PTable emission (key,value) pairs work in the latest
> >>>>> commit.
> >>>>> >
> >>>>> > My current problem is that composition of doFunctions doesn't work,
> >>>>> > probably because of the sequence of cleanup() calls. I have to
> >>>>> figure out:
> >>>>> >
> >>>>> > =============
> >>>>> > this composition of 2 functions (PCollection, PTable) is a problem
> >>>>> >
> >>>>> > # wordsPCol <- inputPCol$parallelDo(
> >>>>> > # function(line) emit(
> strsplit(tolower(line),"[^[:alnum:]]+")[[1]] )
> >>>>> > # )
> >>>>> > #
> >>>>> > # wordsPTab <- wordsPCol$parallelDo(function(word) emit(word,1),
> >>>>> > # keyType = crunchR.RString$new(),
> >>>>> > # valueType = crunchR.RUint32$new())
> >>>>> >
> >>>>> > but this equivalent works:
> >>>>> > wordsPTab <- inputPCol$parallelDo(
> >>>>> > function(line) {
> >>>>> > words<- strsplit(tolower(line),"[^[:alnum:]]+")[[1]]
> >>>>> > sapply(words, function(x) emit(x,1))
> >>>>> > },
> >>>>> > keyType = crunchR.RString$new(),
> >>>>> > valueType = crunchR.RUint32$new()
> >>>>> > )
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> > On Thu, Nov 22, 2012 at 2:13 PM, Dmitriy Lyubimov <
> [email protected]
> >>>>> >
> >>>>> > wrote:
> >>>>> >
> >>>>> > > Ok ,  I guess i am going to work on the next milestone which is
> >>>>> > PTableType
> >>>>> > > serialization support between R and java sides.
> >>>>> > >
> >>>>> > > once i am done with that, i guess i will be able to add other api
> >>>>> and
> >>>>> > > complete word count example fairly easily.
> >>>>> > >
> >>>>> > > Example1.R in its current state works.
> >>>>> > >
> >>>>> > >
> >>>>> > > On Wed, Nov 21, 2012 at 12:11 PM, Josh Wills <
> [email protected]>
> >>>>> > wrote:
> >>>>> > >
> >>>>> > >> I'm going to play with this again over the break-- BTW, did you
> >>>>> see
> >>>>> > >> Renjin?
> >>>>> > >> I somehow missed this, but it looks interesting.
> >>>>> > >>
> >>>>> > >> http://code.google.com/p/renjin/
> >>>>> > >>
> >>>>> > >>
> >>>>> > >> On Sun, Nov 18, 2012 at 11:44 AM, Dmitriy Lyubimov <
> >>>>> [email protected]
> >>>>> > >> >wrote:
> >>>>> > >>
> >>>>> > >> > On Sun, Nov 18, 2012 at 9:37 AM, Josh Wills <
> >>>>> [email protected]>
> >>>>> > >> wrote:
> >>>>> > >> >
> >>>>> > >> > > Dmitrity,
> >>>>> > >> > >
> >>>>> > >> > > Just sent you a pull request based on playing with the code
> >>>>> on OS X.
> >>>>> > >> It
> >>>>> > >> > > contains a README about my experience getting things
> working.
> >>>>> > >> > >
> >>>>> > >> >
> >>>>> > >> > Are you sure it is doxygen package? I thought it was roxygen2
> >>>>> package?
> >>>>> > >> >
> >>>>> > >> > Actually there seems currently no best practice in existence
> >>>>> for R5
> >>>>> > >> classes
> >>>>> > >> > + roxygen2 (and the guy ignores @import order of files, too).
> >>>>> Hence
> >>>>> > the
> >>>>> > >> > hacks with file names.
> >>>>> > >> >
> >>>>> > >> >
> >>>>> > >> > > Unfortunately, I haven't succeeded in getting crunchR
> loaded,
> >>>>> I'm
> >>>>> > >> running
> >>>>> > >> > > into some issues w/RProtoBuf on OS X. I'll give it another
> go
> >>>>> this
> >>>>> > >> week
> >>>>> > >> > on
> >>>>> > >> > > my Linux machine at work.
> >>>>> > >> > >
> >>>>> > >> > ok i removed @import RProtoBuf, you should be able to install
> >>>>> w/o it.
> >>>>> > >> Maven
> >>>>> > >> > still compiles protoc  stuff though.
> >>>>> > >> >
> >>>>> > >> > >
> >>>>> > >> > > J
> >>>>> > >> > >
> >>>>> > >> > >
> >>>>> > >> > > On Sat, Nov 17, 2012 at 12:49 PM, Dmitriy Lyubimov <
> >>>>> > [email protected]
> >>>>> > >> > > >wrote:
> >>>>> > >> > >
> >>>>> > >> > > > Josh,
> >>>>> > >> > > >
> >>>>> > >> > > > ok the following commit
> >>>>> > >> > > >
> >>>>> > >> > > > ==============
> >>>>> > >> > > > commit 67605360838f810fa5ddf99abb3ef2962d3f05e3
> >>>>> > >> > > > Author: Dmitriy Lyubimov <[email protected]>
> >>>>> > >> > > > Date:   Sat Nov 17 12:29:27 2012 -0800
> >>>>> > >> > > >
> >>>>> > >> > > >     example1 succeeds
> >>>>> > >> > > >
> >>>>> > >> > > > ====================
> >>>>> > >> > > >
> >>>>> > >> > > > runs example 1 for me successfully in a fully distributed
> >>>>> way
> >>>>> > which
> >>>>> > >> is
> >>>>> > >> > > > first step (map-only thing) for the word count.
> >>>>> > >> > > >
> >>>>> > >> > > > (I think there's a hickup somewhere here because in the
> >>>>> output i
> >>>>> > >> also
> >>>>> > >> > > seem
> >>>>> > >> > > > to see some empty lines, so the strsplit() part is perhaps
> >>>>> set up
> >>>>> > >> > > somewhat
> >>>>> > >> > > > incorrectly here, but it's not the point right now):
> >>>>> > >> > > >
> >>>>> > >> > > > ====Example1.R===========
> >>>>> > >> > > >
> >>>>> > >> > > > library(crunchR)
> >>>>> > >> > > >
> >>>>> > >> > > > pipeline <- crunchR.MRPipeline$new("test-pipeline")
> >>>>> > >> > > >
> >>>>> > >> > > > inputPCol <-
> >>>>> pipeline$readTextFile("/crunchr-examples/input")
> >>>>> > >> > > >
> >>>>> > >> > > > outputPCol <- inputPCol$parallelDo(
> >>>>> > >> > > > function(line) emit(
> >>>>> strsplit(tolower(line),"[^[:alnum:]]")[[1]] )
> >>>>> > >> > > > )
> >>>>> > >> > > >
> >>>>> > >> > > > outputPCol$writeTextFile("/crunchr-examples/output")
> >>>>> > >> > > >
> >>>>> > >> > > > result <- pipeline$run()
> >>>>> > >> > > >
> >>>>> > >> > > > if ( !result$succeeded() ) stop ("pipeline failed.")
> >>>>> > >> > > >
> >>>>> > >> > > > ========================================
> >>>>> > >> > > >
> >>>>> > >> > > > I think R-java communication now should support multiple
> >>>>> doFn ok
> >>>>> > and
> >>>>> > >> > they
> >>>>> > >> > > > will be properly shut down and executed and synchronized
> >>>>> even if
> >>>>> > >> they
> >>>>> > >> > > emit
> >>>>> > >> > > > in the cleanup phase.
> >>>>> > >> > > >
> >>>>> > >> > > > This example assumes a lot of defaults (such as RTypes
> >>>>> which are
> >>>>> > by
> >>>>> > >> > > default
> >>>>> > >> > > > character vector singleton in and character vector out
> for a
> >>>>> > DoFn).
> >>>>> > >> > Also
> >>>>> > >> > > > obviously uses text in-text out at this point only.
> >>>>> > >> > > >
> >>>>> > >> > > >
> >>>>> > >> > > > To run, install the package and upload the test input
> >>>>> > (test-prep.sh)
> >>>>> > >> > > > Assuming you have compiled the maven part, the R package
> >>>>> snapshot
> >>>>> > >> could
> >>>>> > >> > > be
> >>>>> > >> > > > installed by running "install-snapshot-rpkg.sh".
> >>>>> > >> > > >
> >>>>> > >> > > > You also need to make sure your backend tasks see JRI
> >>>>> library.
> >>>>> > there
> >>>>> > >> > are
> >>>>> > >> > > > multiple ways to do it i guess but for the purposes of
> >>>>> testing the
> >>>>> > >> > > > following just works for me in my mapred-site:
> >>>>> > >> > > >
> >>>>> > >> > > > <property>
> >>>>> > >> > > >    <name>mapred.child.java.opts</name>
> >>>>> > >> > > >
> >>>>> > >> > > >
> >>>>> > >> > > >
> >>>>> > >> > >
> >>>>> > >> >
> >>>>> > >>
> >>>>> >
> >>>>>
>  
> <value>-Djava.library.path=/home/dmitriy/R/x86_64-pc-linux-gnu-library/2/rJava/jri
> >>>>> > >> > > > </value>
> >>>>> > >> > > >    <final>false</final>
> >>>>> > >> > > > </property>
> >>>>> > >> > > >
> >>>>> > >> > > >
> >>>>> > >> > > > I think at this point you guys might help me by doing
> >>>>> review of
> >>>>> > that
> >>>>> > >> > > stuff,
> >>>>> > >> > > > asking questions and making suggestions how to go by
> >>>>> incorporating
> >>>>> > >> > other
> >>>>> > >> > > > types of doFn and perhaps a way to complete the word count
> >>>>> > example,
> >>>>> > >> > > perhaps
> >>>>> > >> > > > running comparative benchmarks with a java-only word
> count,
> >>>>> how
> >>>>> > much
> >>>>> > >> > > > overhead we seem to be suffering here.
> >>>>> > >> > > >
> >>>>> > >> > > > I use StatEt in eclipse. Although it is a huge way
> forward,
> >>>>> the
> >>>>> > >> process
> >>>>> > >> > > is
> >>>>> > >> > > > still extremely tedious since I don't know unit testing
> >>>>> framework
> >>>>> > >> in R
> >>>>> > >> > > well
> >>>>> > >> > > > (so i just scribble some stuff on the side to unit-test
> >>>>> this and
> >>>>> > >> that)
> >>>>> > >> > > and
> >>>>> > >> > > > the integration test running cycle is significant enough.
> >>>>> > >> > > >
> >>>>> > >> > > > Which is why any help and suggestions are very welcome!
> >>>>> > >> > > >
> >>>>> > >> > > > I will definitely add support for reading/writing sequence
> >>>>> files
> >>>>> > and
> >>>>> > >> > > > Protobufs, as well as Mahout DRM's .
> >>>>> > >> > > >
> >>>>> > >> > > >
> >>>>> > >> > > > Thanks.
> >>>>> > >> > > > -Dmitrity
> >>>>> > >> > > >
> >>>>> > >> > >
> >>>>> > >> >
> >>>>> > >>
> >>>>> > >>
> >>>>> > >>
> >>>>> > >> --
> >>>>> > >> Director of Data Science
> >>>>> > >> Cloudera <http://www.cloudera.com>
> >>>>> > >> Twitter: @josh_wills <http://twitter.com/josh_wills>
> >>>>> > >>
> >>>>> > >
> >>>>> > >
> >>>>> >
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >
>

Re: Crunch R first milestone

Reply via email to