Re: Crunch R first milestone

Dmitriy Lyubimov Fri, 23 Nov 2012 13:59:37 -0800

OK, support for PTable emission of (key, value) pairs works in the latest commit.


My current problem is that composition of doFunctions doesn't work,
probably because of the order of the cleanup() calls. I still have to
figure this out:

=============
This composition of 2 functions (PCollection, then PTable) is a problem:

# wordsPCol <- inputPCol$parallelDo(
# function(line) emit( strsplit(tolower(line),"[^[:alnum:]]+")[[1]] )
# )
#
# wordsPTab <- wordsPCol$parallelDo(function(word) emit(word,1),
# keyType = crunchR.RString$new(),
# valueType = crunchR.RUint32$new())

but this equivalent works:
wordsPTab <- inputPCol$parallelDo(
  function(line) {
    words <- strsplit(tolower(line), "[^[:alnum:]]+")[[1]]
    sapply(words, function(x) emit(x, 1))
  },
  keyType = crunchR.RString$new(),
  valueType = crunchR.RUint32$new()
)
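
A side note on the tokenization, since the empty lines seen in the Example1.R
output further down the thread most likely come from strsplit(). The sketch
below uses base R only; tokenize() and drop_empty() are hypothetical helper
names for illustration, not part of crunchR. It shows that splitting on a
single non-alphanumeric character yields "" between adjacent separators, and
that even the "+" pattern leaves a leading "" when a line starts with a
separator.

```r
# Hypothetical helpers for illustration only; nothing here is crunchR API.
tokenize <- function(line, pattern) strsplit(tolower(line), pattern)[[1]]

# Drop the empty strings that strsplit() can produce.
drop_empty <- function(words) words[nzchar(words)]

# One split per separator character: ", " is two separators, so an empty
# token appears between them.
single <- tokenize("Hello, world", "[^[:alnum:]]")

# Runs of separators are treated as one split point.
greedy <- tokenize("Hello, world", "[^[:alnum:]]+")

# A separator at the start of the line still produces a leading "".
leading <- tokenize("  Hello", "[^[:alnum:]]+")

print(single)               # "hello" ""      "world"
print(greedy)               # "hello" "world"
print(leading)              # ""      "hello"
print(drop_empty(leading))  # "hello"
```

Filtering with nzchar() before emitting is one way to avoid the empty
records; whether that belongs in the user function or in the framework is a
separate design question.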

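To make the suspected cleanup() ordering issue concrete, here is a toy model
in base R. It is only an assumption about what might be going wrong, not how
crunchR actually works: make_sink(), make_buffered() and run_chain() are
hypothetical names. An upstream stage buffers its output and emits it during
cleanup; if the downstream stage has already been cleaned up, those emits
have nowhere to go.

```r
# Toy model only: these helpers are hypothetical and are not crunchR code.

# Downstream stage: collects values until it has been cleaned up.
make_sink <- function() {
  collected <- character(0)
  closed <- FALSE
  list(
    accept = function(x) {
      if (closed) stop("emit after cleanup")
      collected <<- c(collected, x)
    },
    cleanup = function() closed <<- TRUE,
    result = function() collected
  )
}

# Upstream stage: buffers values and emits them only in its cleanup phase.
make_buffered <- function(sink) {
  pending <- character(0)
  list(
    accept = function(x) pending <<- c(pending, toupper(x)),
    cleanup = function() {
      for (p in pending) sink$accept(p)
      pending <<- character(0)
    }
  )
}

# Feed two values through the chain, then run cleanups in the given order.
run_chain <- function(cleanup_order) {
  sink <- make_sink()
  stage <- make_buffered(sink)
  for (w in c("a", "b")) stage$accept(w)
  steps <- list(upstream = stage$cleanup, downstream = sink$cleanup)
  tryCatch({
    for (name in cleanup_order) steps[[name]]()
    sink$result()
  }, error = function(e) conditionMessage(e))
}

print(run_chain(c("upstream", "downstream")))  # "A" "B"
print(run_chain(c("downstream", "upstream")))  # "emit after cleanup"
```

In this model the chain only works when cleanup runs upstream first, which
is the property I would want to check in the real R-Java handshake.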


On Thu, Nov 22, 2012 at 2:13 PM, Dmitriy Lyubimov <[email protected]> wrote:

> OK, I guess I am going to work on the next milestone, which is PTableType
> serialization support between the R and Java sides.
>
> Once I am done with that, I guess I will be able to add the other API and
> complete the word count example fairly easily.
>
> Example1.R in its current state works.
>
>
> On Wed, Nov 21, 2012 at 12:11 PM, Josh Wills <[email protected]> wrote:
>
>> I'm going to play with this again over the break. BTW, did you see Renjin?
>> I somehow missed this, but it looks interesting.
>>
>> http://code.google.com/p/renjin/
>>
>>
>> On Sun, Nov 18, 2012 at 11:44 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>> > On Sun, Nov 18, 2012 at 9:37 AM, Josh Wills <[email protected]> wrote:
>> >
>> > > Dmitriy,
>> > >
>> > > Just sent you a pull request based on playing with the code on OS X.
>> > > It contains a README about my experience getting things working.
>> > >
>> >
>> > Are you sure it is the doxygen package? I thought it was the roxygen2
>> > package?
>> >
>> > Actually, there currently seems to be no established best practice for R5
>> > classes + roxygen2 (and roxygen2 ignores the @import order of files, too).
>> > Hence the hacks with file names.
>> >
>> >
>> > > Unfortunately, I haven't succeeded in getting crunchR loaded; I'm
>> > > running into some issues w/RProtoBuf on OS X. I'll give it another go
>> > > this week on my Linux machine at work.
>> > >
>> > OK, I removed @import RProtoBuf, so you should be able to install w/o it.
>> > Maven still compiles the protoc stuff, though.
>> >
>> > >
>> > > J
>> > >
>> > >
>> > > On Sat, Nov 17, 2012 at 12:49 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> > >
>> > > > Josh,
>> > > >
>> > > > ok the following commit
>> > > >
>> > > > ==============
>> > > > commit 67605360838f810fa5ddf99abb3ef2962d3f05e3
>> > > > Author: Dmitriy Lyubimov <[email protected]>
>> > > > Date:   Sat Nov 17 12:29:27 2012 -0800
>> > > >
>> > > >     example1 succeeds
>> > > >
>> > > > ====================
>> > > >
>> > > > runs example 1 for me successfully in a fully distributed way, which is
>> > > > the first step (the map-only part) of the word count.
>> > > >
>> > > > (I think there's a hiccup somewhere here, because in the output I also
>> > > > seem to see some empty lines, so the strsplit() part is perhaps set up
>> > > > somewhat incorrectly, but that's not the point right now):
>> > > >
>> > > > ====Example1.R===========
>> > > >
>> > > > library(crunchR)
>> > > >
>> > > > pipeline <- crunchR.MRPipeline$new("test-pipeline")
>> > > >
>> > > > inputPCol <- pipeline$readTextFile("/crunchr-examples/input")
>> > > >
>> > > > outputPCol <- inputPCol$parallelDo(
>> > > > function(line) emit( strsplit(tolower(line),"[^[:alnum:]]")[[1]] )
>> > > > )
>> > > >
>> > > > outputPCol$writeTextFile("/crunchr-examples/output")
>> > > >
>> > > > result <- pipeline$run()
>> > > >
>> > > > if ( !result$succeeded() ) stop ("pipeline failed.")
>> > > >
>> > > > ========================================
>> > > >
>> > > > I think R-Java communication should now support multiple DoFns, and
>> > > > they will be properly executed, synchronized, and shut down even if
>> > > > they emit in the cleanup phase.
>> > > >
>> > > > This example assumes a lot of defaults (such as RTypes, which by
>> > > > default are character vector singleton in and character vector out for
>> > > > a DoFn). It also obviously uses only text in, text out at this point.
>> > > >
>> > > >
>> > > > To run, install the package and upload the test input (test-prep.sh).
>> > > > Assuming you have compiled the Maven part, the R package snapshot can
>> > > > be installed by running "install-snapshot-rpkg.sh".
>> > > >
>> > > > You also need to make sure your backend tasks see the JRI library.
>> > > > There are multiple ways to do it, I guess, but for the purposes of
>> > > > testing the following just works for me in my mapred-site:
>> > > >
>> > > > <property>
>> > > >    <name>mapred.child.java.opts</name>
>> > > >    <value>-Djava.library.path=/home/dmitriy/R/x86_64-pc-linux-gnu-library/2/rJava/jri</value>
>> > > >    <final>false</final>
>> > > > </property>
>> > > >
>> > > >
>> > > > I think at this point you guys might help me by reviewing that stuff,
>> > > > asking questions, and suggesting how to go about incorporating other
>> > > > types of DoFn and perhaps a way to complete the word count example,
>> > > > perhaps also running comparative benchmarks against a Java-only word
>> > > > count to see how much overhead we are suffering here.
>> > > >
>> > > > I use StatET in Eclipse. Although it is a huge step forward, the
>> > > > process is still extremely tedious, since I don't know the unit testing
>> > > > frameworks in R well (so I just scribble some stuff on the side to
>> > > > unit-test this and that) and the integration test cycle is
>> > > > significant enough.
>> > > >
>> > > > Which is why any help and suggestions are very welcome!
>> > > >
>> > > > I will definitely add support for reading/writing sequence files and
>> > > > Protobufs, as well as Mahout DRMs.
>> > > >
>> > > >
>> > > > Thanks.
>> > > > -Dmitriy
>> > > >
>> > >
>> >
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>
