OK, support for PTable emission of (key, value) pairs works in the latest commit.

My current problem is that composition of doFunctions doesn't work,
probably because of the sequence of cleanup() calls. I have to figure this out:

=============
This composition of 2 functions (PCollection, then PTable) is a problem:

# wordsPCol <- inputPCol$parallelDo(
# function(line) emit( strsplit(tolower(line),"[^[:alnum:]]+")[[1]] )
# )
#
# wordsPTab <- wordsPCol$parallelDo(function(word) emit(word,1),
# keyType = crunchR.RString$new(),
# valueType = crunchR.RUint32$new())

but this equivalent works:
wordsPTab <- inputPCol$parallelDo(
    function(line) {
        words <- strsplit(tolower(line), "[^[:alnum:]]+")[[1]]
        sapply(words, function(x) emit(x, 1))
    },
    keyType = crunchR.RString$new(),
    valueType = crunchR.RUint32$new()
)
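
A side note on the empty output lines mentioned in the quoted Nov 17 message
below: a minimal base-R sketch, assuming (not confirmed anywhere in this thread)
that they come from zero-length tokens produced by strsplit(). It compares the
single-character pattern from the quoted Example1.R with the "+" pattern used
above. The helper names tokenize_v1, tokenize_v2 and drop_empty are hypothetical
and are not part of crunchR.

```r
# Base R only; nothing here is crunchR API. tokenize_v1, tokenize_v2 and
# drop_empty are hypothetical helper names used for illustration.

# Pattern from the quoted Example1.R: every separator character splits.
tokenize_v1 <- function(line) strsplit(tolower(line), "[^[:alnum:]]")[[1]]

# Pattern from the snippets above: a run of separators counts as one split.
tokenize_v2 <- function(line) strsplit(tolower(line), "[^[:alnum:]]+")[[1]]

# Drop zero-length tokens before they would be emitted.
drop_empty <- function(words) words[nzchar(words)]

tokenize_v1("Hello, world")                 # "hello" "" "world"
tokenize_v2("Hello, world")                 # "hello" "world"
tokenize_v2("  indented line")              # "" "indented" "line"
drop_empty(tokenize_v2("  indented line"))  # "indented" "line"
```

Under that assumption, the "+" pattern removes empty tokens between adjacent
separators, but a line that starts with a separator still yields a leading "".
Filtering with nzchar() before calling emit() would cover that case as well.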



On Thu, Nov 22, 2012 at 2:13 PM, Dmitriy Lyubimov <[email protected]> wrote:

> OK, I guess I am going to work on the next milestone, which is PTableType
> serialization support between the R and Java sides.
>
> Once I am done with that, I guess I will be able to add the other APIs and
> complete the word count example fairly easily.
>
> Example1.R in its current state works.
>
>
> On Wed, Nov 21, 2012 at 12:11 PM, Josh Wills <[email protected]> wrote:
>
>> I'm going to play with this again over the break. BTW, did you see Renjin?
>> I somehow missed this, but it looks interesting.
>>
>> http://code.google.com/p/renjin/
>>
>>
>> On Sun, Nov 18, 2012 at 11:44 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>> > On Sun, Nov 18, 2012 at 9:37 AM, Josh Wills <[email protected]> wrote:
>> >
>> > > Dmitriy,
>> > >
>> > > Just sent you a pull request based on playing with the code on OS X.
>> > > It contains a README about my experience getting things working.
>> > >
>> >
>> > Are you sure it is the doxygen package? I thought it was the roxygen2
>> > package?
>> >
>> > Actually, there currently seems to be no best practice for R5 classes
>> > + roxygen2 (and the guy ignores @import order of files, too). Hence the
>> > hacks with file names.
>> >
>> >
>> > > Unfortunately, I haven't succeeded in getting crunchR loaded, I'm running
>> > > into some issues w/RProtoBuf on OS X. I'll give it another go this week on
>> > > my Linux machine at work.
>> > >
>> > OK, I removed @import RProtoBuf, so you should be able to install w/o it.
>> > Maven still compiles the protoc stuff, though.
>> >
>> > >
>> > > J
>> > >
>> > >
>> > > On Sat, Nov 17, 2012 at 12:49 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> > >
>> > > > Josh,
>> > > >
>> > > > ok the following commit
>> > > >
>> > > > ==============
>> > > > commit 67605360838f810fa5ddf99abb3ef2962d3f05e3
>> > > > Author: Dmitriy Lyubimov <[email protected]>
>> > > > Date:   Sat Nov 17 12:29:27 2012 -0800
>> > > >
>> > > >     example1 succeeds
>> > > >
>> > > > ====================
>> > > >
>> > > > runs example 1 for me successfully in a fully distributed way, which is
>> > > > the first step (the map-only part) of the word count.
>> > > >
>> > > > (I think there's a hiccup somewhere here, because in the output I also
>> > > > seem to see some empty lines, so the strsplit() part is perhaps set up
>> > > > somewhat incorrectly, but that's not the point right now):
>> > > >
>> > > > ====Example1.R===========
>> > > >
>> > > > library(crunchR)
>> > > >
>> > > > pipeline <- crunchR.MRPipeline$new("test-pipeline")
>> > > >
>> > > > inputPCol <- pipeline$readTextFile("/crunchr-examples/input")
>> > > >
>> > > > outputPCol <- inputPCol$parallelDo(
>> > > > function(line) emit( strsplit(tolower(line),"[^[:alnum:]]")[[1]] )
>> > > > )
>> > > >
>> > > > outputPCol$writeTextFile("/crunchr-examples/output")
>> > > >
>> > > > result <- pipeline$run()
>> > > >
>> > > > if ( !result$succeeded() ) stop ("pipeline failed.")
>> > > >
>> > > > ========================================
>> > > >
>> > > > I think R-Java communication should now support multiple DoFns, and
>> > > > they will be properly shut down, executed, and synchronized even if
>> > > > they emit in the cleanup phase.
>> > > >
>> > > > This example assumes a lot of defaults (such as RTypes, which by default
>> > > > are a character vector singleton in and a character vector out for a
>> > > > DoFn). Also, it obviously uses only text in, text out at this point.
>> > > >
>> > > >
>> > > > To run, install the package and upload the test input (test-prep.sh).
>> > > > Assuming you have compiled the Maven part, the R package snapshot can be
>> > > > installed by running "install-snapshot-rpkg.sh".
>> > > >
>> > > > You also need to make sure your backend tasks see the JRI library. There
>> > > > are multiple ways to do it, I guess, but for the purposes of testing the
>> > > > following just works for me in my mapred-site:
>> > > >
>> > > > <property>
>> > > >    <name>mapred.child.java.opts</name>
>> > > >    <value>-Djava.library.path=/home/dmitriy/R/x86_64-pc-linux-gnu-library/2/rJava/jri</value>
>> > > >    <final>false</final>
>> > > > </property>
>> > > >
>> > > >
>> > > > I think at this point you guys might help me by reviewing that stuff,
>> > > > asking questions, and making suggestions on how to go about
>> > > > incorporating other types of DoFn and perhaps a way to complete the word
>> > > > count example, perhaps running comparative benchmarks against a
>> > > > Java-only word count to see how much overhead we seem to be suffering
>> > > > here.
>> > > >
>> > > > I use StatET in Eclipse. Although it is a huge step forward, the process
>> > > > is still extremely tedious since I don't know the unit testing
>> > > > frameworks in R well (so I just scribble some stuff on the side to
>> > > > unit-test this and that), and the integration test running cycle is
>> > > > significant enough.
>> > > >
>> > > > Which is why any help and suggestions are very welcome!
>> > > >
>> > > > I will definitely add support for reading/writing sequence files and
>> > > > Protobufs, as well as Mahout DRMs.
>> > > >
>> > > >
>> > > > Thanks.
>> > > > -Dmitriy
>> > > >
>> > >
>> >
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>
