On Sun, Nov 18, 2012 at 10:13 AM, Dmitriy Lyubimov <[email protected]> wrote:
> Thank you, Josh. Your insights are greatly appreciated.
>
> RProtoBuf has a bug with the <<- operator. I already contacted the authors
> and they confirmed it; however, it is not clear when they are going to fix it.
>
> (code to reproduce:
>> library(RProtoBuf)
>> a <<- "A"
> causes an error)
>
> Actually RProtoBuf is not used right now. I will move it into "recommended"
> realm again if it makes things easier.
>
> For me, the hardest part was actually getting the JVM + Hadoop to see the
> JRI library. I am still not sure about the best course of action here, but
> there is definitely more than one way.
>
> Also, my apologies for the code styling; it is probably the ugliest code I've
> ever written, but I will tidy it up once past the proof-of-concept stage.

No judgements, man. You should have seen the first rev of Crunch. ;-)

>
> -d
>
>
> On Sun, Nov 18, 2012 at 9:37 AM, Josh Wills <[email protected]> wrote:
>
>> Dmitriy,
>>
>> Just sent you a pull request based on playing with the code on OS X. It
>> contains a README about my experience getting things working.
>>
>> Unfortunately, I haven't succeeded in getting crunchR loaded; I'm running
>> into some issues w/RProtoBuf on OS X. I'll give it another go this week on
>> my Linux machine at work.
>>
>> J
>>
>>
>> On Sat, Nov 17, 2012 at 12:49 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>> > Josh,
>> >
>> > ok the following commit
>> >
>> > ==============
>> > commit 67605360838f810fa5ddf99abb3ef2962d3f05e3
>> > Author: Dmitriy Lyubimov <[email protected]>
>> > Date:   Sat Nov 17 12:29:27 2012 -0800
>> >
>> >     example1 succeeds
>> >
>> > ====================
>> >
>> > runs example 1 for me successfully in a fully distributed way, which is
>> > the first step (the map-only part) of the word count.
>> >
>> > (I think there's a hiccup somewhere here, because in the output I also seem
>> > to see some empty lines, so the strsplit() part is perhaps set up somewhat
>> > incorrectly, but that's not the point right now):
>> >
>> > ====Example1.R===========
>> >
>> > library(crunchR)
>> >
>> > pipeline <- crunchR.MRPipeline$new("test-pipeline")
>> >
>> > inputPCol <- pipeline$readTextFile("/crunchr-examples/input")
>> >
>> > outputPCol <- inputPCol$parallelDo(
>> > function(line) emit( strsplit(tolower(line),"[^[:alnum:]]")[[1]] )
>> > )
>> >
>> > outputPCol$writeTextFile("/crunchr-examples/output")
>> >
>> > result <- pipeline$run()
>> >
>> > if ( !result$succeeded() ) stop ("pipeline failed.")
>> >
>> > ========================================
>> >
>> > I think the R-Java communication should now support multiple DoFns, and
>> > they will be properly executed, synchronized, and shut down even if they
>> > emit in the cleanup phase.
>> >
>> > This example assumes a lot of defaults (such as RTypes, which by default
>> > are a singleton character vector in and a character vector out for a DoFn).
>> > It also obviously uses only text in, text out at this point.
>> >
>> >
>> > To run, install the package and upload the test input (test-prep.sh).
>> > Assuming you have compiled the Maven part, the R package snapshot can be
>> > installed by running "install-snapshot-rpkg.sh".
>> >
>> > You also need to make sure your backend tasks see the JRI library. There
>> > are multiple ways to do it, I guess, but for the purposes of testing the
>> > following just works for me in my mapred-site:
>> >
>> > <property>
>> >    <name>mapred.child.java.opts</name>
>> >    <value>-Djava.library.path=/home/dmitriy/R/x86_64-pc-linux-gnu-library/2/rJava/jri</value>
>> >    <final>false</final>
>> > </property>
>> >
>> >
>> > I think at this point you guys might help me by reviewing that stuff,
>> > asking questions, and making suggestions on how to go about incorporating
>> > other types of DoFn and perhaps a way to complete the word count example,
>> > perhaps also running comparative benchmarks against a Java-only word count
>> > to see how much overhead we seem to be suffering here.
>> >
>> > I use StatET in Eclipse. Although it is a huge step forward, the process
>> > is still extremely tedious, since I don't know the unit testing frameworks
>> > in R well (so I just scribble some stuff on the side to unit-test this and
>> > that) and the integration test running cycle is significant enough.
>> >
>> > Which is why any help and suggestions are very welcome!
>> >
>> > I will definitely add support for reading/writing sequence files and
>> > Protobufs, as well as Mahout DRMs.
>> >
>> >
>> > Thanks.
>> > -Dmitriy
>> >
>>
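
On the empty lines in the Example1.R output quoted above: one possible cause,
assuming emit() writes out every element it is handed, is that the pattern
"[^[:alnum:]]" matches a single character, so two separators in a row (", ")
leave an empty string between them. The sketch below is plain base R and does
not touch crunchR; tokenize() is a hypothetical helper name, not something
from the package.

```r
# Hypothetical helper (not part of crunchR): split a line into lowercase
# alphanumeric tokens and drop the empty strings strsplit() can return.
tokenize <- function(line) {
  # "+" collapses a run of separators such as ", " into one split point
  tokens <- strsplit(tolower(line), "[^[:alnum:]]+")[[1]]
  # a leading separator, as in " hello", still yields one leading "",
  # so filter out zero-length strings as well
  tokens[nzchar(tokens)]
}

# The pattern from Example1.R, for comparison
strsplit(tolower("Hello, world"), "[^[:alnum:]]")[[1]]  # "hello" "" "world"

tokenize("Hello, world")                                # "hello" "world"
```

If that is indeed the cause, the function passed to parallelDo() could call
emit() on the filtered vector instead of the raw strsplit() result.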



-- 
Director of Data Science
Cloudera
Twitter: @josh_wills
