Re: [R] Using R for Production - Discussion

2010-11-02 Thread baptiste auguie
Hi,

Regarding your '10 commandments' in Q3, you may find useful tips in
The R Inferno by Pat Burns.

HTH,

baptiste


Re: [R] Using R for Production - Discussion

2010-11-02 Thread Douglas Bates
Regarding performance and the size of data sets, I would suggest viewing
the presentation that Dirk Eddelbuettel and Romain Francois gave at
Google recently.  David Smith links to it in his blog at
blog.revolutionanalytics.com

One of the advantages of Open Source systems is that people can
provide many different kinds of hooks into the code.

At present R vector objects use 32-bit signed integers for indexing,
which limits the length of an individual vector to 2^31 - 1 elements.
There are some methods available for using external storage to bypass
this, but they do introduce another level of complexity.
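
A quick illustration of that indexing limit from an R session (the
exact error wording varies by R version):

  .Machine$integer.max     # 2147483647, i.e. 2^31 - 1, the maximum vector length
  ## asking for anything longer fails immediately:
  ## x <- numeric(2^31)    # Error: vector size specified is too large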


Re: [R] Using R for Production - Discussion

2010-11-02 Thread Saeed Abu Nimeh
I worked on a project where we used a random forest classifier to
predict a binary response. We trained the model in the EC2 cloud with 3
million observations and 44 features, and stored the model generated by
R using save(mymodel, file = "model.Rdata"). Now we use model.Rdata
locally to predict new observations.
On our local system, we built a parser in Perl to generate the CSV
representation of the observation we want to predict, then used RSPerl
to communicate between Perl and R. There is one catch: instead of
loading the random forest model (model.Rdata) every time we want to
predict a new observation, we keep an R console running as a daemon
with model.Rdata already loaded, and send each observation to be
predicted from Perl to R. If anyone else has better solutions/ideas,
please feel free to share.
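
In outline, the R side looks something like this (file names and the
model formula are placeholders; assumes the randomForest package):

  library(randomForest)

  ## on EC2: train once, then serialize the fitted model
  # train   <- read.csv("training_data.csv")            # ~3M rows, 44 features
  # mymodel <- randomForest(response ~ ., data = train)
  # save(mymodel, file = "model.Rdata")

  ## locally, inside the long-running R session: load once, predict many times
  load("model.Rdata")                      # restores 'mymodel'
  new_obs <- read.csv("observation.csv")   # one row written by the Perl parser
  predict(mymodel, newdata = new_obs)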
Thanks,
Saeed


[R] Using R for Production - Discussion

2010-11-01 Thread Santosh Srinivas
Hello Group,

This is an open-ended question.

I have been quite fascinated by the things I can do, and the control I
have over my work, since I started using R.
I have mainly been using it for analytical work on my desktop.
My experience has been quite good, and most of the issues I need to
investigate and solve are typical items related to data errors, format
corruption, etc. ... not necessarily R-related.

Complementing this with Python gives enough firepower to do lots of
production (analytics-related) activities in the cloud (from my research,
every innovative technology provider seems to support Python ...
Google, Amazon, etc.).

Questions on using R for production activities:
Q1) Does anyone have experience of using R scripts etc. for
production-related activities, e.g. serving a computational /
analytical / simulation environment from a web portal with the
analytical processing done in R?
I've seen that most useful things for normal (not rocket-science)
business (the 80-20 rule) can be done just as well in R as in tools
like SAS, Matlab, etc.

Q2) I haven't tried the processing routines on much larger data sets,
assuming size is not a constraint nowadays.
I know that I should just try it out ... but any forewarnings would
help. Is something that works on my desktop dataset just as likely to
work when scaled up to a cloud dataset, assuming that I clear out
unused objects (see the sketch below), avoid infinite loops, etc.?

I.e., is there any problem with the fundamental architecture of R
itself (as press articles often claim)?
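
By 'clearing out unused objects' I mean basic housekeeping along these
lines (a minimal sketch, with made-up object names):

  big <- matrix(rnorm(1e6), ncol = 100)   # some large intermediate result
  res <- colMeans(big)                    # keep only the summary I need
  rm(big)                                 # drop the big object once done
  gc()                                    # let R release memory where it can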


Q3) There are big fans of the SAS and Matlab (MathWorks) environments
out there ... does anyone have a comparison of how R fares?
From my experience R is quite neat and low-level ... so overheads
should be quite low.
Most slowness comes from lack of knowledge (see my own code ... using
the wrong structures, functions, loops, etc.) rather than from anything
wrong with R itself; the sketch below shows the kind of thing I mean.
Perhaps there is no commercial focus on enhancing performance, but my
guess is that it is just a matter of time until the community evolves
the language to score higher there too.
And perhaps develops documentation to assist challenged users with
performance tips (the 'ten commandments' type).
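
As an example of the self-inflicted slowness I mean (a minimal sketch;
timings are illustrative):

  n <- 1e5
  system.time({ x <- numeric(0); for (i in 1:n) x[i] <- i^2 })  # grows x every pass: slow
  system.time({ y <- (1:n)^2 })                                 # vectorized: near-instant
  all.equal(x, y)                                               # same result either way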

Q4) You must have heard the latest comment from James Goodnight of SAS:
'We haven't noticed that a lot. Most of our companies need industrial
strength software that has been tested, put through every possible
scenario or failure to make sure everything works correctly.'
My gut feeling is that random passionate geeks (playing part-time) do
better testing than an army of professionals ... (but I have no
empirical evidence here).

I am not taking a side here (although I appreciate those who do!) ...
but I am looking for objective reasoning.

Thanks,
S
