Re: [R] Good Package(s) for String and URL processing?

2010-07-02 Thread Tobias Verbeke

On 07/02/2010 05:51 AM, Erik Iverson wrote:

Ralf B wrote:

Are there packages that allow improved String and URL processing?
E.g. extract parts of a URL such as sub-domains, top-level domain,
protocols (e.g. https, http, ftp), file type based on endings, check
if a URL is valid or not, etc...

I am currently only using split and paste. Are there better and more
efficient ways to handle strings e.g. finding sub-strings or to do
pattern matching?
What packages do you use if you have to do a lot of String processing
and you don't have the option to go to another language such as Perl
or Python?



Well, much of the power of Perl is built on top of regular expressions,
which R also supports.

See ?regex for more details. Also the R functions ?grep, ?sub, etc.

I can also highly recommend the book Mastering Regular Expressions. It
does not cover R explicitly, but what you learn there can be applied
directly to R. Regexes go a very long way toward finding substrings and
pattern matching.

You might find some things in RCurl helpful:

http://www.omegahat.org/RCurl/

Probably others...


Including gsubfn by Gabor Grothendieck
and stringr by Hadley Wickham

http://cran.r-project.org/web/packages/gsubfn/index.html
http://cran.r-project.org/web/packages/stringr/index.html
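For a flavour of what stringr offers for this task, here is a minimal sketch. It assumes a current stringr release (the 2010 version may behave differently), and the example URLs, the pattern, and the column names are made up for illustration.

```r
library(stringr)

# Hypothetical example inputs (not from the thread)
urls <- c("http://abc.com/main/def.html",
          "ftp://files.example.org/pub/readme.txt",
          "not a url")

# TRUE where the string starts with a lower-case scheme followed by "://"
has_scheme <- str_detect(urls, "^[a-z]+://")

# One row per input: column 1 is the full match, the remaining columns are
# the parenthesized groups. Inputs that do not match give a row of NA.
parts <- str_match(urls, "^([a-z]+)://([^/]+)(/.*)?$")
colnames(parts) <- c("url", "protocol", "host", "path")
print(parts)
```

The pattern is deliberately loose; it illustrates the calling convention rather than a complete URL grammar.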

Best,
Tobias

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Good Package(s) for String and URL processing?

2010-07-02 Thread Gabor Grothendieck
On Thu, Jul 1, 2010 at 11:08 PM, Ralf B ralf.bie...@gmail.com wrote:
 Are there packages that allow improved String and URL processing?
 E.g. extract parts of a URL such as sub-domains, top-level domain,
 protocols (e.g. https, http, ftp), file type based on endings, check
 if a URL is valid or not, etc...

You are asking to match and extract by content rather than by delimiter,
and you can do that with the strapply function in the gsubfn package.
Here is an example. You can likely improve on the regular expression,
but this gives the idea. In the first strapply example we just return
the back references (the portions in parentheses), and in the second we
display the various parts labelled with their names. To remember the
argument order, note that just as apply takes object/modifier/function,
so does strapply; however, the modifier for strapply is a pattern rather
than an array margin. In the first example the function is just c, and
in the second we use a formula notation, which strapply converts to a
function. In this case it constructs the function
   function(...) cat("protocol:", ..1, "server:", ..3, "host:", ..4,
     "domain:", ..5, "path:", ..7, "\n")
which we could have used in place of the formula.

> library(gsubfn)
> myurl <- "http://abc.com/main/def.html"
> pat <- "^(\\w+)://((\\w+)[.])?(\\w+)[.](\\w+)(/(.*))$"

> strapply(myurl, pat, c, simplify = unlist)
[1] "http"           ""               ""               "abc"
[5] "com"            "/main/def.html" "main/def.html"

> junk <- strapply(myurl, pat, ~ cat("protocol:", ..1, "server:", ..3, "host:",
+   ..4, "domain:", ..5, "path:", ..7, "\n"))
protocol: http server:  host: abc domain: com path: main/def.html

gsubfn and strapply in the gsubfn package support ordinary regular
expressions and Perl regular expressions, as in R, and also support Tcl
regular expressions.
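For comparison, the same pattern can probably be applied without any add-on package. The following is a sketch using base R's regexec and regmatches, not part of gsubfn; the names given to the pieces simply mirror the labels used with strapply.

```r
myurl <- "http://abc.com/main/def.html"
pat <- "^(\\w+)://((\\w+)[.])?(\\w+)[.](\\w+)(/(.*))$"

# regexec() records the full match plus every parenthesized group;
# regmatches() turns those positions into strings.
m <- regmatches(myurl, regexec(pat, myurl))[[1]]

# Element 1 is the whole match, so back reference k sits at position k + 1.
parts <- c(protocol = m[2], server = m[4], host = m[5],
           domain = m[6], path = m[8])
print(parts)
```

Unlike strapply, this approach does not apply a function to each match directly; any further processing has to be done on the extracted vector.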


 I am currently only using split and paste. Are there better and more
 efficient ways to handle strings e.g. finding sub-strings or to do
 pattern matching?

Read the help pages of these commands:

> help.search(keyword = "character", package = "base")
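A few of the base functions that search turns up can be sketched as follows. The selection and the example string are only an illustration, not an exhaustive list.

```r
s <- "http://abc.com/main/def.html"

# Extract a substring by position
scheme <- substr(s, 1, 4)

# Split on a literal "/" (fixed = TRUE turns off regex interpretation)
pieces <- strsplit(s, "/", fixed = TRUE)[[1]]

# Position of the first occurrence of a literal substring (-1 if absent)
pos <- regexpr("abc", s, fixed = TRUE)

# Test for a pattern at the end of the string
is_html <- grepl("\\.html$", s)

print(list(scheme = scheme, pieces = pieces,
           pos = as.integer(pos), is_html = is_html))
```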

 What packages do you use if you have to do a lot of String processing
 and you don't have the option to go to another language such as Perl
 or Python?

See the gsubfn home page at: http://gsubfn.googlecode.com



Re: [R] Good Package(s) for String and URL processing?

2010-07-02 Thread David Winsemius


On Jul 1, 2010, at 11:08 PM, Ralf B wrote:


Are there packages that allow improved String and URL processing?
E.g. extract parts of a URL such as sub-domains, top-level domain,
protocols (e.g. https, http, ftp), file type based on endings, check
if a URL is valid or not, etc...

I am currently only using split and paste. Are there better and more
efficient ways to handle strings e.g. finding sub-strings or to do
pattern matching?
What packages do you use if you have to do a lot of String processing
and you don't have the option to go to another language such as Perl
or Python?


You may want to look at the tm package.

--
David Winsemius, MD
West Hartford, CT



[R] Good Package(s) for String and URL processing?

2010-07-01 Thread Ralf B
Are there packages that allow improved String and URL processing?
E.g. extract parts of a URL such as sub-domains, top-level domain,
protocols (e.g. https, http, ftp), file type based on endings, check
if a URL is valid or not, etc...

I am currently only using split and paste. Are there better and more
efficient ways to handle strings e.g. finding sub-strings or to do
pattern matching?
What packages do you use if you have to do a lot of String processing
and you don't have the option to go to another language such as Perl
or Python?

Thanks,
Ralf



Re: [R] Good Package(s) for String and URL processing?

2010-07-01 Thread Erik Iverson

Ralf B wrote:

Are there packages that allow improved String and URL processing?
E.g. extract parts of a URL such as sub-domains, top-level domain,
protocols (e.g. https, http, ftp), file type based on endings, check
if a URL is valid or not, etc...

I am currently only using split and paste. Are there better and more
efficient ways to handle strings e.g. finding sub-strings or to do
pattern matching?
What packages do you use if you have to do a lot of String processing
and you don't have the option to go to another language such as Perl
or Python?



Well, much of the power of Perl is built on top of regular expressions, which R 
also supports.


See ?regex for more details.  Also the R functions ?grep, ?sub, etc.

I can also highly recommend the book Mastering Regular Expressions. It does
not cover R explicitly, but what you learn there can be applied directly to
R. Regexes go a very long way toward finding substrings and pattern
matching.
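As a minimal sketch of how far the base functions go, the following uses only grepl and sub (plus tools::file_ext) on a few made-up URLs. The URLs, the variable names, and the deliberately loose patterns are assumptions for illustration, not a complete URL validator.

```r
# Hypothetical example inputs (not from the thread)
urls <- c("http://abc.com/main/def.html",
          "https://www.example.org/data/file.csv",
          "not a url")

# Loose validity check: a scheme, "://", then at least one host character
is_url <- grepl("^[A-Za-z][A-Za-z0-9+.-]*://[^[:space:]/]+", urls)

# Protocol: everything before "://"
protocol <- ifelse(is_url, sub("://.*$", "", urls), NA_character_)

# Host: the text between "://" and the next "/" or ":"
host <- ifelse(is_url, sub("^[^:]+://([^/:]+).*$", "\\1", urls), NA_character_)

# Path: strip scheme and host, then take the file extension of what is left
path <- sub("^[^:]+://[^/]+", "", urls)
ext  <- ifelse(is_url, tools::file_ext(path), NA_character_)

print(data.frame(url = urls, protocol, host, ext))
```

Splitting the host further into sub-domain and top-level domain needs more care than a single pattern, since suffixes such as co.uk span two labels.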


You might find some things in RCurl helpful:

http://www.omegahat.org/RCurl/

Probably others...
