Re: [R] Newbie - Scrape Data From PDFs?

2018-01-23 Thread Ulrik Stervbo
I think I would use pdftk to extract the form data, and do all subsequent
manipulation in R.
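
A minimal sketch of that workflow, assuming pdftk is installed and on the
PATH ("form.pdf" and "fields.txt" are placeholder file names):

# dump the form fields to a text file
system2("pdftk", c("form.pdf", "dump_data_fields", "output", "fields.txt"))
txt <- readLines("fields.txt")
# pdftk emits one "FieldName: ..." line per form field, plus a
# "FieldValue: ..." line for fields that are filled in
field_names  <- sub("^FieldName: ",  "", grep("^FieldName: ",  txt, value = TRUE))
field_values <- sub("^FieldValue: ", "", grep("^FieldValue: ", txt, value = TRUE))
# note: empty fields produce no FieldValue line, so align the two with care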

HTH
Ulrik

Eric Berger wrote on Wed., 24 Jan. 2018, 08:11:

> [...]


[R] Function gutenberg_download in the gutenbergr package

2018-01-23 Thread Patrick Connolly

I've been working through https://www.tidytextmining.com/tidytext.html
wherein everything worked until I got to this part in section 1.5

> hgwells <- gutenberg_download(c(35, 36, 5230, 159))
Determining mirror for Project Gutenberg from 
http://www.gutenberg.org/robot/harvest
Error in open.connection(con, "rb") : 
  Failed to connect to www.gutenberg.org port 80: Connection timed out

That indicates the problem is at the very start:

  if (is.null(mirror)) {
mirror <- gutenberg_get_mirror(verbose = verbose)
  }

The documentation for gutenberg_get_mirror indicates there's nothing
different I could set.

So I tried specifying my usual mirror:

> hgwells <- gutenberg_download(c(1260, 768, 969, 9182, 767), mirror = 
> "http://cran.stat.auckland.ac.nz";)
Error in read_zip_url(full_url) : could not find function "read_zip_url"
> 

That is indeed strange, since according to

> help.search("read_zip_url")
Help files with alias or concept or title matching ‘read_zip_url’ using
regular expression matching:


gutenbergr::read_zip_url
Read a file from a .zip URL
  Aliases: read_zip_url

[...]

And according to 
library(help = "gutenbergr")

[...]
Index:

gutenberg_authors   Metadata about Project Gutenberg authors
gutenberg_download  Download one or more works using a Project
Gutenberg ID
gutenberg_get_mirrorGet the recommended mirror for Gutenberg files
gutenberg_metadata  Gutenberg metadata about each work
gutenberg_strip Strip header and footer content from a Project
Gutenberg book
gutenberg_subjects  Gutenberg metadata about the subject of each
work
gutenberg_works Get a filtered table of Gutenberg work metadata
read_zip_urlRead a file from a .zip URL

[...]

However, when I look at the objects in that part of the search() path,
read_zip_url is not there, though all the rest of that list are present.
So it's not surprising that it isn't found; but it puzzles me that it is
missing.
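
A quick check of the namespace makes that concrete (the commented results
are what I expect if the function is defined internally but not exported):

"read_zip_url" %in% ls("package:gutenbergr")               # FALSE: not exported
exists("read_zip_url", envir = asNamespace("gutenbergr"))  # TRUE: in the namespace
gutenbergr:::read_zip_url                                  # ':::' reaches unexported objects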

Ideas as to how I should proceed would be gratefully appreciated.


> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: /home/hrapgc/local/R-3.4.2/lib/libRblas.so
LAPACK: /home/hrapgc/local/R-3.4.2/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_NZ.UTF-8   LC_NUMERIC=C  
 [3] LC_TIME=en_NZ.UTF-8LC_COLLATE=en_NZ.UTF-8
 [5] LC_MONETARY=en_NZ.UTF-8LC_MESSAGES=en_NZ.UTF-8   
 [7] LC_PAPER=en_NZ.UTF-8   LC_NAME=C 
 [9] LC_ADDRESS=C   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_NZ.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] grDevices utils stats graphics  methods   base 

other attached packages:
 [1] sos_2.0-0  brew_1.0-6 gutenbergr_0.1.3   ggplot2_2.2.1 
 [5] stringr_1.2.0  bindrcpp_0.2   dplyr_0.7.4janeaustenr_0.1.5 
 [9] tidytext_0.1.6 FactoMineR_1.38readxl_1.0.0   tm_0.7-3  
[13] NLP_0.1-11 wordcloud_2.5  RColorBrewer_1.1-2 lattice_0.20-35   

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.13 cellranger_1.1.0 compiler_3.4.2  
 [4] plyr_1.8.4   bindr_0.1tokenizers_0.1.4
 [7] tools_3.4.2  gtable_0.2.0 tibble_1.3.4
[10] nlme_3.1-131 pkgconfig_2.0.1  rlang_0.1.2 
[13] Matrix_1.2-11psych_1.7.8  curl_3.0
[16] parallel_3.4.2   xml2_1.1.1   cluster_2.0.6   
[19] hms_0.3  flashClust_1.01-2grid_3.4.2  
[22] scatterplot3d_0.3-40 glue_1.1.1   ellipse_0.3-8   
[25] R6_2.2.2 foreign_0.8-69   readr_1.1.1 
[28] purrr_0.2.4  tidyr_0.7.2  reshape2_1.4.2  
[31] magrittr_1.5 scales_0.5.0 SnowballC_0.5.1 
[34] MASS_7.3-47  leaps_3.0assertthat_0.2.0
[37] mnormt_1.5-5 colorspace_1.3-2 labeling_0.3
[40] stringi_1.1.5lazyeval_0.2.1   munsell_0.4.3   
[43] slam_0.1-42  broom_0.4.2 
> 

-- 
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.   
   ___Patrick Connolly   
 {~._.~}   Great minds discuss ideas
 _( Y )_ Average minds discuss events 
(:_~*~_:)  Small minds discuss people  
 (_)-(_)  . Eleanor Roosevelt
  
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.


Re: [R] Newbie - Scrape Data From PDFs?

2018-01-23 Thread Eric Berger
Hi Scott,
I have never done this myself but I read something recently on the
r-help distribution that was related.
I just did a quick search and found a few hits that might work for you.

1. https://medium.com/@CharlesBordet/how-to-extract-and-clean-data-from-pdf-files-in-r-da11964e252e
2. http://bxhorn.com/2016/extract-data-tables-from-pdf-files-in-r/
3. https://www.rdocumentation.org/packages/textreadr/versions/0.7.0/topics/read_pdf
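
For a flavour of the approach those describe, here is a minimal sketch using
the pdftools package; the package choice and the file name "traffic.pdf" are
my assumptions, not something taken from the links:

library(pdftools)                 # install.packages("pdftools") if needed
pages <- pdf_text("traffic.pdf")  # one character string per page
# split the first page into lines for further parsing
page1_lines <- strsplit(pages[1], "\n")[[1]]
head(page1_lines)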

HTH,
Eric

On Wed, Jan 24, 2018 at 3:58 AM, Scott Clausen wrote:
> [...]


[R] Newbie - Scrape Data From PDFs?

2018-01-23 Thread Scott Clausen
Hello,

I’m new to R and am using it with RStudio to learn the language. I’m doing so
because I have quite a lot of traffic data I would like to explore. My problem
is that all the data is located in a number of PDFs. Can someone point me to
info on gathering data from sources like these? I’ve been to the R FAQ and
didn’t see anything, and would appreciate your thoughts.

 I am quite sure now that often, very often, in matters concerning religion and 
politics a man's reasoning powers are not above the monkey's.

-- Mark Twain


Re: [R] Scraping from different level URLs website

2018-01-23 Thread Jeff Newmiller
They seem to release their data in XML and CSV formats as well... why are you
scraping?
-- 
Sent from my phone. Please excuse my brevity.

On January 23, 2018 9:31:01 AM PST, Ilio Fornasero wrote:
> [...]



[R] Scraping from different level URLs website

2018-01-23 Thread Ilio Fornasero
I am doing research on World Bank (WB) projects in developing countries. To do
so, I am scraping their website to collect the data I am interested in.

The structure of the webpage I want to scrape is the following:

  1. A list of countries: all the countries in which the WB has developed
projects.

1.1. Clicking on a single country in 1. brings up that country's project list,
which includes all the projects in that country. (Each country's list spans a
number of web pages; here I have included just one page for a single country.)

1.1.1. Clicking on a single project in 1.1. brings up, among other things, the
project's overview, which is the part I am interested in.

In other words, my problem is to find a way to create a dataframe that includes
all the countries, the complete list of projects for each country, and the
overview of each project.


So far, this is the code that I have (unsuccessfully) written:

library(rvest)   # read_html(), html_nodes(), html_node(), html_text(), html_attr()
library(purrr)   # map_df(), map()
library(dplyr)   # mutate() and %>%
library(tibble)  # tibble()

WB_links <- "http://projects.worldbank.org/country?lang=en&page=projects"

WB_proj <- function(x) {
  Sys.sleep(5)  # be polite between requests
  url <- sprintf("http://projects.worldbank.org/search?lang=en&searchTerm=&countrycode_exact=%s", x)
  html <- read_html(url)
  tibble(title       = html_nodes(html, ".grid_20") %>% html_text(trim = TRUE),
         project_url = html_nodes(html, ".grid_20") %>% html_attr("href"))
}

WB_scrape <- map_df(1:5, WB_proj) %>%
  mutate(study_description =
           map(project_url,
               ~ read_html(sprintf("http://projects.worldbank.org/search?lang=en&searchTerm=&countrycode_exact=%s", .x)) %>%
                 html_node() %>%  # this fails: html_node() needs a CSS selector
                 html_text()))


Any suggestions?

Note: I am sorry if this question seems trivial, but I am quite a newbie in R
and I haven't found help on this by looking around (though I could have missed
something, of course).




[R] interaction term by a factor group in gamm4

2018-01-23 Thread Maria Lathouri via R-help
Dear all,
I am writing because I really need your help with a problem in gamm4. I have
tried to find a solution online, but I wasn't very successful.
I am running a gamm4 model with an interaction between two variables using the
tensor term t2. I have a grouping variable (super.end.group) with six levels;
I would like to run the model to see how the interaction term varies across the
factor levels, and to get the plot for each level on one page. Here is my code,
together with the messages I get:

dat<-read.table("dat.txt", header=TRUE) 

str(dat) 
#'data.frame':11744 obs. of  11 variables: 
#$ WATERBODY_ID  : Factor w/ 1994 levels "GB102021072830",..: 1 1 2 2 2 
3 3 3 4 4 5 5 5 5 5 6 ... 
#$ SITE_ID  : int  157166 157166 1636 1636 1636 1635 1635 1635 
134261 1631 65383 65383 65383 111828 ... 
#$ Year  : int  2011 2014 2013 2006 2003 2002 2013 2005 2013 
2006 2011 2005 2004 ... 
#$ ResidualQ95  : num  100 100 80 80 80 98 98 98 105 105 101 101 130 
120 120 ... 
#$ LIFE.OE_spring: num  1.02 1.03 1.02 1.06 1.06 1.07 1.12 1.05 1.14 
1.05 1.09 1.14 1.04 0.97 0.98 ... 
#$ super.end.group  : Factor w/ 6 levels "B","C","D","E",..: 1 1 3 3 3 2 2 
2 4 4 ... 
#$ X.urban.suburban  : num  0 0 0.07 0.07 0.07 0.53 0.53 0.53 8.07 8.07 
0.27 0.27 0.27 0.27 0.27 0.72 ... 
#$ X.broadleaved_woodland: num  2.83 2.83 10.39 10.39 10.39 7.72 7.72 21.15 
21.15 14.44 14.44 ... 
#$ X.CapWks  : num  0 0 0 0 0 0 0 0 0 0 8.11 8.11 8.11 0 0 0 42.06 
42.06 7.08 0.2 ... 
#$ Hms_Poaching  : num  0 0 10 10 10 0 0 0 0 10 10 20 40 5 30 15 15 0 0 
0 50 50 ... 
#$ Hms_Rsctned  : num  0 0 0 0 0 0 0 0 0 0 2480 800 1960 1160 740 0 0 
960 ... 

library(gamm4) 

model <- gamm4(LIFE.OE_spring ~ s(ResidualQ95, by = super.end.group) +
                 t2(ResidualQ95, Hms_Rsctned, by = super.end.group) +
                 Year + Hms_Poaching + X.broadleaved_woodland +
                 X.urban.suburban + X.CapWks,
               data = dat, random = ~(1|WATERBODY_ID/SITE_ID))
# Warning messages:
# 1: In optwrap(optimizer, devfun, getStart(start, rho$lower, rho$pp), :
#   convergence code 1 from bobyqa: bobyqa -- maximum number of function evaluations exceeded
# 2: In optwrap(optimizer, devfun, opt$par, lower = rho$lower, control = control, :
#   convergence code 1 from bobyqa: bobyqa -- maximum number of function evaluations exceeded

I also tried the following but I got the same message as before. 

model <- gamm4(LIFE.OE_spring ~ s(ResidualQ95) +
                 t2(ResidualQ95, Hms_Rsctned, by = super.end.group) +
                 Year + Hms_Poaching + X.broadleaved_woodland +
                 X.urban.suburban + X.CapWks,
               data = dat, random = ~(1|WATERBODY_ID/SITE_ID))
# fixed-effect model matrix is rank deficient so dropping 1 column / coefficient

# Warning messages:
# 1: In optwrap(optimizer, devfun, getStart(start, rho$lower, rho$pp), :
#   convergence code 1 from bobyqa: bobyqa -- maximum number of function evaluations exceeded
# 2: In optwrap(optimizer, devfun, opt$par, lower = rho$lower, control = control, :
#   convergence code 1 from bobyqa: bobyqa -- maximum number of function evaluations exceeded
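
One thing I may try next is giving bobyqa more function evaluations. This
assumes gamm4 accepts a control argument that is passed through to lme4; I
have not verified that against ?gamm4, so treat this as a sketch:

library(lme4)  # lmerControl()
# raise bobyqa's evaluation limit well above the default
ctrl <- lmerControl(optimizer = "bobyqa",
                    optCtrl = list(maxfun = 2e5))
model <- gamm4(LIFE.OE_spring ~ s(ResidualQ95, by = super.end.group) +
                 t2(ResidualQ95, Hms_Rsctned, by = super.end.group) +
                 Year + Hms_Poaching + X.broadleaved_woodland +
                 X.urban.suburban + X.CapWks,
               data = dat, random = ~(1|WATERBODY_ID/SITE_ID),
               control = ctrl)  # assumption: forwarded to lmer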

When I tried to plot, I got the following error: 
plot(model$gam, page=1) 
#Error in plot.window(...) : need finite 'ylim' values 
#In addition: Warning messages: 
#  1: In min(ll, na.rm = TRUE) : 
#  no non-missing arguments to min; returning Inf 
#  2: In max(ul, na.rm = TRUE) : 
#  no non-missing arguments to max; returning -Inf 

I would really appreciate your help. My dataset is quite large, which is why I
have used str() to give you an idea of it; I have also attached a small sample
of the data as a txt file, trimmed heavily to fit within the email size limit.

Thank you very much. 

Best, 
Maria

WATERBODY_ID   SITE_ID Year ResidualQ95 LIFE.OE_spring super.end.group X.urban.suburban X.broadleaved_woodland X.CapWks Hms_Poaching Hms_Rsctned
GB102021072830 157166  2011 100 1.02 B 0    2.83  0 0 0
GB102021072830 157166  2014 100 1.03 B 0    2.83  0 0 0
GB102076070960 65383   2007 101 1.03 B 0.27 7.72  0 0 0
GB102076074040 65126   2006 61  1.04 B 0.33 32.18 0 0 0
GB102076074040 65126   2007 61  1.06 B 0.33 32.18 0 0 0
GB102076074070 65380   2007 100 0.94 B 3.56 1.89  0 0 0
GB102076074070 65380   2005 100 0.99 B 3.56 1.89  0 0 0
GB102076074070 65380   2002 100 1.03 B 3.56 1.89  0 0 0
GB102076074100 63432   2013 100 1.04 B 0.23 3.23  0 0 0
GB102077074270 67618   2013 100 1.05 B 0    4.06  0 5 0
GB102077074290 161086  2013 100 1.06 B 0.24 3.81  0 0 0
GB103022076790 1688    2002 100 0.99 B 0.15 6.65  0 0 0
GB103022076790 1688

Re: [R] substr gives empty output

2018-01-23 Thread Luigi Marongiu
Thank you, I got it; now it works well.
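
The working version, with substr(x, i, i) as Tim suggested; strsplit() also
gets all the characters in one step:

x <- "testing"
for (i in 1:nchar(x)) {
  print(substr(x, i, i))  # start and stop both equal i: the i-th character
}
strsplit(x, "")[[1]]      # alternative: all single characters at once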

On Mon, Jan 22, 2018 at 1:58 PM, Howard, Tim G (DEC) wrote:

> In
>
>  y <- substr(x, i, 1)
>
> your third integer needs to be the location not the number of digits, so
> change it to
>
>  y <- substr(x, i, i)
>
> and you should get what you want.
> Cheers,
> Tim
>
> > Date: Sun, 21 Jan 2018 10:50:31 -0500
> > From: Ek Esawi 
> > To: Luigi Marongiu , r-help@r-project.org
> > Subject: Re: [R] substr gives empty output
> >
> > The reason you get "" is, as stated in the previous response and in the
> > documentation of the substr function: "When extracting, if start is
> > larger than the string length then "" is returned." This is what happens
> > in your function.
> >
> > HTH
> >
> > EK
> >
> > On Sun, Jan 21, 2018 at 3:59 AM, Luigi Marongiu wrote:
> > > Dear all,
> > > I have a string, let's say "testing", and I would like to extract in
> > > sequence each letter (character) from it. But when I use substr() I
> > > only properly get the first character, the rest is empty (""). What am
> > > I getting wrong?
> > > For example, I have this code:
> > >
> > 
> > > x <- "testing"
> > > k <- nchar(x)
> > > for (i in 1:k) {
> > >   y <- substr(x, i, 1)
> > >   print(y)
> > > }
> > >
