Re: [R] subset English language using textcat package

2018-11-19 Thread Robert David Burbidge via R-help

Look at the help docs and examples for textcat and sapply:

print(as.character(data$x[sapply(data$x, textcat)=="english"]))

Note, however, that with its default settings textcat classifies "This book
is amazing" as Dutch, so you may want to read the help for textcat and
change the profile database (argument "p") or the "method".
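A minimal sketch of the profile change suggested above, assuming the textcat package is installed and that only English and German occur in the data. It restricts the bundled TC_char_profiles database to those two languages. The object names texts, p2 and english_texts are only illustrative, and very short strings can still be misclassified, so no expected output is shown.

```r
library(textcat)

texts <- c("I love this book", "This book is amazing",
           "several books in proccess", "Dieses Buch ist erstaunlich",
           "ich liebe dieses Buch", "mehrere bücher in prozess")

# Keep only the English and German profiles from the bundled database
p2 <- textcat::TC_char_profiles[
  names(textcat::TC_char_profiles) %in% c("english", "german")]

# textcat is vectorised over a character vector, so sapply is not needed
lang <- textcat(texts, p = p2)

# Guard against NA in case a text cannot be classified
english_texts <- texts[!is.na(lang) & lang == "english"]
print(english_texts)
```

With a factor column, as in the question, convert first with as.character(data$x).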


On 19/11/2018 09:48, Elahe chalabi via R-help wrote:

Hi all,

How is it possible to subset English text from a df containing German and 
English texts using textcat package?



 > library(textcat)
 > dput(data)
 structure(list(x = structure(c(2L, 6L, 5L, 3L, 1L, 4L), .Label = c("Dieses Buch ist erstaunlich",
 "I love this book", "ich liebe dieses Buch", "mehrere bücher in prozess",
 "several books in proccess", "This book is amazing"), class = "factor")),
 row.names = c(NA, -6L), class = "data.frame")

I want the output to be like the following:


 "I love this book"  "This book is amazing"  "several books in proccess"


Thanks for any help!
Elahe



__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Help with Centroids

2018-11-14 Thread Robert David Burbidge via R-help
# construct the dataframe
`TK-QUADRANT` <- c(9161,9162,9163,9164,10152,10154,10161,10163)
LAT <- c(55.07496,55.07496,55.02495,55.02496,54.97496,54.92495,54.97496,54.92496)
LON <- c(8.37477,8.458109,8.37477,8.45811,8.291435,8.291437,8.374774,8.374774)
df <- data.frame(`TK-QUADRANT`=`TK-QUADRANT`,LAT=LAT,LON=LON)

# group the data and calculate means by group
# (data.frame() has converted the column name TK-QUADRANT to TK.QUADRANT)
df$group <- floor(df$TK.QUADRANT/10)*10
out <- aggregate(df[c('LAT','LON')],by=list(df$group),mean)
print(out)

# see also:
# https://livefreeordichotomize.com/2018/06/27/bringing-the-family-together-finding-the-center-of-geographic-points-in-r/
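
The means above are plain arithmetic averages of LAT and LON, which is a reasonable approximation for cells this close together. For points spread over a larger area, one common alternative is to average them as unit vectors on a sphere. The following is a minimal sketch of that idea in base R. The helper name geo_centroid and the data frame name cells are hypothetical and not from the thread.

```r
# Hypothetical helper: mean position of points treated as unit vectors
geo_centroid <- function(lat, lon) {
  phi <- lat * pi / 180   # latitude in radians
  lam <- lon * pi / 180   # longitude in radians
  x <- mean(cos(phi) * cos(lam))
  y <- mean(cos(phi) * sin(lam))
  z <- mean(sin(phi))
  c(LAT = atan2(z, sqrt(x^2 + y^2)) * 180 / pi,
    LON = atan2(y, x) * 180 / pi)
}

cells <- data.frame(
  TK.QUADRANT = c(9161, 9162, 9163, 9164, 10152, 10154, 10161, 10163),
  LAT = c(55.07496, 55.07496, 55.02495, 55.02496,
          54.97496, 54.92495, 54.97496, 54.92496),
  LON = c(8.37477, 8.458109, 8.37477, 8.45811,
          8.291435, 8.291437, 8.374774, 8.374774))
cells$group <- floor(cells$TK.QUADRANT / 10) * 10

# One centroid per group of cells
centroids <- do.call(rbind, lapply(split(cells, cells$group), function(g) {
  data.frame(group = g$group[1], t(geo_centroid(g$LAT, g$LON)))
}))
print(centroids)
```

For this example table the result should differ from the aggregate() means only far beyond the precision of the input coordinates.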

Rgds,
Robert
On 14/11/2018 11:13, sasa kosanic wrote:
> Dear Robert,
> Thank you very much for your reply. Please see the attached PDF
> file.
> I hope it is now clearer what I am trying to do:
> calculate the new latitude and longitude of the centroids from the
> existing cells,
> as you can see from the attached PDF. From the lat/long of
> 9161, 9162, 9163 and 9164 I need to calculate a single lat/long that could
> be called, for example, 9160,
> and then from the lat/long of 10152 and 10154 a new single lat/long
> called 10150.
> But I guess I would need some kind of loop, as this is just an example
> table and the whole table covers the whole of Germany.
> Please let me know if it is still not clear what I am trying to do here.
>
> Best wishes,
> Sasha
>
> On Wed, 14 Nov 2018 at 07:58, Robert David Burbidge wrote:
>
> Hi Sasha,
>
> Your attached table did not come through, please see the posting
> guidelines:
> "No binary attachments except for PS, PDF, and some image and archive
> formats (others are automatically stripped off because they can
> contain malicious software). Files in other formats and larger ones
> should rather be put on the web and have only their URLs posted. This
> way a reader has the option to download them or not."
> https://www.r-project.org/posting-guide.html
>
> It is not clear what you are trying to do. As a first step it
> looks like you want something like:
>  
> lat <- c(9161,9162,9163,9164,10152,10154)
> floor(lat/10)*10
> 
>
> Please provide further details on what you are trying to do.
>
> Rgds,
>
> Robert
>
>
> On 13/11/2018 09:51, sasa kosanic wrote:
> > Dear All, I am pretty new to R and would appreciate help with how to
> > calculate centroids from the latitude and longitude of existing
> > cells (e.g. to get the centroid for a new cell I would need to combine
> > latitude and 9161,9162,9163,9164 to 9160 or 10152, 10154 to 10150
> > etc.) Please see attached table. Thank you very much! Best, Sasha
>
>
>
> -- 
>
> Dr Sasha Kosanic
> Ecology Lab (Biology Department)
> Room M644
> University of Konstanz
> Universitätsstraße 10
> D-78464 Konstanz
> Phone: +49 7531 883321 & +49 (0)175 9172503
>
> http://cms.uni-konstanz.de/vkleunen/
> https://tinyurl.com/y8u5wyoj
> https://tinyurl.com/cgec6tu
>
>



Re: [R] POS tagging generating a string

2018-11-13 Thread Robert David Burbidge via R-help

On 13/11/2018 12:31, Elahe chalabi wrote:


Hi Robert,

Thanks for your reply, but your code returns the number of verbs in each 
message. What I want is a string showing the verbs in each message.


The output of my code (below) is:

# A tibble: 4 x 2
  DocumentID verbs
       <int> <chr>
1     478920 has|been|updated
2     499497 explained
3     510133 it
4     930234 Thank

Is this not what you wanted?

Rgds,

Robert


On Wednesday, November 7, 2018 7:31 AM, Robert David Burbidge wrote:



Hi Elahe,
You could modify your count_verbs function from your previous post:
 * use scan to extract the tokens (words) from Message
 * use your previous grepl expression to index the tokens that are verbs
 * paste the verbs together to form the entries of a new column.

Here is one solution:

library(openNLP)
library(NLP)

df <- data.frame(DocumentID = c(478920L, 510133L, 499497L, 930234L),
  Message = structure(c(4L, 2L, 3L, 1L), .Label = c(
    "Thank you very much for your nice feedback.\n",
    "THank you, added it", "Thanks for the well explained article.",
    "The solution has been updated"), class = "factor"))


dput(df)

tagPOS <-  function(x, ...) {
   s <- as.String(x)
   if(s=="") return(list())
   word_token_annotator <- Maxent_Word_Token_Annotator()
   a2 <- Annotation(1L, "sentence", 1L, nchar(s))
   a2 <- annotate(s, word_token_annotator, a2)
   a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
   a3w <- a3[a3$type == "word"]
   POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
   POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
   list(POStagged = POStagged, POStags = POStags)
}

verbs <-function(x) {
   tagPOSx <- tagPOS(x)
   scanx <- scan(text=as.character(x), what="character")
   n <- length(scanx)
   paste(scanx[(1:n)[grepl("VB", tagPOSx$POStags)]], collapse="|")
}

library(dplyr)

df %>% group_by(DocumentID) %>% summarise(verbs = verbs(Message))

I'll leave it to you to extract the column of verbs from the result
and merge it into the original data.frame by DocumentID (the summarised
rows come back sorted by DocumentID, so rbind or a plain cbind would not
line up).

Btw, I don't think this solution is efficient. I would guess that the
processing that scan does in the verbs function duplicates work
already done in the tagPOS function by annotate, so you may want to
return a list of tokens from tagPOS and use that instead of scan.

Rgds,
Robert


On 06/11/18 10:26, Elahe chalabi via R-help wrote:

Hi all, In my df I would like to generate a new column which contains a string 
showing all the verbs in each row of df$Message.

library(openNLP)
library(NLP)
dput(df)
structure(list(DocumentID = c(478920L, 510133L, 499497L, 930234L),
  Message = structure(c(4L, 2L, 3L, 1L), .Label = c(
    "Thank you very much for your nice feedback.\n",
    "THank you, added it", "Thanks for the well explained article.",
    "The solution has been updated"), class = "factor")),
  class = "data.frame", row.names = c(NA, -4L))

tagPOS <- function(x, ...) {
   s <- as.String(x)
   word_token_annotator <- Maxent_Word_Token_Annotator()
   a2 <- Annotation(1L, "sentence", 1L, nchar(s))
   a2 <- annotate(s, word_token_annotator, a2)
   a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
   a3w <- a3[a3$type == "word"]
   POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
   POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
   list(POStagged = POStagged, POStags = POStags)
}

Any help? Thanks in advance!
Elahe





Re: [R] Help with Centroids

2018-11-13 Thread Robert David Burbidge via R-help

Hi Sasha,

Your attached table did not come through, please see the posting guidelines:
"No binary attachments except for PS, PDF, and some image and archive 
formats (others are automatically stripped off because they can contain 
malicious software). Files in other formats and larger ones should 
rather be put on the web and have only their URLs posted. This way a 
reader has the option to download them or not."

https://www.r-project.org/posting-guide.html

It is not clear what you are trying to do. As a first step it looks like 
you want something like:


lat <- c(9161,9162,9163,9164,10152,10154)
floor(lat/10)*10


Please provide further details on what you are trying to do.

Rgds,

Robert


On 13/11/2018 09:51, sasa kosanic wrote:
Dear All, I am pretty new to R and would appreciate help with how to 
calculate centroids from the latitude and longitude of existing cells 
(e.g. to get the centroid for a new cell I would need to combine latitude 
and 9161,9162,9163,9164 to 9160 or 10152, 10154 to 10150 etc.) Please 
see attached table. Thank you very much! Best, Sasha




Re: [R] saveRDS() and readRDS() Why? [solved, kind of]

2018-11-08 Thread Robert David Burbidge via R-help
Apologies, unserialize takes a connection, not a file name, so you would 
need something like:


# linux (not run)
f <- file("rawData.rds", open="r")
rawData <- unserialize(f)
close(f)

The help file states that readRDS will read a file created by serialize 
(saveRDS is a wrapper for serialize).


It appears that the problem was "byte-shuffling at both ends when 
transferring data from one little-endian machine to another" and was 
worked around by using xdr = FALSE. So, this wouldn't necessarily work 
when transferring between big-endian and little-endian machines.
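
As a self-contained illustration of the round trip discussed above, here is a minimal sketch using a temporary file and a small stand-in object (obj, path and con are illustrative names, not from the thread). The binary connection modes "wb" and "rb" are my own choice here, made so that the binary format and the xdr setting apply. The posts in this thread used "w" and "r".

```r
# Stand-in object and file; in the thread these were rawData and "rawData.rds"
obj <- list(id = 1:3, label = c("a", "b", "c"))
path <- tempfile(fileext = ".rds")

# Write through an open connection in binary mode, without XDR byte order
con <- file(path, open = "wb")
serialize(obj, con, xdr = FALSE)
close(con)

# unserialize needs an open connection (or a raw vector), not a file name
con <- file(path, open = "rb")
restored <- unserialize(con)
close(con)

# readRDS takes the file name directly, as reported in the thread
restored2 <- readRDS(path)

print(identical(obj, restored))
print(identical(obj, restored2))
```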


On 08/11/18 07:27, Patrick Connolly wrote:

Many thanks to Berwin, Eric, Robert, and Jan for their input.

I had hoped it was as simple as because I typed

saveRDS("rawData", file = "rawData.rds") on the Windows side.
but that wasn't the case.

Robert Burbidge suggested:

# windows (not run)
f <- file("rawData.rds", open="w")
serialize(rawData, f, xdr = FALSE)
close(f)

# linux
rawData <- unserialize(file = "rawData.rds")

That didn't work:
Error in unserialize(file = "rawData.rds") :
   unused argument (file = "rawData.rds")
(the argument isn't 'file')

Nor did

rawData <- unserialize("rawData.rds")

Error in unserialize("rawData.rds") :
   character vectors are no longer accepted by unserialize()

However

readRDS(file = "rawData.rds") did!

So what I needed was serialize but not unserialize.

I still don't know Why, but I know How.




Re: [R] saveRDS() and readRDS() Why?

2018-11-07 Thread Robert David Burbidge via R-help

Patrick,

I cannot reproduce this behaviour. I'm using:

Windows 8.1; R 3.5.1; RStudio 1.1.463

running in a VirtualBox on Ubuntu 18.04 with R 3.4.4; RStudio 1.1.456

The file size of rawData.rds is always 88 bytes in my example and od 
gives the same results on Windows and Linux.


I am using a VirtualBox shared folder to transfer from Windows to Linux.

Could you provide details of your machines?

Rgds,

Robert


On 07/11/18 07:56, Patrick Connolly wrote:

 From a Windows R session, I do
  

object.size(rawData)

31736 bytes  # from scraping a non-reproducible web address.

saveRDS(rawData, file = "rawData.rds")

Then copy to a Linux session


rawData <- readRDS(file = "rawData.rds")
rawData

[1] "rawData"

object.size(rawData)

112 bytes

rawData

[1] "rawData" # only the name and something to make up 112 bytes
Have I misunderstood the syntax?

It's an old version on Windows.  I haven't used Windows R since then.

major  3
minor  2.4
year   2016
month  03
day    16


I've tried R-3.5.0 and R-3.5.1 Linux versions.

In case it's material ...

I couldn't get the scraping to work on either of the R installations
but Windows users told me it worked for them.  So I thought I'd get
the R object and use it.  I could understand accessing the web address
could have different permissions for different OSes, but should that
affect the R objects?

TIA





Re: [R] saveRDS() and readRDS() Why?

2018-11-07 Thread Robert David Burbidge via R-help

If the file sizes are the same, then presumably both contain the binary data. 
From the serialize function help:

"As almost all systems in current use are little-endian, xdr = FALSE can be used to 
avoid byte-shuffling at both ends when transferring data from one little-endian machine 
to another (or between processes on the same machine). Depending on the system, this can 
speed up serialization and unserialization by a factor of up to 3x."

So you could try:

# windows (not run)
f <- file("rawData.rds", open="w")
serialize(rawData, f, xdr = FALSE)
close(f)

# linux
rawData <- unserialize(file = "rawData.rds")

HTH

On 07/11/18 08:45, Patrick Connolly wrote:


On Wed, 07-Nov-2018 at 08:27AM +, Robert David Burbidge wrote:

|> Hi Patrick,
|>
|> From the help: "save writes a single line header (typically
|> "RDXs\n") before the serialization of a single object".
|>
|> If the file sizes are the same (see Eric's message), then the
|> problem may be due to different line terminators. Try serialize and
|> unserialize for low-level control of saving/reading objects.

I'll have to find out what 'serialize' means.

On Windows, it's a huge table, looks like it's all hexadecimal.

On Linux, it's just the text string 'rawData' -- a lot more than line
terminators.

Have I misunderstood what the idea is?  I thought I'd get an identical
object, irrespective of how different the OS stores and zips it.



|>
|> Rgds,
|>
|> Robert
|>
|>
|> On 07/11/18 08:13, Eric Berger wrote:
|> >What do you see at the OS level?
|> >i.e. on windows
|> >DIR rawData.rds
|> >on linux
|> >ls -l rawData.rds
|> >compare the file sizes on both.
|> >
|> >
|> >On Wed, Nov 7, 2018 at 9:56 AM Patrick Connolly wrote:
|> >
|> >> From a Windows R session, I do
|> >>
|> >>>object.size(rawData)
|> >>31736 bytes  # from scraping a non-reproducible web address.
|> >>>saveRDS(rawData, file = "rawData.rds")
|> >>Then copy to a Linux session
|> >>
|> >>>rawData <- readRDS(file = "rawData.rds")
|> >>>rawData
|> >>[1] "rawData"
|> >>>object.size(rawData)
|> >>112 bytes
|> >>>rawData
|> >>[1] "rawData" # only the name and something to make up 112 bytes
|> >>Have I misunderstood the syntax?
|> >>
|> >>It's an old version on Windows.  I haven't used Windows R since then.
|> >>
|> >>major  3
|> >>minor  2.4
|> >>year   2016
|> >>month  03
|> >>day    16
|> >>
|> >>
|> >>I've tried R-3.5.0 and R-3.5.1 Linux versions.
|> >>
|> >>In case it's material ...
|> >>
|> >>I couldn't get the scraping to work on either of the R installations
|> >>but Windows users told me it worked for them.  So I thought I'd get
|> >>the R object and use it.  I could understand accessing the web address
|> >>could have different permissions for different OSes, but should that
|> >>affect the R objects?
|> >>
|> >>TIA
|> >>
|> >>-





Re: [R] saveRDS() and readRDS() Why?

2018-11-07 Thread Robert David Burbidge via R-help

Hi Patrick,

From the help: "save writes a single line header (typically "RDXs\n") 
before the serialization of a single object".


If the file sizes are the same (see Eric's message), then the problem 
may be due to different line terminators. Try serialize and unserialize 
for low-level control of saving/reading objects.


Rgds,

Robert


On 07/11/18 08:13, Eric Berger wrote:

What do you see at the OS level?
i.e. on windows
DIR rawData.rds
on linux
ls -l rawData.rds
compare the file sizes on both.
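
The same comparison can also be made from within R on each machine. This is a minimal sketch. It writes a small stand-in file so that it runs anywhere, whereas in practice path would be "rawData.rds" on each side. A matching size and MD5 checksum on both machines would indicate that the file was copied unchanged.

```r
# Stand-in file; replace with "rawData.rds" on the Windows and Linux sides
path <- tempfile(fileext = ".rds")
saveRDS(1:10, file = path)

# Size in bytes and MD5 checksum of the file as stored on disk
info <- data.frame(size_bytes = file.size(path),
                   md5 = unname(tools::md5sum(path)))
print(info)
```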


On Wed, Nov 7, 2018 at 9:56 AM Patrick Connolly wrote:


 From a Windows R session, I do


object.size(rawData)

31736 bytes  # from scraping a non-reproducible web address.

saveRDS(rawData, file = "rawData.rds")

Then copy to a Linux session


rawData <- readRDS(file = "rawData.rds")
rawData

[1] "rawData"

object.size(rawData)

112 bytes

rawData

[1] "rawData" # only the name and something to make up 112 bytes
Have I misunderstood the syntax?

It's an old version on Windows.  I haven't used Windows R since then.

major  3
minor  2.4
year   2016
month  03
day    16


I've tried R-3.5.0 and R-3.5.1 Linux versions.

In case it's material ...

I couldn't get the scraping to work on either of the R installations
but Windows users told me it worked for them.  So I thought I'd get
the R object and use it.  I could understand accessing the web address
could have different permissions for different OSes, but should that
affect the R objects?

TIA

--
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.
___Patrick Connolly
  {~._.~}   Great minds discuss ideas
  _( Y )_ Average minds discuss events
(:_~*~_:)  Small minds discuss people
  (_)-(_)  . Eleanor Roosevelt

~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.



Re: [R] POS tagging generating a string

2018-11-06 Thread Robert David Burbidge via R-help
Hi Elahe,
You could modify your count_verbs function from your previous post:

  * use scan to extract the tokens (words) from Message
  * use your previous grepl expression to index the tokens that are verbs
  * paste the verbs together to form the entries of a new column.

Here is one solution:

 >>>
library(openNLP)
library(NLP)

df <- data.frame(DocumentID = c(478920L, 510133L, 499497L, 930234L),
  Message = structure(c(4L, 2L, 3L, 1L), .Label = 
c("Thank you very much for your nice feedback.\n",
"THank you, added it", "Thanks for the well explained article.",
"The solution has been updated"), class = "factor"))


dput(df)

tagPOS <-  function(x, ...) {
   s <- as.String(x)
   if(s=="") return(list())
   word_token_annotator <- Maxent_Word_Token_Annotator()
   a2 <- Annotation(1L, "sentence", 1L, nchar(s))
   a2 <- annotate(s, word_token_annotator, a2)
   a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
   a3w <- a3[a3$type == "word"]
   POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
   POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
   list(POStagged = POStagged, POStags = POStags)
}

verbs <-function(x) {
   tagPOSx <- tagPOS(x)
   scanx <- scan(text=as.character(x), what="character")
   n <- length(scanx)
   paste(scanx[(1:n)[grepl("VB", tagPOSx$POStags)]], collapse="|")
}

library(dplyr)

df %>% group_by(DocumentID) %>% summarise(verbs = verbs(Message))
<<<

I'll leave it to you to extract the column of verbs from the result and 
merge it into the original data.frame by DocumentID (the summarised rows 
come back sorted by DocumentID, so rbind or a plain cbind would not line up).

Btw, I don't think this solution is efficient. I would guess that the 
processing that scan does in the verbs function duplicates work 
already done in the tagPOS function by annotate, so you may want to 
return a list of tokens from tagPOS and use that instead of scan.
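
A minimal sketch of that last suggestion, assuming the NLP and openNLP packages (with their English models and a working Java setup) are installed. The function names tagPOS_tokens and verbs_from_tokens are hypothetical. Taking the tokens from the annotator itself also avoids a possible mismatch: whitespace-split tokens from scan may not line up with the annotator's tokens when punctuation is attached to a word (for example "you,"), which would shift the index.

```r
library(NLP)
library(openNLP)

# Variant of tagPOS that returns the annotator's own tokens with their tags
tagPOS_tokens <- function(x) {
  s <- as.String(x)
  if (s == "") return(list(tokens = character(0), POStags = character(0)))
  a2 <- Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- annotate(s, Maxent_Word_Token_Annotator(), a2)
  a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
  a3w <- a3[a3$type == "word"]
  list(tokens = s[a3w],
       POStags = unlist(lapply(a3w$features, `[[`, "POS")))
}

# Paste together the tokens whose tag contains "VB"
verbs_from_tokens <- function(x) {
  tagged <- tagPOS_tokens(x)
  paste(tagged$tokens[grepl("VB", tagged$POStags)], collapse = "|")
}

msgs <- c("The solution has been updated", "THank you, added it",
          "Thanks for the well explained article.",
          "Thank you very much for your nice feedback.\n")
res <- vapply(msgs, verbs_from_tokens, character(1), USE.NAMES = FALSE)
print(res)
```

The tags depend on the installed model, so the exact verbs returned are not shown here.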

Rgds,
Robert

On 06/11/18 10:26, Elahe chalabi via R-help wrote:
> Hi all, In my df I would like to generate a new column which contains 
> a string showing all the verbs in each row of df$Message.
>
> library(openNLP)
> library(NLP)
> dput(df)
> structure(list(DocumentID = c(478920L, 510133L, 499497L, 930234L),
>   Message = structure(c(4L, 2L, 3L, 1L), .Label = c(
>     "Thank you very much for your nice feedback.\n",
>     "THank you, added it", "Thanks for the well explained article.",
>     "The solution has been updated"), class = "factor")),
>   class = "data.frame", row.names = c(NA, -4L))
>
> tagPOS <- function(x, ...) {
>    s <- as.String(x)
>    word_token_annotator <- Maxent_Word_Token_Annotator()
>    a2 <- Annotation(1L, "sentence", 1L, nchar(s))
>    a2 <- annotate(s, word_token_annotator, a2)
>    a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
>    a3w <- a3[a3$type == "word"]
>    POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
>    POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
>    list(POStagged = POStagged, POStags = POStags)
> }
>
> Any help? Thanks in advance!
> Elahe



