Re: [R] Row exclude

2022-01-30 Thread David Carlson via R-help
You need to add "-": `(dat3 <- dat1[-unique(c(BadName, BadAge,
BadWeight)), ])`, which negates the selection (NOT).
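
For listing the excluded rows themselves, here is a minimal sketch, assuming `dat1`, `BadName`, `BadAge`, and `BadWeight` are built as in the earlier message quoted below. The minus sign selects the rows that are kept, while the positive row numbers select the flagged rows; sorting them first restores the original row order. The names `bad` and `dat_kept` are introduced here for illustration only.

```r
dat1 <- read.table(text = "Name,Age,Weight
Alex,20,13X
Bob,25,142
Carol,24,120
John,3BC,175
Katy,35,160
Jack3,34,140", sep = ",", header = TRUE, stringsAsFactors = FALSE,
                   strip.white = TRUE)

BadName   <- which(grepl("[[:digit:]]", dat1$Name))
BadAge    <- which(grepl("[[:alpha:]]", dat1$Age))
BadWeight <- which(grepl("[[:alpha:]]", dat1$Weight))

# Row numbers of every flagged row, de-duplicated and in original order
bad <- sort(unique(c(BadName, BadAge, BadWeight)))

dat3     <- dat1[bad, ]   # the excluded rows: Alex, John, Jack3
dat_kept <- dat1[-bad, ]  # the rows that survive: Bob, Carol, Katy
```

One caveat with the negative form: if no row is flagged, `-bad` is an empty index and selects no rows at all, so a logical mask such as `dat1[!(seq_len(nrow(dat1)) %in% bad), ]` is the safer way to keep rows.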

David

On Sun, Jan 30, 2022 at 11:00 AM Val  wrote:

> Thank you David.
>
> What about if I want to list the excluded rows?
> I used this
> (dat3 <- dat1[unique(c(BadName, BadAge, BadWeight)), ])
>
> It did not work.The desired output  is,
>   Alex,  20,  13X
>  John,  3BC, 175
>  Jack3, 34,  140
>
> Thank you,
>
> On Sat, Jan 29, 2022 at 10:15 PM David Carlson  wrote:
>
>> It is possible that there would be errors on the same row for different
>> columns. This does not happen in your example. If row 4 was "John6, 3BC,
>> 175X" then row 4 would be included 3 times, but we only need to remove it
>> once. Removing the duplicates is not necessary since R would not get
>> confused, but length(unique(c(BadName, BadAge, BadWeight))) indicates how
>> many lines are being removed.
>>
>> David
>>
>> On Sat, Jan 29, 2022 at 8:32 PM Val  wrote:
>>
>>> Thank you David for your help.
>>>
>>> I just have one question on this. What is the purpose of  using the
>>> "unique" function on this?
>>>   (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ])
>>>
>>> I got the same result without using it.
>>>   (dat2 <- dat1[-(c(BadName, BadAge, BadWeight)), ])
>>>
>>> My concern is when I am applying this for the large data set the
>>> "unique"  function may consume resources(time  and memory).
>>>
>>> Thank you.
>>>
>>> On Sat, Jan 29, 2022 at 12:30 AM David Carlson 
>>> wrote:
>>>
 Given that you know which columns should be numeric and which should be
 character, finding characters in numeric columns or numbers in character
 columns is not difficult. Your data frame consists of three character
 columns so you can use regular expressions as Bert mentioned. First
 you should strip the whitespace out of your data:

 dat1 <-read.table(text="Name, Age, Weight
   Alex,  20,  13X
   Bob,  25,  142
   Carol, 24,  120
   John,  3BC,  175
   Katy,  35,  160
   Jack3, 34,  140",sep=",", header=TRUE, stringsAsFactors=FALSE,
 strip.white=TRUE)

 Now check to see if all of the fields are character as expected.

 sapply(dat1, typeof)
 #        Name         Age      Weight
 # "character" "character" "character"

 Now identify character variables containing numbers and numeric
 variables containing characters:

 BadName <- which(grepl("[[:digit:]]", dat1$Name))
 BadAge <- which(grepl("[[:alpha:]]", dat1$Age))
 BadWeight <- which(grepl("[[:alpha:]]", dat1$Weight))

 Next remove those rows:

 (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ])
 #     Name Age Weight
 #  2   Bob  25    142
 #  3 Carol  24    120
 #  5  Katy  35    160

 You still need to convert Age and Weight to numeric, e.g. dat2$Age <-
 as.numeric(dat2$Age).

 David Carlson


 On Fri, Jan 28, 2022 at 11:59 PM Bert Gunter 
 wrote:

> As character 'polluted' entries will cause a column to be read in (via
> read.table and relatives) as factor or character data, this sounds like a
> job for regular expressions. If you are not familiar with this subject,
> time to learn. And, yes, some heavy lifting will be required.
> See ?regexp for a start maybe? Or the stringr package?
>
> Cheers,
> Bert
>
>
>
>
> On Fri, Jan 28, 2022, 7:08 PM Val  wrote:
>
> > Hi All,
> >
> > I want to remove rows that contain a character string in an integer
> > column or a digit in a character column.
> >
> > Sample data
> >
> > dat1 <-read.table(text="Name, Age, Weight
> >  Alex,  20,  13X
> >  Bob,   25,  142
> 

Re: [R] Row exclude

2022-01-30 Thread Val
Thank you David.

What about if I want to list the excluded rows?
I used this
(dat3 <- dat1[unique(c(BadName, BadAge, BadWeight)), ])

It did not work. The desired output is,
  Alex,  20,  13X
 John,  3BC, 175
 Jack3, 34,  140

Thank you,

On Sat, Jan 29, 2022 at 10:15 PM David Carlson  wrote:

> It is possible that there would be errors on the same row for different
> columns. This does not happen in your example. If row 4 was "John6, 3BC,
> 175X" then row 4 would be included 3 times, but we only need to remove it
> once. Removing the duplicates is not necessary since R would not get
> confused, but length(unique(c(BadName, BadAge, BadWeight))) indicates how
> many lines are being removed.
>
> David
>
> On Sat, Jan 29, 2022 at 8:32 PM Val  wrote:
>
>> Thank you David for your help.
>>
>> I just have one question on this. What is the purpose of  using the
>> "unique" function on this?
>>   (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ])
>>
>> I got the same result without using it.
>>   (dat2 <- dat1[-(c(BadName, BadAge, BadWeight)), ])
>>
>> My concern is when I am applying this for the large data set the
>> "unique"  function may consume resources(time  and memory).
>>
>> Thank you.
>>
>> On Sat, Jan 29, 2022 at 12:30 AM David Carlson  wrote:
>>
>>> Given that you know which columns should be numeric and which should be
>>> character, finding characters in numeric columns or numbers in character
>>> columns is not difficult. Your data frame consists of three character
>>> columns so you can use regular expressions as Bert mentioned. First you
>>> should strip the whitespace out of your data:
>>>
>>> dat1 <-read.table(text="Name, Age, Weight
>>>   Alex,  20,  13X
>>>   Bob,  25,  142
>>>   Carol, 24,  120
>>>   John,  3BC,  175
>>>   Katy,  35,  160
>>>   Jack3, 34,  140",sep=",", header=TRUE, stringsAsFactors=FALSE,
>>> strip.white=TRUE)
>>>
>>> Now check to see if all of the fields are character as expected.
>>>
>>> sapply(dat1, typeof)
>>> #        Name         Age      Weight
>>> # "character" "character" "character"
>>>
>>> Now identify character variables containing numbers and numeric
>>> variables containing characters:
>>>
>>> BadName <- which(grepl("[[:digit:]]", dat1$Name))
>>> BadAge <- which(grepl("[[:alpha:]]", dat1$Age))
>>> BadWeight <- which(grepl("[[:alpha:]]", dat1$Weight))
>>>
>>> Next remove those rows:
>>>
>>> (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ])
>>> #     Name Age Weight
>>> #  2   Bob  25    142
>>> #  3 Carol  24    120
>>> #  5  Katy  35    160
>>>
>>> You still need to convert Age and Weight to numeric, e.g. dat2$Age <-
>>> as.numeric(dat2$Age).
>>>
>>> David Carlson
>>>
>>>
>>> On Fri, Jan 28, 2022 at 11:59 PM Bert Gunter 
>>> wrote:
>>>
 As character 'polluted' entries will cause a column to be read in (via
 read.table and relatives) as factor or character data, this sounds like a
 job for regular expressions. If you are not familiar with this subject,
 time to learn. And, yes, some heavy lifting will be required.
 See ?regexp for a start maybe? Or the stringr package?

 Cheers,
 Bert




 On Fri, Jan 28, 2022, 7:08 PM Val  wrote:

 > Hi All,
 >
 > I want to remove rows that contain a character string in an integer
 > column or a digit in a character column.
 >
 > Sample data
 >
 > dat1 <-read.table(text="Name, Age, Weight
 >  Alex,  20,  13X
 >  Bob,   25,  142
 >  Carol, 24,  120
 >  John,  3BC,  175
 >  Katy,  35,  160
 >  Jack3, 34,  140",sep=",",header=TRUE,stringsAsFactors=F)
 >
 > If the Age/Weight column contains any character(s) then remove
 > if the Name  column contains an digit then remove that row
 > Desired output
 >
 >     Name Age weight
 > 1    Bob  25    142
 > 2  Carol  24    120
 > 3   Katy  35    160
 >
 > Thank you,
 >
 > __
 > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
 > 

Re: [R] Row exclude

2022-01-29 Thread David Carlson via R-help
It is possible that there would be errors on the same row for different
columns. This does not happen in your example. If row 4 was "John6, 3BC,
175X" then row 4 would be included 3 times, but we only need to remove it
once. Removing the duplicates is not necessary since R would not get
confused, but length(unique(c(BadName, BadAge, BadWeight))) indicates how
many lines are being removed.
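
The point about duplicates can be checked with a small sketch, assuming the hypothetical row 4 from the paragraph above ("John6, 3BC, 175X"). The names `dat_dup` and `bad_all` are introduced here for illustration and are not part of the original code.

```r
dat_dup <- read.table(text = "Name,Age,Weight
Alex,20,13X
Bob,25,142
Carol,24,120
John6,3BC,175X
Katy,35,160
Jack3,34,140", sep = ",", header = TRUE, stringsAsFactors = FALSE,
                      strip.white = TRUE)

# Row 4 is flagged once per column, so it appears three times
bad_all <- c(which(grepl("[[:digit:]]", dat_dup$Name)),
             which(grepl("[[:alpha:]]", dat_dup$Age)),
             which(grepl("[[:alpha:]]", dat_dup$Weight)))

n_entries  <- length(bad_all)          # number of index entries, repeats included
n_removed  <- length(unique(bad_all))  # number of distinct rows removed

# Negative indexing ignores the repeats, so both forms drop the same rows
same_result <- identical(dat_dup[-bad_all, ], dat_dup[-unique(bad_all), ])
```

So `unique()` is there for the bookkeeping (an accurate count of removed rows), not because the subsetting needs it.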

David

On Sat, Jan 29, 2022 at 8:32 PM Val  wrote:

> Thank you David for your help.
>
> I just have one question on this. What is the purpose of  using the
> "unique" function on this?
>   (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ])
>
> I got the same result without using it.
>   (dat2 <- dat1[-(c(BadName, BadAge, BadWeight)), ])
>
> My concern is when I am applying this for the large data set the "unique"
> function may consume resources(time  and memory).
>
> Thank you.
>
> On Sat, Jan 29, 2022 at 12:30 AM David Carlson  wrote:
>
>> Given that you know which columns should be numeric and which should be
>> character, finding characters in numeric columns or numbers in character
>> columns is not difficult. Your data frame consists of three character
>> columns so you can use regular expressions as Bert mentioned. First you
>> should strip the whitespace out of your data:
>>
>> dat1 <-read.table(text="Name, Age, Weight
>>   Alex,  20,  13X
>>   Bob,  25,  142
>>   Carol, 24,  120
>>   John,  3BC,  175
>>   Katy,  35,  160
>>   Jack3, 34,  140",sep=",", header=TRUE, stringsAsFactors=FALSE,
>> strip.white=TRUE)
>>
>> Now check to see if all of the fields are character as expected.
>>
>> sapply(dat1, typeof)
>> #        Name         Age      Weight
>> # "character" "character" "character"
>>
>> Now identify character variables containing numbers and numeric variables
>> containing characters:
>>
>> BadName <- which(grepl("[[:digit:]]", dat1$Name))
>> BadAge <- which(grepl("[[:alpha:]]", dat1$Age))
>> BadWeight <- which(grepl("[[:alpha:]]", dat1$Weight))
>>
>> Next remove those rows:
>>
>> (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ])
>> #     Name Age Weight
>> #  2   Bob  25    142
>> #  3 Carol  24    120
>> #  5  Katy  35    160
>>
>> You still need to convert Age and Weight to numeric, e.g. dat2$Age <-
>> as.numeric(dat2$Age).
>>
>> David Carlson
>>
>>
>> On Fri, Jan 28, 2022 at 11:59 PM Bert Gunter 
>> wrote:
>>
>>> As character 'polluted' entries will cause a column to be read in (via
>>> read.table and relatives) as factor or character data, this sounds like a
>>> job for regular expressions. If you are not familiar with this subject,
>>> time to learn. And, yes, some heavy lifting will be required.
>>> See ?regexp for a start maybe? Or the stringr package?
>>>
>>> Cheers,
>>> Bert
>>>
>>>
>>>
>>>
>>> On Fri, Jan 28, 2022, 7:08 PM Val  wrote:
>>>
>>> > Hi All,
>>> >
>>> > I want to remove rows that contain a character string in an integer
>>> > column or a digit in a character column.
>>> >
>>> > Sample data
>>> >
>>> > dat1 <-read.table(text="Name, Age, Weight
>>> >  Alex,  20,  13X
>>> >  Bob,   25,  142
>>> >  Carol, 24,  120
>>> >  John,  3BC,  175
>>> >  Katy,  35,  160
>>> >  Jack3, 34,  140",sep=",",header=TRUE,stringsAsFactors=F)
>>> >
>>> > If the Age/Weight column contains any character(s) then remove
>>> > if the Name  column contains an digit then remove that row
>>> > Desired output
>>> >
>>> >     Name Age weight
>>> > 1    Bob  25    142
>>> > 2  Carol  24    120
>>> > 3   Katy  35    160
>>> >
>>> > Thank you,
>>> >
>>> > __
>>> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> > https://stat.ethz.ch/mailman/listinfo/r-help
>>> > PLEASE do read the posting guide
>>> > http://www.R-project.org/posting-guide.html
>>> > and provide commented, minimal, self-contained, reproducible code.
>>> >
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> __r-h...@r-project.org 

Re: [R] Row exclude

2022-01-29 Thread Val
agree that the user may have a
> more general concept to be considered here. That is the concept of having a
> data.frame where each column is purely numeric consisting of just 0 through
> 9 with perhaps no spaces, periods or commas or anything extraneous, OR
> purely alphabetic with no numerals allowed, and alphabetic in the same
> sense as Rui uses. Mind you, I do not see any reason for always using the
> current locale for something like names of people that may well be written
> with characters from another locale. I would think any string with all
> non-numeric characters might be allowed for the purpose.
>
> Can you write a function that accepts only pure cells that either have no
> numerals or no alphabeticals but not a mixture? The same function can
> initially be applied to all columns of a data.frame that only is supposed
> to contain columns of one kind or the other but not combinations. You might
> begin by reading in all data as character mode with perhaps extraneous
> white space stripped. You then apply the above function to identify any
> rows that contain a mixed alphanumeric item and eliminate all such rows.
> For consistency, you might examine the resulting data.frame and try to
> convert all columns to numeric. Any that fail conversion attempts  are left
> as character but may possibly have anomalies like one of more alphabetic
> items mixed into  an otherwise numeric set of entries. That might require
> another filter run per column to identify those and either remove more rows
> or replace the bad ones with NA or a default like 0 in what is then
> convertable to a numeric column.
>
> But my thought was that it is more complex to design something (as Rui
> did) that takes a list of intended column types, or a function that knows
> how to deal with each, as compared to an all-purpose function that just
> insists on purity at a local level and is a simpler program to write.
>
> Avious
>
> -Original Message-
> From: Rui Barradas 
> To: Avi Gross ; dcarl...@tamu.edu ;
> bgunter.4...@gmail.com 
> Cc: r-help@r-project.org 
> Sent: Sat, Jan 29, 2022 1:33 pm
> Subject: Re: [R] Row exclude
>
>
> Hello,
>
> Thanks for the comments, a few others inline.
>
> Às 18:04 de 29/01/2022, Avi Gross escreveu:
> > There are many creative ways to solve problems and some may get you in
> > trouble if you present them in class while even in some work situations,
> > they may be hard for most to understand, let alone maintain and make
> > changes.
> >
> > This group is amorphous enough that we have people who want "help" who
> > are new to the language, but also people who know plenty and encounter a
> > new kind of problem, and of course people who want to make use of what
> > they see as free labor.
> >
> > Rui presented a very interesting idea and I like some aspects. But if
> > presented to most people, they might have to start looking up things.
> >
> > But I admit I liked some of the ideas he uses and am adding them to my
> > bag of tricks. Some were overkill for this particular requirement but
> > that also makes them more general and useful.
> >
> > First, was the use of locale-independent regular expressions like
> > [[:alpha:]] that match any combination of [:lower:] and [:upper:] and
> > thus are not restricted to ASCII characters. Since I do lots of my
> > activities in languages other than English and well might include names
> > with characters not normally found in English, or not even using an
> > overlapping  alphabet, I can easily encounter items in the Name column
> > that might not match [A-Za-z] but will match with [:alpha:].
> >
> > I don't know if using [:digit:] has benefits over [0-9] and I do note
> > there was no requirement to match more complex numbers than integers so
> > no need to allow periods or scientific notation and so on.
>
> Yes, I used locale-independent regular expressions. It's a habit I
> acquired a while ago. It took some time to stop using character ranges
> but once gone I'm more comfortable with the use of classes like
> [:alpha:] and [:digit:].
> [After all my native language, (Portuguese) has
> cedillas(ES)/cedilhas(PT) and accented letters].
>
> >
> > Then there is the use of mapply. The more general version of the problem
> > presented would include a data.frame with any number of columns, where a
> > subset of the columns might need to be checked for conditions that vary
> > across the columns but may include some broad categories of conditions
> > that might be re-used. If all the conditions are regular expression
> > matches you can build, then you can extend the list Rui used to have

Re: [R] Row exclude

2022-01-29 Thread Bert Gunter
Rui:

You made my day! -- or at least considerably improved it. Your
solution was clever and clear. IMHO, it is also a terrific example of
why one should expend the effort to really learn the core features of
the language before plunging into packages with alternative paradigms.
(But lots of wise folks will disagree, so let's not debate that and
just consider me a luddite if you like).

A minor tweak would be to add punctuation characters to the regex's:

> dig <- '[[:digit:][:punct:]]' ; nondig <- '[[:alpha:][:punct:]]'
> mapply(\(r,x)grepl(r,x),list(dig, nondig, nondig), dat1)

This of course would need to be modified for numeric columns with '.'
or ',' as a decimal separator. Most examples I've seen were of
contamination by a particular character or two (like ',') for
numeric entries, which could be easily handled of course.

As usual, one of the virtues of a nice solution like yours is that it
can easily be generalized, say to the case of a data frame with 100's
of columns. One just has to be a bit careful about details.

A usual 'gotcha' will be to ensure that factor columns are read in or
converted to character.  Another is that you need to first remove any
non-character -- typically non-polluted numeric -- columns from the
data frame. This can be done by something like:
dat <- dat[, sapply(dat, is.character)]
Anyway, with those caveats and perhaps others that I either haven't
thought of or may be data-specific, here is an example that
illustrates how nicely your approach extends.
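
Both caveats can be sketched as follows, under the assumption of a small made-up data frame (`dat`, `is_chr`, and `chr_only` are illustrative names). One extra detail, not from the message above: when only one column is character, `dat[, idx]` simplifies to a plain vector unless `drop = FALSE` is given.

```r
dat <- data.frame(Name = factor(c("Alex", "Bob", "Jack3")),
                  Age  = c(20, 25, 34))

# Convert any factor columns to character, leaving other columns untouched
dat[] <- lapply(dat, function(col) if (is.factor(col)) as.character(col) else col)

# Keep only the character columns; drop = FALSE preserves the data frame
is_chr   <- sapply(dat, is.character)
chr_only <- dat[, is_chr, drop = FALSE]

# Without drop = FALSE a single selected column collapses to a vector
as_vec <- dat[, is_chr]
```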

I'll start from the OP's dat1 example.

 dat1 <-read.table(text="Name, Age, Weight
 Alex,  20,  13X
 Bob,   25,  142
 Carol, 24,  120
 John,  3BC,  175
 Katy,  35,  160
 Jack3, 34,  140",sep=",",header=TRUE,stringsAsFactors=F)

## now enlarge the table and add a gender column which should contain
## only upper or lower case 'm','f', 'o'
## but which I have corrupted with some 'g's (typos)

 set.seed(9901)
 genderAbb <- c('M','F','O','m','f','o','g')
 gender <- sample(genderAbb, 24, replace = TRUE)
 dat1 <- cbind(dat1[rep(1:6,4),],
Gender = gender
   )
head(dat1, 8)

       Name  Age Weight Gender
1      Alex   20    13X      O
2       Bob   25    142      M
3     Carol   24    120      o
4      John  3BC    175      o
5      Katy   35    160      f
6     Jack3   34    140      g
1.1    Alex   20    13X      M
2.1     Bob   25    142      f

## Now create a list of the different target 'types' for columns.
## Note that these types are user-created categories, not R data types.
## So one can use whatever names one wants.
## Or could use numeric values -- but that obfuscates the meaning and
## increases the risk of error, imo.
type <- c('char', 'int', 'gend') ## obvious

## Now, using your idea, determine the regex's that identify bad
## entries for each type,
 badpat <- list(
 char = '[[:punct:][:digit:]]', ## added stray punctuation
 int = '[[:punct:][:alpha:]]', ## ditto
 gend = '[^MFOmfo]' )  ## the only gender abbreviations that will be accepted.
## The initial '^' is the regex symbol for 'anything *but* these' in
## character classes


## Now identify what type of data each column should contain. This is
## the part that could be tedious for many columns, but I see no way of
## avoiding it. A smarter UI than I give would help!
target_type <- c('char','int','int','gend')

## and create the corresponding list of regex patterns to use for mapply()
target_pat <- badpat[target_type]

## Now do the Barradas trick
result <- mapply(\(pat,x)if(is.character(x))grepl(pat, x)
   else rep(FALSE, NROW(x)),
   target_pat,
   dat1)
head(result, 8) ## it's a matrix, not a data frame of course
## ... and then proceed as you showed.
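
The final filtering step, left implicit above, might look like the sketch below. It assumes a small fixed data frame in place of the randomly generated one, so `dat` and `clean` are illustrative names, and it uses `function(pat, x)` rather than the `\(pat, x)` shorthand so that it also runs on R versions before 4.1.0.

```r
dat <- data.frame(
  Name   = c("Alex", "Bob", "Carol", "John", "Katy", "Jack3"),
  Age    = c("20", "25", "24", "3BC", "35", "34"),
  Weight = c("13X", "142", "120", "175", "160", "140"),
  Gender = c("O", "M", "o", "o", "g", "f"),
  stringsAsFactors = FALSE
)

# Patterns that identify a bad entry for each user-defined column type
badpat <- list(char = "[[:punct:][:digit:]]",
               int  = "[[:punct:][:alpha:]]",
               gend = "[^MFOmfo]")
target_type <- c("char", "int", "int", "gend")
target_pat  <- badpat[target_type]

# Logical matrix: TRUE marks a bad cell; non-character columns are never flagged
result <- mapply(function(pat, x) {
  if (is.character(x)) grepl(pat, x) else rep(FALSE, NROW(x))
}, target_pat, dat)

# Keep only the rows with no bad cell in any column
clean <- dat[rowSums(result) == 0L, ]
```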
Cheers,
Bert


On Sat, Jan 29, 2022 at 12:46 AM Rui Barradas  wrote:
>
> Hello,
>
> Getting creative, here is another way with mapply.
>
>
> regex <- list("[[:digit:]]", "[[:alpha:]]", "[[:alpha:]]")
>
> i <- mapply(\(x, r) grepl(r, x), dat1, regex)
> dat1[rowSums(i) == 0L, ]
>
> #  Name Age Weight
> #2   Bob   25   142
> #3 Carol   24   120
> #5  Katy   35   160
>
>
> Hope this helps,
>
> Rui Barradas
>
>
> Às 06:30 de 29/01/2022, David Carlson via R-help escreveu:
> > Given that you know which columns should be numeric and which should be
> > character, finding characters in numeric columns or numbers in character
> > columns is not difficult. Your data frame consists of three character
> > columns so you can use regular expressions as Bert mentioned. First you
> > should strip the whitespace out of your data:
> >
> > dat1 <-read.table(text="Name, Age, Weight
> >Alex,  20,  13X
> >Bob,  25,  142
> >Carol, 24,  120
> >John,  3BC,  175
> >Katy,  35,  160
> >Jack3, 34,  140",sep=",", header=TRUE, stringsAsFactors=FALSE,
> > strip.white=TRUE)
> >
> > Now check to see if all of the fields are character as expected.
> >
> > sapply(dat1, typeof)
> > #Name Age  Weight
> > 

Re: [R] Row exclude

2022-01-29 Thread Avi Gross via R-help
ns 
contain any characters in "[a-zA-Z]" and return a similar index vector if they 
are OK.

What I would then have are three numeric vectors, not a matrix. Each contains a 
subset of all the indices:


> grep("[0-9]", dat1$Name, invert = TRUE)
[1] 1 2 3 4 5
> grep("[a-zA-Z]", dat1$Age, invert = TRUE)
[1] 1 2 3 5 6
> grep("[a-zA-Z]", dat1$Weight, invert = TRUE)
[1] 2 3 4 5 6

This set of data was designed to toss out one of each column so they all are of 
the same length but need not be. Like Rui, my condition for deciding which rows 
to keep is that all three of the index vectors have a particular entry. He 
summed them as logicals, but my choice has small integers so the way I combine 
them to exclude any not in all three is to use a sort of set intersect method. 
The one built-in to R only handles two at a time so I nested two calls to 
intersect but in a more general case, I would use some package (or build my own 
function) that handles intersecting any number of such items.

Here is the full code, minus the initialization.


rows.keep <-
intersect(intersect(grep("[0-9]", dat1$Name, invert = TRUE),
grep("[a-zA-Z]", dat1$Age, invert = TRUE)),
  grep("[a-zA-Z]", dat1$Weight, invert = TRUE))
result <- dat1[rows.keep,]




-Original Message-
From: Avi Gross via R-help 
To: ruipbarra...@sapo.pt ; dcarl...@tamu.edu 
; bgunter.4...@gmail.com 
Cc: r-help@r-project.org 
Sent: Sat, Jan 29, 2022 1:04 pm
Subject: Re: [R] Row exclude

There are many creative ways to solve problems and some may get you in trouble
if you present them in class while even in some work situations, they may be
hard for most to understand, let alone maintain and make changes.

This group is amorphous enough that we have people who want "help" who are new
to the language, but also people who know plenty and encounter a new kind of
problem, and of course people who want to make use of what they see as free
labor.

Rui presented a very interesting idea and I like some aspects. But if presented
to most people, they might have to start looking up things.

But I admit I liked some of the ideas he uses and am adding them to my bag of
tricks. Some were overkill for this particular requirement but that also makes
them more general and useful.

First, was the use of locale-independent regular expressions like [[:alpha:]]
that match any combination of [:lower:] and [:upper:] and thus are not
restricted to ASCII characters. Since I do lots of my activities in languages
other than English and well might include names with characters not normally
found in English, or not even using an overlapping  alphabet, I can easily
encounter items in the Name column that might not match [A-Za-z] but will match
with [:alpha:].

I don't know if using [:digit:] has benefits over [0-9] and I do note there was
no requirement to match more complex numbers than integers so no need to allow
periods or scientific notation and so on.

Then there is the use of mapply. The more general version of the problem
presented would include a data.frame with any number of columns, where a subset
of the columns might need to be checked for conditions that vary across the
columns but may include some broad categories of conditions that might be
re-used. If all the conditions are regular expression matches you can build,
then you can extend the list Rui used to have more items and also include
expressions that always match so that some columns are effectively ignored:

   regex <- list("[[:digit:]]", "[[:alpha:]]", "[[:alpha:]]", "[.*]")


So this generalizes to N columns as long as you supply exactly N patterns in 
the list, albeit mapply does recycle arguments if needed as in the simplest 
case where you want all columns checked the same way.
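
A word of caution on ignoring columns: in Rui's scheme a match marks an entry as bad, so a column that should be ignored needs a placeholder that never flags anything rather than a pattern that always matches. The sketch below is one possible way to do that, using `NA` as a "skip this column" marker; the marker, the helper `flag_bad`, and the sample data are assumptions made for illustration, not part of Rui's code.

```r
dat <- data.frame(
  Name = c("Alex", "Bob", "Jack3"),
  Age  = c("20", "2x", "34"),
  Note = c("ok 1", "n/a", "x9"),   # free text, should not be checked
  stringsAsFactors = FALSE
)

# One entry per column; NA means "do not check this column"
regex <- list("[[:digit:]]", "[[:alpha:]]", NA)

# Hypothetical helper: never flags anything when the pattern is NA
flag_bad <- function(x, r) {
  if (is.na(r)) rep(FALSE, length(x)) else grepl(r, x)
}

i    <- mapply(flag_bad, dat, regex)
kept <- dat[rowSums(i) == 0L, ]
```

Here only the first row survives: row 2 has a letter in Age and row 3 a digit in Name, while the contents of Note play no part.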
Rui then uses an anonymous function to pass to mapply() and that is a newish 
feature added recently to R, I think. It was perhaps meant specifically to be 
used with the new pipe symbol, but can be used anywhere but perhaps not in 
older versions of R.

   \(x, r) grepl(r, x)
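
For older installations: the `\(x)` shorthand arrived in R 4.1.0, and the long form `function(x, r)` behaves the same. A minimal sketch, assuming the thread's sample data read with `strip.white = TRUE`:

```r
dat1 <- read.table(text = "Name,Age,Weight
Alex,20,13X
Bob,25,142
Carol,24,120
John,3BC,175
Katy,35,160
Jack3,34,140", sep = ",", header = TRUE, stringsAsFactors = FALSE,
                   strip.white = TRUE)

regex <- list("[[:digit:]]", "[[:alpha:]]", "[[:alpha:]]")

# Same call as Rui's, written with the classic anonymous-function syntax
i    <- mapply(function(x, r) grepl(r, x), dat1, regex)
kept <- dat1[rowSums(i) == 0L, ]
```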


I note Rui also uses grepl() which returns a logical vector. I will show my 
first attempt at the end where I used grep() to return index numbers of matches 
instead. For this context, though, he made use of the fact that mapply in this 
case returns a matrix of type logical:
i <- mapply(\(x, r) grepl(r, x), dat1, regex)

> i
      Name   Age Weight
[1,] FALSE FALSE   TRUE
[2,] FALSE FALSE  FALSE
[3,] FALSE FALSE  FALSE
[4,] FALSE  TRUE  FALSE
[5,] FALSE FALSE  FALSE
[6,]  TRUE FALSE  FALSE

And since R treats TRUE as 1 and FALSE as 0, then summing the rows gives you a 
small integer between 0 and the number of columns, inclusive, and only rows 
with no TRUE in them are wanted for this purpose:

dat1[rowSums(i) == 0L, ]
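
The same logical matrix can also report which rows were rejected and which cell caused it. A minimal sketch, assuming `dat1` and `regex` as in Rui's message; `bad_rows` and `cells` are names introduced here for illustration.

```r
dat1 <- read.table(text = "Name,Age,Weight
Alex,20,13X
Bob,25,142
Carol,24,120
John,3BC,175
Katy,35,160
Jack3,34,140", sep = ",", header = TRUE, stringsAsFactors = FALSE,
                   strip.white = TRUE)

regex <- list("[[:digit:]]", "[[:alpha:]]", "[[:alpha:]]")
i <- mapply(function(x, r) grepl(r, x), dat1, regex)

# Rows with at least one flagged cell, in their original order
bad_rows <- dat1[rowSums(i) > 0L, ]

# Row and column position of every flagged cell
cells <- which(i, arr.ind = TRUE)
```

`bad_rows` holds Alex, John, and Jack3, and `cells` pairs each flagged row number with the column that triggered it.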

All in all, nicely done, but not trivial to read without comments, LOL!
And, y

Re: [R] Row exclude

2022-01-29 Thread Rui Barradas
 
What goes in is a vector of individual items from a column of the data. 
What goes out is the indices of which ones I want to keep that can be 
used to index the entire data.frame. Based on the ample data, it returns 
1:5 as row 6 has a digit in "Jack3".



   grep("[0-9]", dat1$Name, invert = TRUE)


Similarly, two other grep() statements test if the second and third 
columns contain any characters in "[a-zA-Z]" and return a similar index 
vector if they are OK.


What I would then have are three numeric vectors, not a matrix. Each 
contains a subset of all the indices:




grep("[0-9]", dat1$Name, invert = TRUE)

[1] 1 2 3 4 5

grep("[a-zA-Z]", dat1$Age, invert = TRUE)

[1] 1 2 3 5 6

grep("[a-zA-Z]", dat1$Weight, invert = TRUE)

[1] 2 3 4 5 6

This set of data was designed to toss out one of each column so they all 
are of the same length but need not be. Like Rui, my condition for 
deciding which rows to keep is that all three of the index vectors have 
a particular entry. He summed them as logicals, but my choice has small 
integers so the way I combine them to exclude any not in all three is to 
use a sort of set intersect method. The one built-in to R only handles 
two at a time so I nested two calls to intersect but in a more general 
case, I would use some package (or build my own function) that handles 
intersecting any number of such items.


Here is the full code, minus the initialization.


rows.keep <-
intersect(intersect(grep("[0-9]", dat1$Name, invert = TRUE),
                     grep("[a-zA-Z]", dat1$Age, invert = TRUE)),
           grep("[a-zA-Z]", dat1$Weight, invert = TRUE))
result <- dat1[rows.keep,]




Using the same idea, another two options, both with Reduce.

The 1st uses Avi's grep and regex's. The latter could be the character
classes "[[:alpha:]]" and "[[:digit:]]" but this code is inspired by
his. The results are put in a list and Reduce intersects the list
members. Then subsetting is as usual.


The 2nd uses the fact that Map is a wrapper for mapply that defaults to
not simplifying its output. grep/invert will find the non-matches and 
Reduce intersects the result list, as above.

From ?Map:

Map is a simple wrapper to mapply which does not attempt to simplify the 
result, similar to Common Lisp's mapcar (with arguments being recycled, 
however). Future versions may allow some control of the result type.


# 1st
grep_list <- list(
  grep("[0-9]", dat1$Name, invert = TRUE),
  grep("[a-zA-Z]", dat1$Age, invert = TRUE),
  grep("[a-zA-Z]", dat1$Weight, invert = TRUE)
)
keep1 <- Reduce(intersect, grep_list)
dat1[keep1,]

# 2nd
keep2 <- Map(\(x, r) grep(r, x, invert = TRUE), dat1, regex)
keep2 <- Reduce(intersect, keep2)

identical(keep1, keep2)
#[1] TRUE


Hope this helps,

Rui Barradas











-Original Message-
From: Rui Barradas 
To: David Carlson ; Bert Gunter 
Cc: r-help@R-project.org (r-help@r-project.org) 
Sent: Sat, Jan 29, 2022 3:46 am
Subject: Re: [R] Row exclude

Hello,

Getting creative, here is another way with mapply.


regex <- list("[[:digit:]]", "[[:alpha:]]", "[[:alpha:]]")

i <- mapply(\(x, r) grepl(r, x), dat1, regex)
dat1[rowSums(i) == 0L, ]

#   Name Age Weight
#2   Bob  25    142
#3 Carol  24    120
#5  Katy  35    160


Hope this helps,

Rui Barradas


Às 06:30 de 29/01/2022, David Carlson via R-help escreveu:
 > Given that you know which columns should be numeric and which should be
 > character, finding characters in numeric columns or numbers in character
 > columns is not difficult. Your data frame consists of three character
 > columns so you can use regular expressions as Bert mentioned. First you
 > should strip the whitespace out of your data:
 >
 > dat1 <-read.table(text="Name, Age, Weight
 >    Alex,  20,  13X
 >    Bob,  25,  142
 >    Carol, 24,  120
 >    John,  3BC,  175
 >    Katy,  35,  160
 >    Jack3, 34,  140",sep=",", header=TRUE, stringsAsFactors=FALSE,
 > strip.white=TRUE)
 >
 > Now check to see if all of the fields are character as expected.
 >
 > sapply(dat1, typeof)
 > #        Name        Age      Weight
 > # "character" "character" "character"
 >
 > Now identify character variables containing numbers and numeric variables
 > containing characters:
 >
 > BadName <- which(grepl("[[:digit:]]", dat1$Name))
 > BadAge <- which(grepl("[[:alpha:]]", dat1$Age))
 > BadWeight <- which(grepl("[[:alpha:]]", dat1$Weight))
 >
 > Next remove those rows:
 >
 > (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ])
 > #     Name Age Weight
 > #  2   Bob  25    142
 > #  3 Carol  24    120
 > #  5  Katy  35    160
 >
 > You still 

Re: [R] Row exclude

2022-01-29 Thread Avi Gross via R-help
> grep("[a-zA-Z]", dat1$Weight, invert = TRUE)
[1] 2 3 4 5 6
This set of data was designed to toss out one entry per column, so the three 
index vectors all have the same length, but they need not. Like Rui, my 
condition for keeping a row is that its index appears in all three index 
vectors. He summed logicals, but my choice yields small integer indices, so to 
exclude any row not present in all three I use a set intersection. The 
intersect() built into R only handles two vectors at a time, so I nested two 
calls to it; in a more general case, I would use some package (or build my own 
function) that handles intersecting any number of such items.

Here is the full code, minus the initialization.

rows.keep <- intersect(intersect(grep("[0-9]", dat1$Name, invert = TRUE),
                                 grep("[a-zA-Z]", dat1$Age, invert = TRUE)),
                       grep("[a-zA-Z]", dat1$Weight, invert = TRUE))
result <- dat1[rows.keep,]
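
For the more general case mentioned above, a minimal sketch of a build-your-own function. The name intersect_all is hypothetical and not part of base R; it simply folds the two-argument intersect over however many vectors are passed in. The data is assumed to be dat1 as read in by David's read.table call.

```r
dat1 <- read.table(text = "Name, Age, Weight
Alex,  20,  13X
Bob,  25,  142
Carol, 24,  120
John,  3BC,  175
Katy,  35,  160
Jack3, 34,  140", sep = ",", header = TRUE, stringsAsFactors = FALSE,
strip.white = TRUE)

# Hypothetical helper: intersect any number of vectors by applying
# base R's two-argument intersect() pairwise, left to right.
intersect_all <- function(...) Reduce(intersect, list(...))

rows.keep <- intersect_all(grep("[0-9]", dat1$Name, invert = TRUE),
                           grep("[a-zA-Z]", dat1$Age, invert = TRUE),
                           grep("[a-zA-Z]", dat1$Weight, invert = TRUE))
result <- dat1[rows.keep, ]
result
```

Adding a fourth column check then only means adding a fourth argument, with no further nesting.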










-----Original Message-----
From: Rui Barradas 
To: David Carlson ; Bert Gunter 
Cc: r-help@R-project.org (r-help@r-project.org) 
Sent: Sat, Jan 29, 2022 3:46 am
Subject: Re: [R] Row exclude

Hello,

Getting creative, here is another way with mapply.


regex <- list("[[:digit:]]", "[[:alpha:]]", "[[:alpha:]]")

i <- mapply(\(x, r) grepl(r, x), dat1, regex)
dat1[rowSums(i) == 0L, ]

#    Name Age Weight
# 2   Bob  25    142
# 3 Carol  24    120
# 5  Katy  35    160


Hope this helps,

Rui Barradas


At 06:30 on 29/01/2022, David Carlson via R-help wrote:
> Given that you know which columns should be numeric and which should be
> character, finding characters in numeric columns or numbers in character
> columns is not difficult. Your data frame consists of three character
> columns so you can use regular expressions as Bert mentioned. First you
> should strip the whitespace out of your data:
>
> dat1 <-read.table(text="Name, Age, Weight
>    Alex,  20,  13X
>    Bob,  25,  142
>    Carol, 24,  120
>    John,  3BC,  175
>    Katy,  35,  160
>    Jack3, 34,  140",sep=",", header=TRUE, stringsAsFactors=FALSE,
> strip.white=TRUE)
>
> Now check to see if all of the fields are character as expected.
>
> sapply(dat1, typeof)
> #        Name        Age      Weight
> # "character" "character" "character"
>
> Now identify character variables containing numbers and numeric variables
> containing characters:
>
> BadName <- which(grepl("[[:digit:]]", dat1$Name))
> BadAge <- which(grepl("[[:alpha:]]", dat1$Age))
> BadWeight <- which(grepl("[[:alpha:]]", dat1$Weight))
>
> Next remove those rows:
>
> (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ])
> #     Name Age Weight
> #  2   Bob  25    142
> #  3 Carol  24    120
> #  5  Katy  35    160
>
> You still need to convert Age and Weight to numeric, e.g. dat2$Age <-
> as.numeric(dat2$Age).
>
> David Carlson
>
>
> On Fri, Jan 28, 2022 at 11:59 PM Bert Gunter  wrote:
>
>> As character 'polluted' entries will cause a column to be read in (via
>> read.table and relatives) as factor or character data, this sounds like a
>> job for regular expressions. If you are not familiar with this subject,
>> time to learn. And, yes, some heavy lifting will be required.
>> See ?regexp for a start maybe? Or the stringr package?
>>
>> Cheers,
>> Bert
>>
>>
>>
>>
>> On Fri, Jan 28, 2022, 7:08 PM Val  wrote:
>>
>>> Hi All,
>>>
>>> I want to remove rows that contain a character string in an integer
>>> column or a digit in a character column.
>>>
>>> Sample data
>>>
>>> dat1 <-read.table(text="Name, Age, Weight
>>>  Alex,  20,  13X
>>>  Bob,  25,  142
>>>  Carol, 24,  120
>>>  John,  3BC,  175
>>>  Katy,  35,  160
>>>  Jack3, 34,  140",sep=",",header=TRUE,stringsAsFactors=F)
>>>
>>> If the Age/Weight column contains any character(s), then remove that row;
>>> if the Name column contains a digit, then remove that row.
>>> Desired output
>>>
>>>    Name  Age weight
>>> 1  Bob    25    142
>>> 2  Carol  24    1

Re: [R] Row exclude

2022-01-28 Thread Bert Gunter
As character 'polluted' entries will cause a column to be read in (via
read.table and relatives) as factor or character data, this sounds like a
job for regular expressions. If you are not familiar with this subject,
time to learn. And, yes, some heavy lifting will be required.
See ?regexp for a start maybe? Or the stringr package?
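
As a small starting point, a sketch of the kind of regular expression test this calls for, using base R's grepl on made-up vectors (name and weight below are illustrative, not the poster's data frame). The last pattern, which accepts only strings made up entirely of digits, is a stricter alternative suggested here as an assumption about what "integer column" should mean.

```r
name   <- c("Alex", "Bob", "Jack3")
weight <- c("13X", "142", "140")

# TRUE where a name contains at least one digit.
has_digit <- grepl("[[:digit:]]", name)

# TRUE where a weight contains at least one letter.
has_alpha <- grepl("[[:alpha:]]", weight)

# TRUE only where the whole string consists of digits.
all_digits <- grepl("^[0-9]+$", weight)

has_digit
has_alpha
all_digits
```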

Cheers,
Bert




On Fri, Jan 28, 2022, 7:08 PM Val  wrote:

> Hi All,
>
> I want to remove rows that contain a character string in an integer
> column or a digit in a character column.
>
> Sample data
>
> dat1 <-read.table(text="Name, Age, Weight
>  Alex,  20,  13X
>  Bob,   25,  142
>  Carol, 24,  120
>  John,  3BC,  175
>  Katy,  35,  160
>  Jack3, 34,  140",sep=",",header=TRUE,stringsAsFactors=F)
>
> If the Age/Weight column contains any character(s), then remove that row;
> if the Name column contains a digit, then remove that row.
> Desired output
>
>    Name  Age weight
> 1  Bob    25    142
> 2  Carol  24    120
> 3  Katy   35    160
>
> Thank you,
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
