Re: [R] Row exclude

2022-01-30 Thread David Carlson via R-help
You need to add "-": `(dat3 <- dat1[-unique(c(BadName, BadAge,
BadWeight)), ])`, which makes the command a NOT (exclude these rows).
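
For context, a negative index drops the listed rows, while the same index without the minus sign selects them. A minimal sketch, assuming `dat1`, `BadName`, `BadAge`, and `BadWeight` are built as in the quoted message below; `bad`, `kept`, and `excluded` are names introduced here for illustration, and `sort()` is added only so the selected rows come back in their original order:

```r
# Sample data from the thread; strip.white = TRUE removes the padding spaces
dat1 <- read.table(text = "Name, Age, Weight
  Alex,  20,  13X
  Bob,   25,  142
  Carol, 24,  120
  John,  3BC, 175
  Katy,  35,  160
  Jack3, 34,  140", sep = ",", header = TRUE,
  stringsAsFactors = FALSE, strip.white = TRUE)

# Row positions with a digit in Name or a letter in Age or Weight
BadName   <- which(grepl("[[:digit:]]", dat1$Name))
BadAge    <- which(grepl("[[:alpha:]]", dat1$Age))
BadWeight <- which(grepl("[[:alpha:]]", dat1$Weight))

# Unique bad row positions, sorted into original row order
bad <- sort(unique(c(BadName, BadAge, BadWeight)))

kept     <- dat1[-bad, ]  # negative index: every row except the bad ones
excluded <- dat1[bad, ]   # positive index: only the bad rows
excluded
```

Without `sort()`, the positive index returns the same rows, but in the order the positions were concatenated (Name problems first, then Age, then Weight).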

David

On Sun, Jan 30, 2022 at 11:00 AM Val  wrote:

> Thank you David.
>
> What about if I want to list the excluded rows?
> I used this
> (dat3 <- dat1[unique(c(BadName, BadAge, BadWeight)), ])
>
> It did not work. The desired output is:
>   Alex,  20,  13X
>   John,  3BC, 175
>   Jack3, 34,  140
>
> Thank you,
>
> On Sat, Jan 29, 2022 at 10:15 PM David Carlson  wrote:
>
>> It is possible that there would be errors on the same row for different
>> columns. This does not happen in your example. If row 4 was "John6, 3BC,
>> 175X" then row 4 would be included 3 times, but we only need to remove it
>> once. Removing the duplicates is not necessary since R would not get
>> confused, but length(unique(c(BadName, BadAge, BadWeight))) indicates how
>> many lines are being removed.
>>
>> David
>>
>> On Sat, Jan 29, 2022 at 8:32 PM Val  wrote:
>>
>>> Thank you David for your help.
>>>
>>> I just have one question on this. What is the purpose of  using the
>>> "unique" function on this?
>>>   (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ])
>>>
>>> I got the same result without using it.
>>>   (dat2 <- dat1[-(c(BadName, BadAge, BadWeight)), ])
>>>
>>> My concern is that when I apply this to a large data set, the
>>> "unique" function may consume resources (time and memory).
>>>
>>> Thank you.
>>>
>>> On Sat, Jan 29, 2022 at 12:30 AM David Carlson 
>>> wrote:
>>>
 Given that you know which columns should be numeric and which should be
 character, finding characters in numeric columns or numbers in character
 columns is not difficult. Your data frame consists of three character
 columns so you can use regular expressions as Bert mentioned. First
 you should strip the whitespace out of your data:

 dat1 <-read.table(text="Name, Age, Weight
   Alex,  20,  13X
   Bob,  25,  142
   Carol, 24,  120
   John,  3BC,  175
   Katy,  35,  160
   Jack3, 34,  140",sep=",", header=TRUE, stringsAsFactors=FALSE,
 strip.white=TRUE)

 Now check to see if all of the fields are character as expected.

 sapply(dat1, typeof)
 #        Name         Age      Weight
 # "character" "character" "character"

 Now identify character variables containing numbers and numeric
 variables containing characters:

 BadName <- which(grepl("[[:digit:]]", dat1$Name))
 BadAge <- which(grepl("[[:alpha:]]", dat1$Age))
 BadWeight <- which(grepl("[[:alpha:]]", dat1$Weight))

 Next remove those rows:

 (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ])
 #     Name Age Weight
 #  2   Bob  25    142
 #  3 Carol  24    120
 #  5  Katy  35    160

 You still need to convert Age and Weight to numeric, e.g. dat2$Age <-
 as.numeric(dat2$Age).

 David Carlson


 On Fri, Jan 28, 2022 at 11:59 PM Bert Gunter 
 wrote:

> As character 'polluted' entries will cause a column to be read in (via
> read.table and relatives) as factor or character data, this sounds like a
> job for regular expressions. If you are not familiar with this subject,
> time to learn. And, yes, some heavy lifting will be required.
> See ?regexp for a start maybe? Or the stringr package?
>
> Cheers,
> Bert
>
>
>
>
> On Fri, Jan 28, 2022, 7:08 PM Val  wrote:
>
> > Hi All,
> >
> > I want to remove rows that contain a character string in an integer
> > column or a digit in a character column.
> >
> > Sample data
> >
> > dat1 <-read.table(text="Name, Age, Weight
> >  Alex,  20,  13X
> >  Bob,   25,  142
> 

Re: [R] Weird behaviour of order() when having multiple ties

2022-01-30 Thread Jeff Newmiller
Why should 6,5 be more correct than 5,6? How is R supposed to reach that 
conclusion based on comparing values?
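
To make the point concrete: for tied values, either arrangement of their positions yields a correctly sorted vector. A small sketch using the vector from the quoted post (`x`, `i`, and `j` are names introduced here for illustration):

```r
x <- c(0.6, 0.5, 0.3, 0.2, 0.1, 0.1)

i <- order(x)             # positions of the elements in ascending order
j <- c(6, 5, 4, 3, 2, 1)  # same permutation with the two tied positions swapped

x[i]                      # values sorted ascending
x[j]                      # also sorted: positions 5 and 6 hold equal values
identical(x[i], x[j])     # both permutations give the same sorted vector
```

The help page for `order()` states that unresolved ties are left in their original ordering, which is why position 5 comes before position 6 in the result.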


On January 30, 2022 1:16:44 AM PST, Stefan Fleck  
wrote:
>I am experiencing a weird behavior of `order()` for numeric vectors. I
>tested on 3.6.2 and 4.1.2 for windows and R 4.0.2 on ubuntu. Can anyone
>confirm?
>
>order(
>  c(
>    0.6,
>    0.5,
>    0.3,
>    0.2,
>    0.1,
>    0.1
>  )
>)
>## Result [should be in order]
>[1] 5 6 4 3 2 1
>
>The sort order is obviously wrong. This only occurs if I have multiple
>ties. The problem does _not_ occur for decreasing = TRUE.
>
>   [[alternative HTML version deleted]]
>
>__
>R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.

-- 
Sent from my phone. Please excuse my brevity.



Re: [R] Weird behaviour of order() when having multiple ties

2022-01-30 Thread Rui Barradas

Hello,

I am not seeing an error; the order is right:

x <- c(
  0.6,
  0.5,
  0.3,
  0.2,
  0.1,
  0.1
)
(i <- order(x))
#> [1] 5 6 4 3 2 1
x[i]
#> [1] 0.1 0.1 0.2 0.3 0.5 0.6
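
If the expectation was a vector of ranks rather than a sorting permutation (an assumption; the original post does not say which output was expected), `rank()` is the function that returns those. A short sketch with the same `x`:

```r
x <- c(0.6, 0.5, 0.3, 0.2, 0.1, 0.1)

# order(): the k-th entry is the position in x of the k-th smallest value
order(x)

# rank(): the k-th entry is the rank of x[k]; tied values share the
# average of their ranks by default
rank(x)

# Ties broken by position instead of averaged
rank(x, ties.method = "first")
```

The two functions answer different questions, so their outputs coincide only in special cases.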


Hope this helps,

Rui Barradas

Às 09:16 de 30/01/2022, Stefan Fleck escreveu:

I am experiencing a weird behavior of `order()` for numeric vectors. I
tested on 3.6.2 and 4.1.2 for windows and R 4.0.2 on ubuntu. Can anyone
confirm?

order(
  c(
    0.6,
    0.5,
    0.3,
    0.2,
    0.1,
    0.1
  )
)
## Result [should be in order]
[1] 5 6 4 3 2 1

The sort order is obviously wrong. This only occurs if I have multiple
ties. The problem does _not_ occur for decreasing = TRUE.



Re: [R] Row exclude

2022-01-30 Thread Val
Thank you David.

What about if I want to list the excluded rows?
I used this
(dat3 <- dat1[unique(c(BadName, BadAge, BadWeight)), ])

It did not work. The desired output is:
  Alex,  20,  13X
  John,  3BC, 175
  Jack3, 34,  140
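
For reference, one way to get both the excluded and the retained rows is a logical mask instead of integer positions. A minimal sketch, assuming `dat1` is read as in the quoted message below and the same regular expressions are used; `is_bad` is a name introduced here for illustration:

```r
dat1 <- read.table(text = "Name, Age, Weight
  Alex,  20,  13X
  Bob,   25,  142
  Carol, 24,  120
  John,  3BC, 175
  Katy,  35,  160
  Jack3, 34,  140", sep = ",", header = TRUE,
  stringsAsFactors = FALSE, strip.white = TRUE)

# TRUE for any row with a digit in Name or a letter in Age or Weight
is_bad <- grepl("[[:digit:]]", dat1$Name) |
  grepl("[[:alpha:]]", dat1$Age) |
  grepl("[[:alpha:]]", dat1$Weight)

dat1[is_bad, ]   # excluded rows, in their original order
dat1[!is_bad, ]  # retained rows
sum(is_bad)      # number of rows excluded
```

A logical mask needs neither `which()` nor `unique()`, and it avoids one pitfall of negative indexing: when no row is bad, `dat1[-integer(0), ]` returns zero rows, whereas `dat1[!is_bad, ]` returns all of them.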

Thank you,

On Sat, Jan 29, 2022 at 10:15 PM David Carlson  wrote:

> It is possible that there would be errors on the same row for different
> columns. This does not happen in your example. If row 4 was "John6, 3BC,
> 175X" then row 4 would be included 3 times, but we only need to remove it
> once. Removing the duplicates is not necessary since R would not get
> confused, but length(unique(c(BadName, BadAge, BadWeight))) indicates how
> many lines are being removed.
>
> David
>
> On Sat, Jan 29, 2022 at 8:32 PM Val  wrote:
>
>> Thank you David for your help.
>>
>> I just have one question on this. What is the purpose of  using the
>> "unique" function on this?
>>   (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ])
>>
>> I got the same result without using it.
>>   (dat2 <- dat1[-(c(BadName, BadAge, BadWeight)), ])
>>
>> My concern is that when I apply this to a large data set, the
>> "unique" function may consume resources (time and memory).
>>
>> Thank you.
>>
>> On Sat, Jan 29, 2022 at 12:30 AM David Carlson  wrote:
>>
>>> Given that you know which columns should be numeric and which should be
>>> character, finding characters in numeric columns or numbers in character
>>> columns is not difficult. Your data frame consists of three character
>>> columns so you can use regular expressions as Bert mentioned. First you
>>> should strip the whitespace out of your data:
>>>
>>> dat1 <-read.table(text="Name, Age, Weight
>>>   Alex,  20,  13X
>>>   Bob,  25,  142
>>>   Carol, 24,  120
>>>   John,  3BC,  175
>>>   Katy,  35,  160
>>>   Jack3, 34,  140",sep=",", header=TRUE, stringsAsFactors=FALSE,
>>> strip.white=TRUE)
>>>
>>> Now check to see if all of the fields are character as expected.
>>>
>>> sapply(dat1, typeof)
>>> #        Name         Age      Weight
>>> # "character" "character" "character"
>>>
>>> Now identify character variables containing numbers and numeric
>>> variables containing characters:
>>>
>>> BadName <- which(grepl("[[:digit:]]", dat1$Name))
>>> BadAge <- which(grepl("[[:alpha:]]", dat1$Age))
>>> BadWeight <- which(grepl("[[:alpha:]]", dat1$Weight))
>>>
>>> Next remove those rows:
>>>
>>> (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ])
>>> #     Name Age Weight
>>> #  2   Bob  25    142
>>> #  3 Carol  24    120
>>> #  5  Katy  35    160
>>>
>>> You still need to convert Age and Weight to numeric, e.g. dat2$Age <-
>>> as.numeric(dat2$Age).
>>>
>>> David Carlson
>>>
>>>
>>> On Fri, Jan 28, 2022 at 11:59 PM Bert Gunter 
>>> wrote:
>>>
 As character 'polluted' entries will cause a column to be read in (via
 read.table and relatives) as factor or character data, this sounds like a
 job for regular expressions. If you are not familiar with this subject,
 time to learn. And, yes, some heavy lifting will be required.
 See ?regexp for a start maybe? Or the stringr package?

 Cheers,
 Bert




 On Fri, Jan 28, 2022, 7:08 PM Val  wrote:

 > Hi All,
 >
 > I want to remove rows that contain a character string in an integer
 > column or a digit in a character column.
 >
 > Sample data
 >
 > dat1 <-read.table(text="Name, Age, Weight
 >  Alex,  20,  13X
 >  Bob,   25,  142
 >  Carol, 24,  120
 >  John,  3BC,  175
 >  Katy,  35,  160
 >  Jack3, 34,  140",sep=",",header=TRUE,stringsAsFactors=F)
 >
 > If the Age/Weight column contains any character(s), remove that row;
 > if the Name column contains a digit, remove that row.
 > Desired output
 >
 >     Name    Age  Weight
 > 1   Bob     25   142
 > 2   Carol   24   120
 > 3   Katy    35   160
 >
 > Thank you,
 >

Re: [R] progress of LDA algorithm...

2022-01-30 Thread Bert Gunter
I am not an expert, but I believe your extrapolation idea is unsound.
Again, post on the HPC list to get expert feedback instead of trying
to reinvent your own wheel. I will not respond further.
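
For readers who want to see how run time scales before trusting any extrapolation, here is a minimal sketch with base R's `system.time()`. `toy_fit` is a hypothetical stand-in introduced for illustration, not part of the `topicmodels` API, and timings from a toy function say nothing about LDA itself:

```r
# Hypothetical stand-in for an expensive model fit; replace with the real call
toy_fit <- function(n) {
  m <- matrix(runif(n * 50), nrow = n)
  crossprod(m)  # some work whose cost grows with n
}

sizes <- c(1000, 2000, 4000)

# Elapsed wall-clock seconds for each input size
elapsed <- vapply(
  sizes,
  function(n) system.time(toy_fit(n))[["elapsed"]],
  numeric(1)
)

timings <- data.frame(n = sizes, elapsed = elapsed)
timings
```

If the elapsed times do not grow roughly in proportion to `n`, a linear extrapolation from a small sample will mislead, which is consistent with the caution above.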

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Sun, Jan 30, 2022 at 3:02 AM akshay kulkarni  wrote:
>
> Dear Avi and Bert,
>
> I think I got my answer. I will just run it with a small sample, check
> the execution time, and extrapolate from that. By the way, LDA (I am
> using the topicmodels package) cannot be parallelized, right? Thanks in
> advance.
>
> Thanking you,
> Yours sincerely,
> AKSHAY M KULKARNI
> 
> From: R-help  on behalf of Avi Gross via R-help 
> 
> Sent: Sunday, January 30, 2022 4:15 AM
> Cc: r-help@r-project.org 
> Subject: Re: [R] progress of LDA algorithm...
>
> I agree with Bert that this is way off topic and one few here know (or care) 
> about.
>
> Generally, if a package documents its functions in manual pages, it may
> offer options such as verbose=TRUE or various levels of output that
> satisfy the request; otherwise, users may make a copy of the code and add
> their own print or logging statements.
>
> If the request is more general such as how to run a program under some 
> debugging method and set checkpoints at which some reporting is done, that 
> too is a bit outside the normal uses of this forum.
>
> The usual suggestion here is to contact the package maintainer, with no 
> guarantee of getting any useful response, or find a forum way more specific 
> than R HELP just because part of the package is in R.
>
> As it happens, the lda() function being discussed may (or may not) be in the 
> MASS package. Looking at the documentation, I saw no obvious hook to show it 
> as it makes progress. Of course Akshay can do some external testing using 
> standard R timing mechanisms to see how long it takes to do just some of the 
> news categories without going in to the details of the function called and 
> that might partially answer his question. Asking how to do that might fit the 
> parameters here.
>
>
> -----Original Message-----
> From: Bert Gunter 
> To: akshay kulkarni 
> Cc: R help Mailing list 
> Sent: Sat, Jan 29, 2022 3:34 pm
> Subject: Re: [R] progress of LDA algorithm...
>
>
> I presume this is in some specialized package that you have not told
> us about -- topicmodels maybe? It is therefore off topic here. In any
> case, this is the sort of question for which you should contact the
> package maintainer (?maintainer).
>
> As your question may also intersect with high performance computing
> considerations, you might want to post  it on the R-Sig-HPC list,
> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
> On Sat, Jan 29, 2022 at 8:27 AM akshay kulkarni  wrote:
> >
> > dear members,
> >   I want to run LDA(latent Dirichlet allocation) on 
> > certain news articles. i have the following questions:
> >
> >
> >   1.  Is there any way to know the progress of the execution of the LDA
> > algorithm?
> >   2.  I read on SO that the more memory you have, the faster LDA runs.
> > I am using an AWS z1d instance with 48 cores and about 325 GB
> > RAM. I have multiple categories of news, but one of them is much larger
> > than the others, containing about 25000 articles. Is it preferable to send
> > those categories individually to different processors, and does R free
> > up the memory after running on the smaller categories so that the largest
> > category can run with more memory? Or is it preferable to first run the
> > smaller sets, finish the job, and then run the largest category?
> >
> > Thanking You,
> > Yours sincerely,
> > AKSHAY M KULKARNI
> >
>