[R] Filtering data with dplyr or grep and losing data?
Hello Experts:

I have a log file with lines of up to about 1200 characters. I want to read it in and then extract certain portions of the file into new columns, starting with the rows that contain the text "[DF_API: input string]". When I read the file and then filter on the rows I am interested in, it almost seems like I am losing data. I tried this with dplyr's filter() and with standard grep(), with the same result, and I am not sure why this is the case. I appreciate your help with this. The code and the data are at the link below.

Satish

Code is given below:

library(dplyr)
library(stringr)  # str_detect() comes from stringr, not dplyr

setwd("C:/Users/satis/Documents/VF/df_issue_dec01")

sec1 <- read.delim(file = "secondary1_aa_small.log")
head(sec1)
names(sec1) <- c("V1")

sec1_test <- filter(sec1, str_detect(V1, "DF_API: input string"))
head(sec1_test)

sec1_test2 <- sec1[grep("DF_API: input string", sec1$V1, perl = TRUE), ]
head(sec1_test2)

write.csv(sec1_test, file = "test_out.txt", row.names = FALSE, quote = FALSE)
write.csv(sec1_test2, file = "test2_out.txt", row.names = FALSE, quote = FALSE)

Data (and code) are given at the link below. Sorry, I should have used dput().
https://spaces.hightail.com/space/arJlYkgIev

Satish Vadlamani

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
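[Editor's note] One common cause of "lost" rows in this situation: read.delim() keeps quote processing on by default (quote = "\""), so an unmatched quote character in a log line can cause several physical lines to be swallowed into one field before the filter ever runs. A minimal sketch, assuming the filename from the post above, that reads the log as raw text so no quote parsing applies:

```r
# Sketch: treat the log as plain lines; no quote/comment parsing applies.
# Filename is taken from the post above.
lines <- readLines("secondary1_aa_small.log")

# fixed = TRUE matches the literal text, so no regex escaping is needed
hits <- grep("DF_API: input string", lines, fixed = TRUE, value = TRUE)

length(hits)  # number of matching rows, as a sanity check
writeLines(hits, "test_out.txt")
```

Comparing length(hits) against the row counts from the read.delim() version should show whether rows disappear at the reading stage or at the filtering stage. Alternatively, read.delim(..., quote = "") disables quote processing while keeping the rest of the original code unchanged.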
Re: [R] How to group by and get distinct rows of grouped rows based on certain criteria
Thank you Bill and Sarah for your help. I was able to do the same with dplyr using the code below, but I could not post it earlier because at that time my message had not gone through yet.

file1 <- select(file1, ATP.Group, Business.Event, Category)

file1_1 <- file1 %>%
  group_by(ATP.Group, Business.Event) %>%
  filter(Category == "EQ") %>%
  distinct(ATP.Group, Business.Event)
file1_1 <- as.data.frame(file1_1)
file1_1

file1_2 <- file1 %>%
  group_by(ATP.Group, Business.Event) %>%
  distinct(ATP.Group, Business.Event)
file1_2 <- as.data.frame(file1_2)
file1_2

setdiff(select(file1_2, ATP.Group, Business.Event),
        select(file1_1, ATP.Group, Business.Event))

On Thu, Jul 14, 2016 at 1:53 PM, William Dunlap wrote:

> txt <- "|ATP Group|Business Event|Category|
> |02 |A |AC |
> |02 |A |AD |
> |02 |A |EQ |
> |ZM |A |AU |
> |ZM |A |AV |
> |ZM |A |AW |
> |02 |B |AC |
> |02 |B |AY |
> |02 |B |EQ |
> "
> d <- read.table(sep="|", text=txt, header=TRUE, strip.white=TRUE,
>                 check.names=FALSE)[, 2:4]
> str(d)
> 'data.frame': 9 obs. of 3 variables:
>  $ ATP Group     : Factor w/ 2 levels "02","ZM": 1 1 1 2 2 2 1 1 1
>  $ Business Event: Factor w/ 2 levels "A","B": 1 1 1 1 1 1 2 2 2
>  $ Category      : Factor w/ 7 levels "AC","AD","AU",..: 1 2 7 3 4 5 1 6 7
>
> unique(d[d[,"Category"] != "EQ", c("ATP Group", "Business Event")])
>   ATP Group Business Event
> 1        02              A
> 4        ZM              A
> 7        02              B
>
> unique(d[d[,"Category"] == "EQ", c("ATP Group", "Business Event")])
>   ATP Group Business Event
> 3        02              A
> 9        02              B
>
> Some folks prefer to use subset() instead of "[". The previous expression
> is equivalent to:
>
> unique(subset(d, Category == "EQ", c("ATP Group", "Business Event")))
>   ATP Group Business Event
> 3        02              A
> 9        02              B
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Thu, Jul 14, 2016 at 12:43 PM, Satish Vadlamani wrote:
>
>> Hello All:
>> I would like to get your help on the following problem.
>>
>> I have the following data and the first row is the header. Spaces are not
>> important.
>> I want to find out the distinct combinations of ATP Group and Business
>> Event (these are the field names you can see in the data below) that have
>> the Category EQ (Category is the third field) and those that do not have
>> the category EQ. In the example below, the combinations 02/A and 02/B
>> have EQ and the combination ZM/A does not.
>>
>> If I have a larger file, how do I get to this answer?
>>
>> What did I try (with dplyr)?
>>
>> # I know that the below is not correct and not giving the desired results
>> file1_1 <- file1 %>% group_by(ATP.Group, Business.Event) %>%
>>   filter(Category != "EQ") %>% distinct(ATP.Group, Business.Event)
>> # for some reason, I have to convert to a data.frame to print the data
>> # correctly
>> file1_1 <- as.data.frame(file1_1)
>> file1_1
>>
>> *Data shown below*
>> |ATP Group|Business Event|Category|
>> |02 |A |AC |
>> |02 |A |AD |
>> |02 |A |EQ |
>> |ZM |A |AU |
>> |ZM |A |AV |
>> |ZM |A |AW |
>> |02 |B |AC |
>> |02 |B |AY |
>> |02 |B |EQ |
>>
>> --
>> Satish Vadlamani

--
Satish Vadlamani
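[Editor's note] Bill's answer builds the EQ and non-EQ sets separately and compares them. An equivalent one-pass approach is a group-level summary; the sketch below assumes the same column names as in the thread's example and flags, for each ATP Group / Business Event combination, whether an EQ row occurs in it:

```r
library(dplyr)

# Rebuild a small version of the example data from the thread
d <- read.table(sep = "|", header = TRUE, strip.white = TRUE,
                check.names = FALSE, text = "
|ATP Group|Business Event|Category|
|02 |A |AC |
|02 |A |EQ |
|ZM |A |AU |
|02 |B |EQ |
")[, 2:4]

# One row per ATP Group / Business Event, with a TRUE/FALSE EQ flag;
# any() tests the condition across the whole group, not row by row
d %>%
  group_by(`ATP Group`, `Business Event`) %>%
  summarise(has_EQ = any(Category == "EQ"))
```

Filtering this summary on has_EQ (or !has_EQ) then gives the two sets of combinations directly, without a setdiff() step.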
[R] How to group by and get distinct rows of grouped rows based on certain criteria
Hello All:
I would like to get your help on the following problem.

I have the following data and the first row is the header. Spaces are not important.

I want to find out the distinct combinations of ATP Group and Business Event (these are the field names you can see in the data below) that have the Category EQ (Category is the third field) and those that do not have the category EQ. In the example below, the combinations 02/A and 02/B have EQ and the combination ZM/A does not.

If I have a larger file, how do I get to this answer?

What did I try (with dplyr)?

# I know that the below is not correct and not giving the desired results
file1_1 <- file1 %>% group_by(ATP.Group, Business.Event) %>%
  filter(Category != "EQ") %>% distinct(ATP.Group, Business.Event)
# for some reason, I have to convert to a data.frame to print the data correctly
file1_1 <- as.data.frame(file1_1)
file1_1

*Data shown below*
|ATP Group|Business Event|Category|
|02 |A |AC |
|02 |A |AD |
|02 |A |EQ |
|ZM |A |AU |
|ZM |A |AV |
|ZM |A |AW |
|02 |B |AC |
|02 |B |AY |
|02 |B |EQ |

--
Satish Vadlamani
[R] question - how to subscribe to this list
Hello All:
I posted one question in the past and another today, and I hope to get the same excellent help that I got last time. My question is this: is there any way to subscribe to the forum so that I can see the questions and answers posted to r-help?

Thanks,
--
Satish Vadlamani
[R] what is the best way to process the following data?
Hello,

I have multiple text files with the format shown below (see the two files pasted below). Each file is a log of the steps that the system has processed, and for each step it shows the start time of that process step. For example, in the data below, the filter started at |06/16/2016|03:44:16.

How do I read this data so that Step 001 is one data frame, Step 002 is another, and so on? After I do this, I will compare the Step 001 times with and without the parallel process. The files pasted below, "no_parallel_process_SLS_4.txt" and "parallel_process_SLS_4.txt", should make clear what I am trying to do: I want to compare the times taken for each step with the parallel process against the times without it.

If there are better ways of performing this task than what I am thinking of, could you let me know? Thanks in advance.

Satish Vadlamani

>> parallel_process_file.txt
|06/16/2016|03:44:16|Step 001
|06/16/2016|03:44:16|Initialization
|06/16/2016|03:44:16|Filters
|06/16/2016|03:45:03|Split Items
|06/16/2016|03:46:20|Sort
|06/16/2016|03:46:43|Check
|06/16/2016|04:01:13|Save
|06/16/2016|04:04:35|Update preparation
|06/16/2016|04:04:36|Update comparison
|06/16/2016|04:04:38|Update
|06/16/2016|04:04:38|Update
|06/16/2016|04:06:01|Close
|06/16/2016|04:06:33|BOP processing for 7,960 items has finished
|06/16/2016|04:06:34|Step 002
|06/16/2016|04:06:35|Initialization
|06/16/2016|04:06:35|Filters
|06/16/2016|04:07:14|Split Items
|06/16/2016|04:08:57|Sort
|06/16/2016|04:09:06|Check
|06/16/2016|04:26:36|Save
|06/16/2016|04:39:29|Update preparation
|06/16/2016|04:39:31|Update comparison
|06/16/2016|04:39:43|Update
|06/16/2016|04:39:45|Update
|06/16/2016|04:44:28|Close
|06/16/2016|04:45:26|BOP processing for 8,420 items has finished
|06/16/2016|04:45:27|Step 003
|06/16/2016|04:45:27|Initialization
|06/16/2016|04:45:27|Filters
|06/16/2016|04:48:50|Split Items
|06/16/2016|04:55:15|Sort
|06/16/2016|04:55:40|Check
|06/16/2016|05:13:35|Save
|06/16/2016|05:17:34|Update preparation
|06/16/2016|05:17:34|Update comparison
|06/16/2016|05:17:36|Update
|06/16/2016|05:17:36|Update
|06/16/2016|05:19:29|Close
|06/16/2016|05:19:49|BOP processing for 8,876 items has finished
|06/16/2016|05:19:50|Step 004
|06/16/2016|05:19:50|Initialization
|06/16/2016|05:19:50|Filters
|06/16/2016|05:20:43|Split Items
|06/16/2016|05:22:14|Sort
|06/16/2016|05:22:29|Check
|06/16/2016|05:37:27|Save
|06/16/2016|05:38:43|Update preparation
|06/16/2016|05:38:44|Update comparison
|06/16/2016|05:38:45|Update
|06/16/2016|05:38:45|Update
|06/16/2016|05:39:09|Close
|06/16/2016|05:39:19|BOP processing for 5,391 items has finished
|06/16/2016|05:39:20|Step 005
|06/16/2016|05:39:20|Initialization
|06/16/2016|05:39:20|Filters
|06/16/2016|05:39:57|Split Items
|06/16/2016|05:40:21|Sort
|06/16/2016|05:40:24|Check
|06/16/2016|05:46:01|Save
|06/16/2016|05:46:54|Update preparation
|06/16/2016|05:46:54|Update comparison
|06/16/2016|05:46:54|Update
|06/16/2016|05:46:55|Update
|06/16/2016|05:47:24|Close
|06/16/2016|05:47:31|BOP processing for 3,016 items has finished
|06/16/2016|05:47:32|Step 006
|06/16/2016|05:47:32|Initialization
|06/16/2016|05:47:32|Filters
|06/16/2016|05:47:32|Update preparation
|06/16/2016|05:47:32|Update comparison
|06/16/2016|05:47:32|Update
|06/16/2016|05:47:32|Close
|06/16/2016|05:47:33|BOP processing for 0 items has finished
|06/16/2016|05:47:33|Step 007
|06/16/2016|05:47:33|Initialization
|06/16/2016|05:47:33|Filters
|06/16/2016|05:47:34|Split Items
|06/16/2016|05:47:34|Sort
|06/16/2016|05:47:34|Check
|06/16/2016|05:47:37|Save
|06/16/2016|05:47:37|Update preparation
|06/16/2016|05:47:37|Update comparison
|06/16/2016|05:47:37|Update
|06/16/2016|05:47:37|Update
|06/16/2016|05:47:37|Close
|06/16/2016|05:47:37|BOP processing for 9 items has finished
|06/16/2016|05:47:37|Step 008
|06/16/2016|05:47:37|Initialization
|06/16/2016|05:47:37|Filters
|06/16/2016|05:47:38|Update preparation
|06/16/2016|05:47:38|Update comparison
|06/16/2016|05:47:38|Update
|06/16/2016|05:47:38|Close
|06/16/2016|05:47:38|BOP processing for 0 items has finished

>> no_parallel_process_file.txt
|06/15/2016|22:52:46|Step 001
|06/15/2016|22:52:46|Initialization
|06/15/2016|22:52:46|Filters
|06/15/2016|22:54:21|Split Items
|06/15/2016|22:55:10|Sort
|06/15/2016|22:55:15|Check
|06/15/2016|23:04:43|Save
|06/15/2016|23:06:38|Update preparation
|06/15/2016|23:06:38|Update comparison
|06/15/2016|23:06:39|Update
|06/15/2016|23:06:39|Update
|06/15/2016|23:12:04|Close
|06/15/2016|23:13:16|BOP processing for 7,942 items has finished
|06/15/2016|23:13:17|Step 002
|06/15/2016|23:13:17|Initialization
|06/15/2016|23:13:17|Filters
|06/15/2016|23:16:27|Split Items
|06/15/2016|23:20:18|Sort
|06/15/2016|23:20:34|Check
|06/16/2016|00:08:08|Save
|06/16/2016|00:26:19|Update preparation
|06/16/2016|00:26:20|Update comparison
|06/16/2016|00:26:30|Update
|06/16/2016|00:26:31|Update
|06/16/2016|00:42:31|Close
|06/16/2016|0
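[Editor's note] The split-by-step idea in the question can be sketched in base R. Assumptions: the log consists of pipe-delimited date|time|event lines exactly as pasted above, and the filename is the one used in the post:

```r
# Read the raw log lines, then parse the pipe-delimited fields.
log_lines <- readLines("parallel_process_file.txt")
parts <- read.table(text = log_lines, sep = "|", strip.white = TRUE,
                    stringsAsFactors = FALSE)[, 2:4]
names(parts) <- c("date", "time", "event")

# Each "Step NNN" line starts a new block; cumsum() turns those
# markers into a running step id for every subsequent line.
parts$step_id <- cumsum(grepl("^Step", parts$event))

# One data frame per step, as asked for in the post.
steps <- split(parts, parts$step_id)

# Proper timestamps, for elapsed-time comparisons between the two runs.
parts$when <- as.POSIXct(paste(parts$date, parts$time),
                         format = "%m/%d/%Y %H:%M:%S")
```

With both files parsed this way, the per-step start times (or differences of the "when" column) can be merged by step_id to compare the parallel and non-parallel runs.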
Re: [R] How to form groups for this specific problem?
Jean:
Wow, thank you so much for this. I will read up on igraph and then see if this is going to work for me for the larger dataset. Thanks for the wonderful snippet of code you wrote.

Basically, the requirement is this: TLA1 (Top Level Assembly) and its components should belong to the same group. If a component also belongs to a different TLA (say TLA2), then that TLA2 and all of its components should belong to the same group as TLA1.

Are these types of questions appropriate for this group?

Thanks,
Satish

On Mar 28, 2016 9:10 AM, "Adams, Jean" wrote:

> Satish,
>
> If you rearrange your data into a network of nodes and edges, you can use
> the igraph package to identify disconnected (mutually exclusive) groups.
>
> # example data
> df <- data.frame(
>   Component = c("C1", "C2", "C1", "C3", "C4", "C5"),
>   TLA = c("TLA1", "TLA1", "TLA2", "TLA2", "TLA3", "TLA3")
> )
>
> # characterize data as a network of nodes and edges
> nodes <- levels(unlist(df))
> edges <- apply(df, 2, match, nodes)
>
> # use the igraph package to identify disconnected groups
> library(igraph)
> g <- graph(edges)
> ngroup <- clusters(g)$membership
> df$Group <- ngroup[match(df$Component, nodes)]
> df
>
>   Component  TLA Group
> 1        C1 TLA1     1
> 2        C2 TLA1     1
> 3        C1 TLA2     1
> 4        C3 TLA2     1
> 5        C4 TLA3     2
> 6        C5 TLA3     2
>
> Jean
>
> On Sun, Mar 27, 2016 at 7:56 PM, Satish Vadlamani wrote:
>
>> Hello All:
>> I would like to get some help with the following problem and understand
>> how this can be done in R efficiently. The header is given in the data
>> frame.
>>
>> *Component, TLA*
>> C1, TLA1
>> C2, TLA1
>> C1, TLA2
>> C3, TLA2
>> C4, TLA3
>> C5, TLA3
>>
>> Notice that C1 is a component of TLA1 and TLA2.
>>
>> I would like to form groups of mutually exclusive subsets and create a
>> new column called Group for this subset.
>> For the above data, the subsets and the new Group column values will be
>> like so:
>>
>> *Component, TLA, Group*
>> C1, TLA1, 1
>> C2, TLA1, 1
>> C1, TLA2, 1
>> C3, TLA2, 1
>> C4, TLA3, 2
>> C5, TLA3, 2
>>
>> I would appreciate any help on this. I could have looped through the
>> observations and tried some logic, but I have not tried that yet.
>>
>> --
>> Satish Vadlamani
[R] How to form groups for this specific problem?
Hello All:
I would like to get some help with the following problem and understand how this can be done in R efficiently. The header is given in the data frame.

*Component, TLA*
C1, TLA1
C2, TLA1
C1, TLA2
C3, TLA2
C4, TLA3
C5, TLA3

Notice that C1 is a component of both TLA1 and TLA2.

I would like to form groups of mutually exclusive subsets and create a new column called Group for each subset. For the above data, the subsets and the new Group column values will be like so:

*Component, TLA, Group*
C1, TLA1, 1
C2, TLA1, 1
C1, TLA2, 1
C3, TLA2, 1
C4, TLA3, 2
C5, TLA3, 2

I would appreciate any help on this. I could have looped through the observations and tried some logic, but I have not tried that yet.

--
Satish Vadlamani
Re: [R] Reading large files
Matthew:
If it is going to help, here is the explanation. I have an end state in mind; it is given below under the "End State" header. In order to get there, I need to start somewhere, right? I started with an 850 MB file and could not load it in what I think is a reasonable time (I waited for an hour).

There are references to 64-bit. How will that help? This is a 4 GB RAM machine, and there is no paging activity when loading the 850 MB file.

I have seen other threads on the same types of questions, but I did not see any clear-cut answers or errors that I could have been making in the process. If I am missing something, please let me know. Thanks.

Satish

End State
> Satish wrote: "at one time I will need to load say 15GB into R"

-----
Satish Vadlamani
--
View this message in context: http://n4.nabble.com/Reading-large-files-tp1469691p1470667.html
Sent from the R help mailing list archive at Nabble.com.
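[Editor's note] When the file is larger than available RAM, one workable pattern is to stream it in chunks and keep only the rows or columns that matter, rather than loading the whole 850 MB (or, eventually, 15 GB) at once. A base-R sketch, with a hypothetical filename, chunk size, and filter pattern:

```r
# Open a connection so successive readLines() calls continue where
# the previous one stopped, instead of re-reading from the start.
con <- file("big_file.txt", open = "r")
header <- readLines(con, n = 1)   # keep the header line separately

kept <- list()
repeat {
  chunk <- readLines(con, n = 100000)   # read 100k lines at a time
  if (length(chunk) == 0) break         # end of file reached
  # keep only the rows of interest from this chunk (pattern is hypothetical)
  kept[[length(kept) + 1]] <- grep("pattern", chunk, value = TRUE)
}
close(con)

result <- unlist(kept)   # the filtered subset, now small enough for RAM
```

The peak memory use is then bounded by one chunk plus the retained subset, which is what makes the 15 GB end state feasible on a 4 GB machine, provided the analysis can be expressed over a filtered or aggregated subset.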
Re: [R] Reading large files
Folks:
Can anyone throw some light on this? Thanks.

Satish

-----
Satish Vadlamani
--
View this message in context: http://n4.nabble.com/Reading-large-files-tp1469691p1470169.html