Re: [R] how to delete specific rows in a data frame where the first column matches any string from a list
Interesting. Thanks.

On Sat, 2009-02-07 at 02:36 +0100, Wacek Kusnierczyk wrote:
> yes, you can do this with sed. [...]

-- This is the price and the promise of citizenship.
-- Barack Obama, 44th President of the United States __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] how to delete specific rows in a data frame where the first column matches any string from a list
Hi, I'm new to the mailing list, but I would appreciate it if you could help me with this: I have a big matrix from which I need to delete specific rows. The rows to delete are those whose second entry matches any string in a list (stored in another file with just one column). Thank you so much! Laura
Re: [R] how to delete specific rows in a data frame where the first column matches any string from a list
here's one way to do it, illustrated with dummy data:

    # dummy character matrix: 10 rows, 2 columns of random 20-letter strings
    data = matrix(replicate(20, paste(sample(letters, 20), collapse="")), ncol=2)
    # drop the rows whose second column matches 'a'
    data[-grep('a', data[,2]),]

this will also work if your data is actually a data frame:

    data = as.data.frame(data)
    data[-grep('a', data[,2]),]

note, due to a known issue with grep, this won't work correctly if *no* rows match the pattern: grep then returns integer(0), and negating that still selects nothing:

    data[-grep('1', data[,2]),]  # should return all of data, but returns an empty matrix

with the upcoming version of r, grep will have an additional argument which will make this problem easy to fix:

    data[grep('a', data[,2], invert=TRUE),]

vQ
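A hedged aside on the empty-match pitfall in this message: in current versions of R, one way to sidestep it is logical indexing with grepl() instead of negative indexing with grep(). The sketch below rebuilds the same kind of dummy matrix; the seed and the variable names `keep`, `kept` and `lost` are only illustrative.

```r
# Dummy 10 x 2 character matrix, built as in the message above
set.seed(1)  # arbitrary seed, only to make the sketch repeatable
data <- matrix(replicate(20, paste(sample(letters, 20), collapse = "")), ncol = 2)

# grepl() returns one TRUE/FALSE per row, so negating it is safe
# even when nothing matches (unlike -grep(), which yields integer(0))
keep <- !grepl("1", data[, 2])       # no string contains "1", so every entry is TRUE
kept <- data[keep, , drop = FALSE]   # all 10 rows survive

# The negative-index version silently drops everything in this case
lost <- data[-grep("1", data[, 2]), , drop = FALSE]
```

`drop = FALSE` is there so that a result with a single row stays a matrix rather than collapsing to a vector.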
Re: [R] how to delete specific rows in a data frame where the first column matches any string from a list
Thank you. I think grep would do it, but the list of expressions I need to match is too long, so they are stored in a file. So the question is how I can tell R to read the expressions I want to match from that file. Thank you again for your help. Laura
Re: [R] how to delete specific rows in a data frame where the first column matches any string from a list
Laura Rodriguez Murillo wrote:
> Thank you. I think grep would do it, but the list of expressions I need to match is too long so they are stored in a file.

what does 'too long' mean?

> So the question would be how I can tell R to look into that file to look for the expressions that I want to match.

i guess you may still successfully use r for this, but to me it sounds like a perfect job for perl. let me know if you need more help.

note, the code in my earlier message should use 'data[,2]', not 'd[,2]' (or 'd' instead of 'data' throughout). sorry for the typo. mark, thanks for pointing this out -- the more obvious the mistake, the less visible ;)

vQ
Re: [R] how to delete specific rows in a data frame where the first column matches any string from a list
yep, it definitely sounds like a job for perl, but I don't know perl (unfortunately). I'm still stuck with this, so I'm giving more details in case it helps: I have file A with 382 columns and 30 rows. There are rows where only the entry in the first column is duplicated in other rows. In these cases, I need to delete the entire row. I also have a file B (one column and around 28 rows) with a list of the entries that are repeated. So I was trying to look for the rows that match and get rid of each one entirely. Thank you! Laura
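The setup described here (file A with the data, file B listing the first-column entries to drop) might be handled in R roughly as follows. This is only a sketch under stated assumptions: the file names, the tab separator and the absence of a header row are guesses, and two small temporary files stand in for the real ones. It uses exact matching with %in% rather than grep, which seems to fit a list of literal entries better than regular expressions do.

```r
# Stand-ins for the real files (hypothetical names and contents)
file_a <- tempfile(fileext = ".txt")
file_b <- tempfile(fileext = ".txt")
writeLines(c("id1\t0.1\t0.2", "id2\t0.3\t0.4", "id3\t0.5\t0.6", "id2\t0.7\t0.8"), file_a)
writeLines("id2", file_b)

# Read the data (assumes tab-separated fields and no header row)
a <- read.table(file_a, sep = "\t", header = FALSE, stringsAsFactors = FALSE)

# Read the entries to remove, one per line
drop_ids <- readLines(file_b)

# Keep only the rows whose first column is NOT in the list
filtered <- a[!(a[, 1] %in% drop_ids), , drop = FALSE]

# Write the result back out in the same layout
out_file <- tempfile(fileext = ".txt")
write.table(filtered, file = out_file, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)
```

With real data, `file_a` and `file_b` would simply be the paths to the two files, and `a[, 1]` would become `a[, 2]` if the key sits in the second column instead.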
Re: [R] how to delete specific rows in a data frame where the first column matches any string from a list
I regularly deal with a similar pattern at work. People send me these big long .csv files and I have to run them through some pattern analysis to decide which rows I keep and which rows I kill off.

As others have mentioned, Perl is a good candidate for this task. Another option would be a quick SQL query. It should be a snap to pull this into something like Access or OOo Base ... or better yet, a real database like Postgres, MySQL, etc. In case you aren't too familiar with SQL, this query could be done by deleting the rows using a self join (syntax varies by product). But if the pattern is as simple as it sounds and/or this is a one-time job, using SQL is overkill for the situation.

I often use sed in places where Perl is overkill, but I can't think of any way to match from row to row with sed. If anyone knows how to do this with sed, it would (probably) be easier than trying to learn how to use Perl. And I would like to know how to do this with sed too.

On Fri, 2009-02-06 at 16:04 -0500, Laura Rodriguez Murillo wrote:
> yep, it definitely sounds like a work for perl, but I don't know perl (unfortunately). [...]
Re: [R] how to delete specific rows in a data frame where the first column matches any string from a list
Hi Laura, You might want to read the "R Data Import/Export" manual on the CRAN web page http://cran.r-project.org/ Otherwise, have a look at ?read.table. Sebastien
Re: [R] how to delete specific rows in a data frame where the first column matches any string from a list
Thank you so much! I finally got it. Laura
Re: [R] how to delete specific rows in a data frame where the first column matches any string from a list
Andrew Choens wrote:
> I often use sed in places where Perl is over-kill, but I can't think of any way to match from row to row with sed. [...] And, I would like to know how to do this with sed too.

(this is actually off-topic, but since it may be interesting for the general public, i keep the response cc: to r-help)

yes, you can do this with sed. suppose you have two files, one (say, sample.txt) with the data to be filtered, record fields separated by, e.g., a tab character, and another (say, filter.txt) with patterns to be matched, one per line. a row from the first is passed to output only if its second field does not match any of the patterns -- this corresponds to (a simplified version of) the original problem. then, the following should do:

    sed "$(sed 's/^/\/^[^\\t]\\+\\t/; s/$/\/d/' filter.txt)" sample.txt > filtered-sample.txt

(unless the patterns contain characters that interfere with the shell or sed's syntax, in which case they'd have to be appropriately escaped.) the inner sed turns each pattern into a delete command of the form /^[^\t]\+\tPATTERN/d, and the outer sed runs the resulting script on sample.txt.

vQ
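For readers who would rather stay in R, roughly the same filter can be sketched there. This is an illustration, not part of the original reply: the vectors `lines` and `patterns` are hypothetical stand-ins for the contents of sample.txt and filter.txt, and, as with the sed version, patterns containing regular-expression metacharacters would need escaping.

```r
# Hypothetical stand-ins for the contents of sample.txt and filter.txt
lines <- c("r1\tapple\tx", "r2\tbanana\ty", "r3\tcherry\tz")
patterns <- c("app", "cher")

# Pull out the second tab-separated field of every line
second <- vapply(strsplit(lines, "\t", fixed = TRUE),
                 function(fields) fields[2], character(1))

# One alternation regex, anchored at the start of the field like the sed script
rx <- paste0("^(", paste(patterns, collapse = "|"), ")")

# Keep the lines whose second field matches none of the patterns
filtered <- lines[!grepl(rx, second)]
```

With real files, `lines` and `patterns` would come from `readLines("sample.txt")` and `readLines("filter.txt")`, and `writeLines(filtered, "filtered-sample.txt")` would save the result.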