Re: extract string from filename
On Jan 12, 2006, at 8:25 PM, Ben Scott wrote:

> It sounds like you could use a tutorial on Unix text processing and command line tools, specifically, one which addresses pipes and redirection, as well as the standard text tools (grep, cut, sed, awk, etc.). While Paul's recommendation about the O'Reilly regular expressions book is valid, I suspect it might be a little too focused on regex's and not cover some of the *other* elements you seem to be needing.

Gee, I wonder if that would be a good topic for a meeting (g). Bruce Dawson and David Berube did a presentation on regular expressions that helped me grasp what they were and why I'd want to know more. Bought the Reg Exp book on my next visit to SoftPro (s). A similar kind of presentation could explain the place of sed, grep, awk, pipes, redirection, tee and so forth.

> It's been forever for me, but I seem to recall that _Unix Power Tools_, also published by O'Reilly, covers all of the above and much, much more. If others on this list second my suggestion, you might want to obtain a copy. Alternatively, maybe list members can suggest alternatives?

Re: UNIX Power Tools. Third time I've heard that recommended. Guess I'll add that to my wish list.

Jerry Peek (http://www.oreillynet.com/pub/au/28 links a number of articles and book extracts), one of the original authors of Unix Power Tools, has been running a series in Linux Magazine for a while now on working from the command line, including the inscrutable 2>&1 and other arcana. Linux Magazine is online at http://www.linux-mag.com/ and posts its issues sixty days after publication at http://www.linux-mag.com/backissues/.

Ben's other links are quite useful, too. The Answers Are Out There. The challenge is finding the answer you need now.

___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss
Re: extract string from filename
On 1/12/06, Ben Scott wrote:

> On 1/12/06, Zhao Peng wrote:
>> I'm back, with another extract string question. //grin
>
> It sounds like you could use a tutorial on Unix text processing and command line tools, specifically, one which addresses pipes and redirection, as well as the standard text tools (grep, cut, sed, awk, etc.). While Paul's recommendation about the O'Reilly regular expressions book is valid, I suspect it might be a little too focused on regex's and not cover some of the *other* elements you seem to be needing.
>
> It's been forever for me, but I seem to recall that _Unix Power Tools_, also published by O'Reilly, covers all of the above and much, much more. If others on this list second my suggestion, you might want to obtain a copy. Alternatively, maybe list members can suggest alternatives?

Unix Shell Programming by Kochan and Wood is a classic on shell programming.
Portable Shell Programming by Blinn.
The Awk Programming Language by Aho, Weinberger and Kernighan.

Power Tools is excellent, but it is more of a tip book in my mind. Not as much as the Hacks series, though.

> There are also a number of free guides at the Linux Documentation Project. See: http://www.tldp.org/guides.html
> Look for anything mentioning bash (the Bourne-again shell) or scripting. I can't speak as to how good they are, but you can't beat the price.

Some of them are very good. And the examples work.

--
A strong conviction that something must be done is the parent of many bad measures.
- Daniel Webster
Re: extract string from filename
On Jan 12, 2006, at 19:40, Zhao Peng wrote:

> I also downloaded an e-book called Learning Perl (O'Reilly, 4th Edition), and had a quick look thru its Table of Contents, but did not find any chapter which looks likely addressing any issue related to my question.

Good start. Read these sections: 'A Stroll Through Perl', 'The Split and Join Functions', 'Lists and Arrays', 'Hashes', 'Directory Access', and 'File Manipulation'.

Your description is the outline of the algorithm. Take this script where I've filled in the requisite perl and figure out how it works:

    #!/usr/bin/perl -w
    use strict;                   # show stupid errors
    use warnings FATAL => 'all';  # don't let you get away with them

    #I have almost 1k small files within one folder. The only pattern of the file names is:
    my $dirname = shift;  # take the command line parameter as the directory name
    opendir DIRECTORY, $dirname;
    my @files = readdir(DIRECTORY);
    closedir DIRECTORY;

    #string1_string2_string3_string4.sas7bdat
    #Note:
    #1, string2 often repeat itself across each file name
    #2, All 4 strings contain no underscores.
    #3, 4 strings are separated by 3 underscores (as you can see)
    #4, The length of all 4 strings are not fixed.
    my (@part_2s);  # we'll keep the second parts here
    foreach my $file (@files) {
        next if (($file eq '.') or ($file eq '..'));  # the directory will contain . and .. which we don't want
        #My goal is to :
        #1, extract string2 from each file name
        my ($filename,$extension) = split('\.',$file);  # don't forget to escape the . since this is a regex
        my @strings = split('_',$filename);
        my $part_2 = $strings[1];  # remember, arrays in perl are zero-indexed
        push(@part_2s,$part_2);    # store the data we want on the end of the array
    }

    #2, keep only unique ones
    # perl trick using a hash to easily get unique items
    my (%temp_hash);
    foreach my $part (@part_2s) {
        $temp_hash{$part} = 1;
    }
    my @uniques = (keys %temp_hash);
    # and then sort them
    my @sorted = sort { $a cmp $b } (@uniques);  # cmp for string sorting

    #3, then output them to a .txt file. (one unique string2 per line)
    open OUTFILE, ">output.txt";
    foreach my $item (@sorted) {
        print OUTFILE $item . "\n";
    }
    close OUTFILE;

When you understand each line you'll be able to solve future similar problems easily. Note Kevin's perl solution is equally valid and probably faster, but you're not going to grok it until you exercise the perl part of your brain for a while.

-Bill

-----
Bill McGonigle, Owner           Work: 603.448.4440
BFC Computing, LLC              Home: 603.448.1668
                                Cell: 603.252.2606
http://www.bfccomputing.com/    Page: 603.442.1833
Blog: http://blog.bfccomputing.com/
VCard: http://bfccomputing.com/vcard/bill.vcf
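The hash-for-uniqueness idiom in the script above is not specific to Perl. As a rough sketch, assuming the file names arrive one per line on standard input, an awk associative array can play the role of %temp_hash. The function name unique_second_field and the sample names are illustrative only and do not come from the thread.

```shell
# Sketch only: unique_second_field is a hypothetical helper name.
unique_second_field() {
    # Split each input line on "_". seen[$2]++ evaluates to 0 the first
    # time a value appears, so !seen[$2]++ is true only for the first
    # occurrence. NF >= 2 skips lines that contain no underscore.
    awk -F_ 'NF >= 2 && !seen[$2]++ { print $2 }' | sort
}

# Sample input following the naming pattern described in the thread.
printf '%s\n' \
    abc_st_nh_num.sas7bdat \
    abc_st_vt_num.sas7bdat \
    abcd_region_South_num.sas7bdat |
    unique_second_field
```

The awk array removes duplicates before sort ever sees them, so sort only has to order the distinct values.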
Re: extract string from filename
Zhao Peng wrote:

> My goal is to :
> 1, extract string2 from each file name
> 2, then sort them and keep only unique ones
> 3, then output them to a .txt file. (one unique string2 per line)

It is really interesting how many ways there are to do things in *nix. My first reaction, if this is a one time event, is to just use vi:

    % ls *.sas7bdat > string2.txt
    % vi string2.txt
    :%s/^[^_]*_//
    :%s/_.*$//
    :%!sort -u
    :wq

The first regex removes the first underscore and everything in front of it, while the second regex removes what is now the first underscore (was the second originally) and everything after it. And then I do the unique sort right in vi.

Larry
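The two substitutions in the vi session can also be run without an editor. The following is a minimal sketch, assuming the names are fed in on standard input; the function name strip_to_second_field is made up for illustration.

```shell
# Sketch only: strip_to_second_field is a hypothetical helper name.
strip_to_second_field() {
    # First expression: delete everything through the first underscore.
    # Second expression: delete from the next underscore to end of line.
    sed -e 's/^[^_]*_//' -e 's/_.*$//' | sort -u
}

printf '%s\n' \
    abc_st_nh_num.sas7bdat \
    abcd_region_South_num.sas7bdat \
    abc_st_vt_num.sas7bdat |
    strip_to_second_field
```

sed applies the expressions in order to each line, which mirrors running the two :%s commands one after the other.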
Re: extract string from filename
Kevin,

Thank you very much! I really appreciate it.

I like your find approach, it's simple and easy to understand. I'll also try to understand your perl approach when I get time to start learning it. (Hopefully it won't be un-fulfilled forever.)

I have one more question: Is it possible to number the extracted string2? Say, the output file contains the following list of extracted string2:

    st
    region
    local

Any idea about what command to use to number the list to make it look like below:

    1 st
    2 region
    3 local

Again, thank you for your help and time!

Zhao

Kevin D. Clark wrote:

> Zhao Peng writes:
>> I'm back, with another extract string question. //grin
>
>     find FOLDERNAME -name \*sas7bdat -print | sed 's/.*\///' | cut -d _ -f 2 | sort -u > somefile.txt
>
> or
>
>     perl -MFile::Find -e 'find(sub{$string2 = (split /_/)[1]; $seen{$string2}++; }, @ARGV); map { print "$_\n"; } keys(%seen)' FOLDERNAME
>
> (which looks more readable as:
>
>     perl -MFile::Find -e 'find(sub{
>                                  $string2 = (split /_/)[1];
>                                  $seen{$string2}++;
>                                },
>                                @ARGV);
>                           map { print "$_\n"; } keys(%seen)' \
>       FOLDERNAME > somefile.txt
> )
>
> Either of which solves the problem that you describe. (Actually, they solve more than the problem that you describe, since it wasn't apparent to me if you had any subdirectories here, but this is solved too.)
>
> (substitute FOLDERNAME with your directory's name)
>
> Honestly, the first solution I present is the way I would have solved this problem myself. Very fast this way.
>
> Regards,
> --kevin
Re: extract string from filename
On Fri, Jan 13, 2006 at 11:40:26AM -0500, Zhao Peng wrote:

> Kevin, Thank you very much! I really appreciate it. I like your find approach, it's simple and easy to understand. I'll also try to understand your perl approach, when I got time to start learning it. (Hopefully it won't be un-fulfilled forever)
>
> I have one more question: Is it possible to number the extracted string2? Say, the output file contains the following list of extracted string2:
>
>     st
>     region
>     local
>
> Any idea about what command to use to number the list to make it look like below:
>
>     1 st
>     2 region
>     3 local

Pipe the output into

    pr -n -T

This is not pr's intended use, but it will work. The -n option means put numbers on the lines; the -T option means no page breaks. The -n option appears to be missing from the FC2 man pages.

--
Jeff Kinz, Emergent Research, Hudson, MA.
speech recognition software may have been used to create this e-mail

"The greatest dangers to liberty lurk in insidious encroachment by men of zeal, well-meaning but without understanding." - Brandeis

"To think contrary to one's era is heroism. But to speak against it is madness." -- Eugene Ionesco
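The tools suggested in this thread for numbering (pr -n, cat -n) pad the number and separate it from the text with a tab. If the exact "1 st" layout from the question matters, one possible sketch uses awk's built-in line counter instead. The function name number_lines is hypothetical.

```shell
# Sketch only: number_lines is a hypothetical helper name.
number_lines() {
    # NR is awk's current record (line) number; the comma in print
    # joins the two values with a single space (the default OFS).
    awk '{ print NR, $0 }'
}

printf '%s\n' st region local | number_lines
# prints:
# 1 st
# 2 region
# 3 local
```

Appended to any of the pipelines in the thread, this takes the place of the final cat -n or pr -n -T stage.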
Re: extract string from filename
On 1/13/06, Zhao Peng wrote:

> Is it possible to number the extracted string2?

    find -name \*sas7bdat -printf '%f\n' | cut -d _ -f 2 | sort | uniq | cat -n

Run that pipeline in the directory you are interested in.

The find(1) command finds files, based on their name or other filesystem attributes. The -name \*sas7bdat part finds files with file names which match the pattern. The backslash escapes the star, to keep the shell from trying to interpret it, so find gets the star instead. The -printf '%f\n' part has find output just the file name, not the path.

cut(1) is used to split input strings, as you know. -d _ splits into fields, based on underscores. -f 2 outputs the second field only, one per line.

sort(1) sorts, and uniq(1) eliminates duplicate lines.

cat -n numbers the output.

-- Ben "Pay attention, there's gonna be a quiz next week" Scott
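As a sketch of how the pipeline above might be wrapped for reuse, the function below takes the directory as an argument instead of relying on the current directory. Two things here are assumptions rather than part of the original message: the name list_string2, and the use of -exec basename in place of -printf '%f\n', which is a GNU find extension.

```shell
# Sketch only: list_string2 is a hypothetical wrapper name.
list_string2() {
    dir=${1:-.}   # directory to search; defaults to the current one
    # -exec basename prints just the file name, standing in for the
    # GNU-specific -printf '%f\n' used in the message above.
    find "$dir" -type f -name '*sas7bdat' -exec basename {} \; |
        cut -d _ -f 2 | sort | uniq | cat -n
}

# Demonstration in a throwaway directory with made-up file names.
demo=$(mktemp -d)
touch "$demo/abc_st_nh_num.sas7bdat" \
      "$demo/abc_st_vt_num.sas7bdat" \
      "$demo/abcd_region_South_num.sas7bdat" \
      "$demo/notes.txt"
list_string2 "$demo"
rm -rf "$demo"
```

Quoting the pattern as '*sas7bdat' has the same effect as the backslash in \*sas7bdat: the shell passes the star through to find untouched.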
Re: extract string from filename
On 1/13/06, Ben Scott wrote:

> On 1/13/06, Zhao Peng wrote:
>> Is it possible to number the extracted string2?
>
>     find -name \*sas7bdat -printf '%f\n' | cut -d _ -f 2 | sort | uniq | cat -n

I forgot to mention: If the *only* files in that directory are the ones with the interesting file names, you can just use this:

    ls | cut -d _ -f 2 | sort | uniq | cat -n

-- Ben "I would flunk the quiz" Scott
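When the directory also holds other files, a shell glob can do the filtering that plain ls does not. The sketch below is one way to do that, using POSIX parameter expansion to pull out the second field without calling cut. The function names second_field and list_string2_glob are hypothetical.

```shell
# Sketch only: both function names are hypothetical.
second_field() {
    rest=${1#*_}                  # drop the shortest prefix ending in "_"
    printf '%s\n' "${rest%%_*}"   # drop everything from the next "_" on
}

list_string2_glob() {
    # $1 is the directory to scan; only *.sas7bdat names are considered.
    for path in "$1"/*.sas7bdat; do
        [ -e "$path" ] || continue     # the glob matched nothing
        second_field "${path##*/}"     # strip the directory part first
    done | sort -u | cat -n
}

# Demonstration in a throwaway directory with made-up file names.
demo=$(mktemp -d)
touch "$demo/abc_st_nh_num.sas7bdat" \
      "$demo/abcd_region_South_num.sas7bdat" \
      "$demo/README"
list_string2_glob "$demo"
rm -rf "$demo"
```

For roughly a thousand files either approach is fast; the glob version mainly avoids picking up unrelated names such as README.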
Re: extract string from filename
cat -n will number output lines.
Re: extract string from filename
Zhao Peng wrote:

> string1_string2_string3_string4.sas7bdat
>
> abc_st_nh_num.sas7bdat
> abc_st_vt_num.sas7bdat
> abc_st_ma_num.sas7bdat
> abcd_region_NewEngland_num.sas7bdat
> abcd_region_South_num.sas7bdat
>
> My goal is to :
> 1, extract string2 from each file name
> 2, then sort them and keep only unique ones
> 3, then output them to a .txt file. (one unique string2 per line)

Solution #1:

    ls -1 *sas7bdat|awk -F_ '{print $2}'|sort -fu|cat -n > output.txt

Take output of ls, 1 file per line (ls -1), only files ending with sas7bdat.
Feed into awk, splitting on _, print the 2nd field.
Sort ignoring case, eliminating duplicates (sort options: f folds case, u keeps only uniques).
Number the lines (cat -n).
Put output in file named output.txt.

Solution #2:

    ls -1 *sas7bdat|sed 's/^\([a-zA-Z0-9]*_\)\([a-zA-Z0-9]*\)_.*$/\2/'|sort -fu|cat -n > output.txt

Use sed (stream editor) to break up filenames into atoms separated by _, and output the 2nd one (the \2).

Regular expressions (regex) can be very handy. ^ matches beginning of string, [a-zA-Z0-9]*_ matches a letter/number string ending with _, and the backslashed parentheses group the patterns, so the 2nd one can be extracted.

There are many solutions to the problem, as you can see.

--
Dan Jenkins
Rastech Inc., Bedford, NH, USA --- 1-603-206-9951
*** Technical Support Excellence for over a quarter century
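One property of Solution #2 worth knowing: when a line does not match the pattern, sed's s command leaves it unchanged and it still gets printed. A possible variation, sketched below, adds -n and the p flag so that only lines where the substitution succeeded are printed. The function name extract_matching and the sample input are illustrative.

```shell
# Sketch only: extract_matching is a hypothetical helper name.
# Compared with Solution #2, the only changes are -n (suppress the
# default printing) and the trailing p flag (print on successful
# substitution).
extract_matching() {
    sed -n 's/^\([a-zA-Z0-9]*_\)\([a-zA-Z0-9]*\)_.*$/\2/p' | sort -fu
}

# README.txt has no underscores, so it is dropped rather than passed
# through; "st" and "ST" are treated as duplicates by sort -f.
printf '%s\n' \
    abc_st_nh_num.sas7bdat \
    README.txt \
    abc_ST_vt_num.sas7bdat |
    extract_matching
```

Whether dropping non-matching names silently is desirable depends on the directory; printing them unchanged, as the original does, at least makes stray files visible in the output.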
Re: extract string from filename
Kevin D. Clark writes:

> Zhao Peng writes:
>> I'm back, with another extract string question. //grin
>
>     find FOLDERNAME -name \*sas7bdat -print | sed 's/.*\///' | cut -d _ -f 2 | sort -u > somefile.txt

Or, to simplify this:

    find ./ -name \*sas7bdat | awk -F_ '{print $2}' |sort -u

    ls *sas7bdat | perl -F_ -ane 'print "$F[1]\n";'|sort -u

    perl -e 'opendir(DIR,"."); map { if (/sas7bdat$/) { $k = (split(/_/,$_))[1]; $f{$k} =1; } } readdir(DIR); map { print "$_\n";}sort keys %f;'

That last one might be a little better formatted like:

    perl -e 'opendir(DIR,".");
             map {
                   if (/sas7bdat$/) {
                     $k = (split(/_/,$_))[1];
                     $f{$k}=1;
                   }
                 } readdir(DIR);
             map { print "$_\n";} sort keys %f;'

It should be rather obvious that your best bet for quick one-liners for this type of thing is probably to stick with standard UNIX tools like sort, cut, sed, awk, etc. Perl is great for text manipulation, but as you can see, none of the perl one-liners has been nearly as concise as the shell variants. If speed or process overhead matters, then maybe perl is better. Of course, for such a small data set as you've given, the perl versions are both harder and longer to type.

hth.

--
Seeya,
Paul
Re: extract string from filename
Tom Buskey writes:

> Unix Shell Programming by Kochan and Wood is a classic on shell programming
> Portable Shell Programming by Blinn
> The Awk Programming Language by Aho, Weinberger and Kernighan

I'm also a big fan of Kernighan and Pike's The UNIX Programming Environment. When I first saw this book I thought it was going to be more of a C programming book explaining things like linking and compiling under UNIX. However, it turned out to be simply a great book on how to get around the shell and do a variety of things in the UNIX environment. It is so named because, as we've all seen here, the shell is *programmable* :)

And, yet another plug for my all-time favorite UNIX book, The UNIX Philosophy by Mike Gancarz, which has recently been updated with a second edition (which I have not yet read), Linux and the Unix Philosophy. This book does a fantastic job of explaining exactly *why* UNIX is such a great environment, and why other environments just can't compete when what you need is raw power and flexibility.

--
Seeya,
Paul
extract string from filename
Hi all,

I'm back, with another extract string question. //grin

I have almost 1k small files within one folder. The only pattern of the file names is:

    string1_string2_string3_string4.sas7bdat

Note:
1, string2 often repeats across file names. For example:

    abc_st_nh_num.sas7bdat
    abc_st_vt_num.sas7bdat
    abc_st_ma_num.sas7bdat
    abcd_region_NewEngland_num.sas7bdat
    abcd_region_South_num.sas7bdat

2, All 4 strings contain no underscores.
3, The 4 strings are separated by 3 underscores (as you can see).
4, The lengths of the 4 strings are not fixed.

My goal is to:
1, extract string2 from each file name
2, then sort them and keep only unique ones
3, then output them to a .txt file (one unique string2 per line)

I tried to use the cut command, but can't even figure out how to use the filenames as input. Anyone care to offer me a hint?

I also downloaded an e-book called Learning Perl (O'Reilly, 4th Edition), and had a quick look thru its Table of Contents, but did not find any chapter which looks likely to address any issue related to my question.

Thank you very much!

Zhao
Re: extract string from filename
On 1/12/06, Zhao Peng wrote:

> I'm back, with another extract string question. //grin

It sounds like you could use a tutorial on Unix text processing and command line tools, specifically, one which addresses pipes and redirection, as well as the standard text tools (grep, cut, sed, awk, etc.). While Paul's recommendation about the O'Reilly regular expressions book is valid, I suspect it might be a little too focused on regex's and not cover some of the *other* elements you seem to be needing.

It's been forever for me, but I seem to recall that _Unix Power Tools_, also published by O'Reilly, covers all of the above and much, much more. If others on this list second my suggestion, you might want to obtain a copy. Alternatively, maybe list members can suggest alternatives?

There are also a number of free guides at the Linux Documentation Project. See: http://www.tldp.org/guides.html

Look for anything mentioning bash (the Bourne-again shell) or scripting. I can't speak as to how good they are, but you can't beat the price.

Anyway, on to your question...

> I tried to use cut commands, but can't even figure out how to use the filenames as input. Anyone care to offer me a hint?

You'll want to pipe the output of ls to cut. This should get you started:

    ls -1 | cut -d _ -f 2

The -1 switch to ls(1) tells it to output a single column of file names. Some versions of ls do this automagically when using redirection, but it is best to be sure. The -d _ switch to cut(1) tells cut to split fields on the underscore. The -f 2 selects the second field.

See also: sort(1), uniq(1)

Hope this helps!

-- Ben "Unix plumber" Scott
Re: extract string from filename
On Thu, 2006-01-12 at 19:40 -0500, Zhao Peng wrote:

> For example:
> abc_st_nh_num.sas7bdat
> abc_st_vt_num.sas7bdat
> abc_st_ma_num.sas7bdat
> abcd_region_NewEngland_num.sas7bdat
> abcd_region_South_num.sas7bdat

You're not the only one learning here. I put these names into a file called str2-test-data:

    $ cut -d _ -f 2 str2-test-data | sort | uniq
    region
    st

I think that you could use:

    ls | cut -d _ -f 2 | sort | uniq > str2-results.txt

--
Lloyd Kvam
Venix Corp
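The sort stage in that pipeline is not optional: uniq only collapses duplicate lines that are adjacent. A small sketch with made-up sample names (following the thread's naming pattern) shows the difference.

```shell
# Sample data only; the two "st" names are deliberately not adjacent.
names='abc_st_nh_num.sas7bdat
abcd_region_South_num.sas7bdat
abc_st_vt_num.sas7bdat'

# Without sort, the two "st" lines are separated by "region",
# so uniq keeps both of them.
unsorted=$(printf '%s\n' "$names" | cut -d _ -f 2 | uniq)

# With sort first, the duplicates become adjacent and collapse.
sorted=$(printf '%s\n' "$names" | cut -d _ -f 2 | sort | uniq)

printf 'uniq alone:\n%s\n\nsort | uniq:\n%s\n' "$unsorted" "$sorted"
```

Other messages in the thread use sort -u, which combines the two steps; sort | uniq is still useful when uniq's own options, such as -c for counting occurrences, are wanted.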
Re: extract string from filename
Zhao Peng writes:

> I'm back, with another extract string question. //grin

    find FOLDERNAME -name \*sas7bdat -print | sed 's/.*\///' | cut -d _ -f 2 | sort -u > somefile.txt

or

    perl -MFile::Find -e 'find(sub{$string2 = (split /_/)[1]; $seen{$string2}++; }, @ARGV); map { print "$_\n"; } keys(%seen)' FOLDERNAME

(which looks more readable as:

    perl -MFile::Find -e 'find(sub{
                                 $string2 = (split /_/)[1];
                                 $seen{$string2}++;
                               },
                               @ARGV);
                          map { print "$_\n"; } keys(%seen)' \
      FOLDERNAME > somefile.txt
)

Either of which solves the problem that you describe. (Actually, they solve more than the problem that you describe, since it wasn't apparent to me if you had any subdirectories here, but this is solved too.)

(substitute FOLDERNAME with your directory's name)

Honestly, the first solution I present is the way I would have solved this problem myself. Very fast this way.

Regards,

--kevin
--
(There are also 228 babies named Unique during the 1990s alone, and 1 each of Uneek, Uneque, and Uneqqee.)
  -- _Freakonomics_, Steven D. Levitt and Stephen J. Dubner

[but no Unix folks named their kids uniq, apparently. --kevin]