Re: extract string from filename

2006-01-13 Thread Ted Roche

On Jan 12, 2006, at 8:25 PM, Ben Scott wrote:


>   It sounds like you could use a tutorial on Unix text processing and
> command line tools, specifically, one which addresses pipes and
> redirection, as well as the standard text tools (grep, cut, sed, awk,
> etc.).  While Paul's recommendation about the O'Reilly regular
> expressions book is valid, I suspect it might be a little too focused
> on regex's and not cover some of the *other* elements you seem to be
> needing.


Gee, I wonder if that would be a good topic for a meeting <g>.

Bruce Dawson and David Berube did a presentation on regular
expressions that helped me grasp what they were and why I'd want to
know more. Bought the Reg Exp book on my next visit to SoftPro <s>.


A similar kind of presentation that explained the place of sed, grep,
awk, pipes, redirection, tee and so forth would be just as useful.



>   It's been forever for me, but I seem to recall that _Unix Power
> Tools_, also published by O'Reilly, covers all of the above and much,
> much more.  If others on this list second my suggestion, you might
> want to obtain a copy.  Alternatively, maybe list members can suggest
> alternatives?


Re: UNIX Power Tools. Third time I've heard that recommended. Guess  
I'll add that to my wish list.


Jerry Peek (http://www.oreillynet.com/pub/au/28 - a number of
articles and book extracts linked here), one of the original authors
of Unix Power Tools, has been running a series in Linux Magazine for
a while now on working from the command line, including the
inscrutable 2>&1 and other arcana.


Linux Magazine is online at http://www.linux-mag.com/ and posts its
issues sixty days after publication at
http://www.linux-mag.com/backissues/.


Ben's other links are quite useful, too. The Answers Are Out There.  
The challenge is finding the answer you need now.

___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss


Re: extract string from filename

2006-01-13 Thread Tom Buskey
On 1/12/06, Ben Scott [EMAIL PROTECTED] wrote:
> On 1/12/06, Zhao Peng [EMAIL PROTECTED] wrote:
> > I'm back, with another extract string question. //grin
>
> It sounds like you could use a tutorial on Unix text processing and
> command line tools, specifically, one which addresses pipes and
> redirection, as well as the standard text tools (grep, cut, sed, awk,
> etc.).  While Paul's recommendation about the O'Reilly regular
> expressions book is valid, I suspect it might be a little too focused
> on regex's and not cover some of the *other* elements you seem to be
> needing.
>
> It's been forever for me, but I seem to recall that _Unix Power
> Tools_, also published by O'Reilly, covers all of the above and much,
> much more.  If others on this list second my suggestion, you might
> want to obtain a copy.  Alternatively, maybe list members can suggest
> alternatives?

Unix Shell Programming by Kochan and Wood is a classic on shell programming.
Portable Shell Programming by Blinn
The Awk Programming Language by Aho, Weinberger and Kernighan

Power Tools is excellent, but it is more of a tip book in my mind. Not as
much as the Hacks series, though.

> There are also a number of free guides at the Linux Documentation
> Project.  See:
>
> http://www.tldp.org/guides.html
>
> Look for anything mentioning bash (the Bourne-again shell) or
> scripting.  I can't speak as to how good they are, but you can't beat
> the price.

Some of them are very good. And the examples work.

-- 
A strong conviction that something must be done is the parent of many bad measures.
- Daniel Webster


Re: extract string from filename

2006-01-13 Thread Bill McGonigle

On Jan 12, 2006, at 19:40, Zhao Peng wrote:

> I also downloaded an e-book called Learning Perl (O'Reilly,
> 4th Edition), and had a quick look through its Table of Contents, but did
> not find any chapter that looks likely to address any issue related
> to my question.


Good start.  Read these sections: 'A Stroll Through Perl', 'The Split 
and Join Functions', 'Lists and Arrays', 'Hashes', 'Directory Access', 
and 'File Manipulation'.


Your description is the outline of the algorithm.  Take this script 
where I've filled in the requisite perl and figure out how it works:


#!/usr/bin/perl -w
use strict;                    # show stupid errors
use warnings FATAL => 'all';   # don't let you get away with them

# I have almost 1k small files within one folder. The only pattern of
# the file names is:
my $dirname = shift;   # take the command line parameter as the directory name

opendir(DIRECTORY, $dirname) or die "Cannot open directory $dirname: $!\n";
my @files = readdir(DIRECTORY);
closedir DIRECTORY;

# string1_string2_string3_string4.sas7bdat

# Note:
# 1, string2 often repeat itself across each file name
# 2, All 4 strings contain no underscores.
# 3, 4 strings are separated by 3 underscores (as you can see)
# 4, The length of all 4 strings are not fixed.

my (@part_2s);   # we'll keep the second parts here
foreach my $file (@files) {
    # the directory will contain . and .. which we don't want
    next if (($file eq '.') or ($file eq '..'));

    # My goal is to:
    # 1, extract string2 from each file name
    # don't forget to escape the . since this is a regex
    my ($filename, $extension) = split('\.', $file);

    my @strings = split('_', $filename);
    my $part_2 = $strings[1];         # remember, arrays in perl are zero-indexed
    next unless defined $part_2;      # skip names that have no underscore
    push(@part_2s, $part_2);          # store the data we want on the end of the array
}

# 2, keep only unique ones
# perl trick using a hash to easily get unique items
my (%temp_hash);
foreach my $part (@part_2s) {
    $temp_hash{$part} = 1;
}
my @uniques = (keys %temp_hash);

# and then sort them
my @sorted = sort { $a cmp $b } (@uniques);   # cmp for string sorting

# 3, then output them to a .txt file. (one unique string2 per line)
open(OUTFILE, '>', 'output.txt') or die "Cannot write output.txt: $!\n";
foreach my $item (@sorted) {
    print OUTFILE $item . "\n";
}
close OUTFILE;

When you understand each line you'll be able to solve future similar
problems easily.  Note Kevin's perl solution is equally valid and
probably faster, but you're not going to grok it until you exercise
the perl part of your brain for a while.
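
For comparison, the same three steps (extract, de-duplicate, sort) can be sketched in plain sh with parameter expansion. This is only an illustration, not part of Bill's script: the function name extract_string2 and the variable names are made up here, and the sample names are the ones from the original question.

```shell
#!/bin/sh
# Sample names from the original question; no real files are needed.
names='abc_st_nh_num.sas7bdat
abc_st_vt_num.sas7bdat
abc_st_ma_num.sas7bdat
abcd_region_NewEngland_num.sas7bdat
abcd_region_South_num.sas7bdat'

# Hypothetical helper: read one file name per line, print its second field.
extract_string2() {
    while IFS= read -r name; do
        base=${name%.*}                 # strip the extension
        rest=${base#*_}                 # drop string1 and the first underscore
        printf '%s\n' "${rest%%_*}"     # keep the text before the next underscore
    done
}

# sort -u does the "unique" and "sort" steps in one go.
unique_string2=$(printf '%s\n' "$names" | extract_string2 | sort -u)
printf '%s\n' "$unique_string2"
```

Redirecting the last printf to a file (for example with > output.txt) would cover step 3.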


-Bill
-
Bill McGonigle, Owner           Work: 603.448.4440
BFC Computing, LLC              Home: 603.448.1668
[EMAIL PROTECTED]               Cell: 603.252.2606
http://www.bfccomputing.com/    Page: 603.442.1833
Blog: http://blog.bfccomputing.com/
VCard: http://bfccomputing.com/vcard/bill.vcf



Re: extract string from filename

2006-01-13 Thread Larry Cook

Zhao Peng wrote:

> My goal is to :
> 1, extract string2 from each file name
> 2, then sort them and keep only unique ones
> 3, then output them to a .txt file. (one unique string2 per line)


It is really interesting how many ways there are to do things in *nix.  My 
first reaction, if this is a one time event, is to just use vi:


% ls *.sas7bdat > string2.txt
% vi string2.txt
:%s/^[^_]*_//
:%s/_.*$//
:%!sort -u
:wq

The first regex removes the first underscore and everything in front of it, 
while the second regex removes what is now the first underscore (was the 
second originally) and everything after it.  And then I do the unique sort 
right in vi.
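
The same two substitutions also work outside the editor. A minimal sketch, assuming sed and sort are available and using two of the sample names from the question (the variable names are illustrative only):

```shell
#!/bin/sh
names='abc_st_nh_num.sas7bdat
abcd_region_South_num.sas7bdat'

# First expression: remove everything up to and including the first underscore.
# Second expression: remove the (new) first underscore and everything after it.
result=$(printf '%s\n' "$names" |
    sed -e 's/^[^_]*_//' -e 's/_.*$//' |
    sort -u)
printf '%s\n' "$result"
```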


Larry


Re: extract string from filename

2006-01-13 Thread Zhao Peng

Kevin,

Thank you very much! I really appreciate it.

I like your find approach, it's simple and easy to understand.

I'll also try to understand your perl approach, when I get time to start
learning it. (Hopefully it won't be un-fulfilled forever)


I have one more question:

Is it possible to number the extracted string2?

Say, the output file contains the following list of extracted string2:

st
region
local

Any idea about what command to use to number the list  to make it look 
like below:


1 st
2 region
3 local

Again, thank you for your help and time!

Zhao

Kevin D. Clark wrote:

> Zhao Peng writes:
>
> > I'm back, with another extract string question. //grin
>
> find FOLDERNAME -name \*sas7bdat -print | sed 's/.*\///' | cut -d _ -f 2 | sort -u > somefile.txt
>
> or
>
> perl -MFile::Find -e 'find(sub{ return unless /sas7bdat$/; $string2 = (split /_/)[1]; $seen{$string2}++; }, @ARGV); map { print "$_\n"; } sort keys %seen' FOLDERNAME
>
> (which looks more readable as:
>
>   perl -MFile::Find -e 'find(sub{ return unless /sas7bdat$/;
>                                   $string2 = (split /_/)[1];
>                                   $seen{$string2}++;
>                                 }, @ARGV);
>
>                         map { print "$_\n"; } sort keys %seen' \
>       FOLDERNAME > somefile.txt
>
> )
>
> Either of these solves the problem that you describe.  Actually, they
> solve more than that: it wasn't apparent to me whether you had any
> subdirectories here, but that case is handled too.
>
> (substitute FOLDERNAME with your directory's name)
>
> Honestly, the first solution I present is the way I would have solved
> this problem myself.  Very fast this way.
>
> Regards,
>
> --kevin



Re: extract string from filename

2006-01-13 Thread Jeff Kinz
On Fri, Jan 13, 2006 at 11:40:26AM -0500, Zhao Peng wrote:
> Kevin,
>
> Thank you very much! I really appreciate it.
>
> I like your find approach, it's simple and easy to understand.
>
> I'll also try to understand your perl approach, when I get time to start
> learning it. (Hopefully it won't be un-fulfilled forever)
>
> I have one more question:
>
> Is it possible to number the extracted string2?
>
> Say, the output file contains the following list of extracted string2:
>
> st
> region
> local
>
> Any idea about what command to use to number the list to make it look
> like below:
>
> 1 st
> 2 region
> 3 local


Pipe the output into pr -n -T

This is not pr's intended use, but it will work.  -n option means put
numbers on the lines, -T option means No page breaks.

The -n option appears to be missing from the FC2 man pages.
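
If the output has to look exactly like the "1 st" example, note that pr -n (like cat -n) pads the number and follows it with a tab. One alternative, offered only as a sketch, is awk's built-in line counter NR; the variable name numbered is illustrative:

```shell
#!/bin/sh
# NR is awk's current line number; "print NR, $0" joins the two with one space.
numbered=$(printf '%s\n' st region local | awk '{ print NR, $0 }')
printf '%s\n' "$numbered"
```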


-- 
Jeff Kinz, Emergent Research, Hudson, MA.
speech recognition software may have been used to create this e-mail

The greatest dangers to liberty lurk in insidious encroachment by men
of zeal, well-meaning but without understanding. - Brandeis

To think contrary to one's era is heroism. But to speak against it is
madness. -- Eugene Ionesco


Re: extract string from filename

2006-01-13 Thread Ben Scott
On 1/13/06, Zhao Peng [EMAIL PROTECTED] wrote:
> Is it possible to number the extracted string2?

find -name \*sas7bdat -printf '%f\n' | cut -d _ -f 2 | sort | uniq | cat -n

  Run that pipeline in the directory you are interested in.

  The find(1) command finds files, based on their name or other
filesystem attributes.

  The -name \*sas7bdat part finds files with file names which match
the pattern.  The backslash escapes the star, to keep the shell from
trying to interpret it, so find gets the star instead.

  The -printf '%f\n' part has find output just the file name, not the path.

  cut(1) is used to split input strings, as you know.  -d _ splits
into fields, based on underscores.  -f 2 outputs the second field
only, one per line.

  sort(1) sorts, and uniq(1) eliminates duplicate lines.

  cat -n numbers the output.
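
To try the pipeline without touching real data, it can be run against a scratch directory. This is a sketch under a few assumptions: GNU find (for -printf) and mktemp are available, the file names are the samples from the original question, and notes.txt is an invented extra file that the -name test should skip.

```shell
#!/bin/sh
# Build a throwaway directory holding empty files with the sample names.
workdir=$(mktemp -d)
for name in abc_st_nh_num abc_st_vt_num abc_st_ma_num \
            abcd_region_NewEngland_num abcd_region_South_num; do
    : > "$workdir/$name.sas7bdat"
done
: > "$workdir/notes.txt"    # not matched by the pattern below

# Same stages as the pipeline above, run inside the scratch directory.
listing=$(cd "$workdir" &&
    find . -name '*sas7bdat' -printf '%f\n' |
    cut -d _ -f 2 | sort | uniq | cat -n)
printf '%s\n' "$listing"

rm -rf "$workdir"
```

Quoting the pattern as '*sas7bdat' has the same effect as the backslash: the shell leaves the star alone and find receives it.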

-- Ben "Pay attention, there's gonna be a quiz next week" Scott


Re: extract string from filename

2006-01-13 Thread Ben Scott
On 1/13/06, Ben Scott [EMAIL PROTECTED] wrote:
> On 1/13/06, Zhao Peng [EMAIL PROTECTED] wrote:
> > Is it possible to number the extracted string2?
>
> find -name \*sas7bdat -printf '%f\n' | cut -d _ -f 2 | sort | uniq | cat -n

  I forgot to mention: If the *only* files in that directory are the
ones with the interesting file names, you can just use this:

ls | cut -d _ -f 2 | sort | uniq | cat -n

-- Ben "I would flunk the quiz" Scott


Re: extract string from filename

2006-01-13 Thread Michael ODonnell


cat -n will number output lines

 


Re: extract string from filename

2006-01-13 Thread Dan Jenkins

Zhao Peng wrote:


> string1_string2_string3_string4.sas7bdat
>
> abc_st_nh_num.sas7bdat
> abc_st_vt_num.sas7bdat
> abc_st_ma_num.sas7bdat
> abcd_region_NewEngland_num.sas7bdat
> abcd_region_South_num.sas7bdat
>
> My goal is to :
> 1, extract string2 from each file name
> 2, then sort them and keep only unique ones
> 3, then output them to a .txt file. (one unique string2 per line)


Solution #1:
ls -1 *sas7bdat|awk -F_ '{print $2}'|sort -fu|cat -n > output.txt

Take output of ls, 1 file per line (ls -1) - only files ending with sas7bdat
Feed into awk, splitting on _, print the 2nd field
Sort ignoring case, eliminating duplicates (sort options: f folds 
case, u keeps only uniques)

Number the lines (cat -n)
Put output in file named output.txt

Solution #2:
ls -1 *sas7bdat|sed 's/^\([a-zA-Z0-9]*_\)\([a-zA-Z0-9]*\)_.*$/\2/'|sort -fu|cat -n > output.txt
Use sed (stream editor) to break up filenames into atoms separated by _,
and output the 2nd one (the \2). Regular expressions (regex) can be very
handy. ^ matches the beginning of the string, [a-zA-Z0-9]*_ matches a
letter/number string ending with _, and the backslashed parentheses group
the patterns, so the 2nd one can be extracted.
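
A quick way to check that the awk and sed variants agree is to feed both the same sample names. This is only a sketch: the names come from the original question and the variable names are illustrative.

```shell
#!/bin/sh
names='abc_st_nh_num.sas7bdat
abc_st_vt_num.sas7bdat
abcd_region_NewEngland_num.sas7bdat'

# Solution #1: split on underscores and print the second field.
with_awk=$(printf '%s\n' "$names" | awk -F_ '{print $2}' | sort -fu)

# Solution #2: capture the first two atoms and keep only the second (\2).
with_sed=$(printf '%s\n' "$names" |
    sed 's/^\([a-zA-Z0-9]*_\)\([a-zA-Z0-9]*\)_.*$/\2/' | sort -fu)

printf 'awk:\n%s\nsed:\n%s\n' "$with_awk" "$with_sed"
```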


There are many solutions to the problem, as you can see.

--
Dan Jenkins ([EMAIL PROTECTED])
Rastech Inc., Bedford, NH, USA --- 1-603-206-9951
*** Technical Support Excellence for over a quarter century



Re: extract string from filename

2006-01-13 Thread Paul Lussier
[EMAIL PROTECTED] (Kevin D. Clark) writes:

> Zhao Peng writes:
>
> > I'm back, with another extract string question. //grin
>
> find FOLDERNAME -name \*sas7bdat -print | sed 's/.*\///' | cut -d _ -f 2 | sort -u > somefile.txt

Or, to simplify this:

  find ./ -name \*sas7bdat | awk -F_ '{print $2}' | sort -u
  ls *sas7bdat | perl -F_ -ane 'print "$F[1]\n";' | sort -u
  perl -e 'opendir(DIR,"."); map { if (/sas7bdat$/) { $k = (split(/_/,$_))[1]; $f{$k} = 1; } } readdir(DIR); map { print "$_\n"; } sort keys %f;'

That last one might be a little better formatted like:

  perl -e 'opendir(DIR,".");
           map { if (/sas7bdat$/) {
                   $k = (split(/_/,$_))[1];
                   $f{$k} = 1;
                 }
               } readdir(DIR);
           map { print "$_\n"; } sort keys %f;'

It should be rather obvious that your best bet for quick one-liners
for this type of thing is to probably stick with standard UNIX tools
like sort, cut, sed, awk, etc.  Perl is great for text manipulation,
but as you can see, none of the perl one-liners has been nearly as
concise as the shell variants.  If speed matters, or process overhead,
then maybe perl is better.  Of course for such a small data set as
you've given, the perl versions are both harder and longer to type.

hth.
-- 

Seeya,
Paul


Re: extract string from filename

2006-01-13 Thread Paul Lussier
Tom Buskey [EMAIL PROTECTED] writes:

> Unix Shell Programming by Kochan and Wood is a classic on shell programming
>
> Portable Shell Programming by Blinn
> The Awk Programming Language by Aho, Weinberger and Kernighan

I'm also a big fan of Kernighan and Pike's The UNIX Programming
Environment.  When I first saw this book I thought it was going to be
more of a C programming book explaining things like linking and
compiling under UNIX.  However, it turned out to be simply a great book
on how to get around the shell and do a variety of things in the UNIX
environment.  It is so named because, as we've all seen here, the
shell is *programmable* :)

And, yet another plug for my all-time favorite UNIX book, The UNIX
Philosophy by Mike Gancarz, which has recently been updated with a
second edition (which I have not yet read), Linux and the Unix
Philosophy.  This book does a fantastic job of explaining exactly
*why* UNIX is such a great environment, and why other competing
environments just can't compete when what you need is raw power and
flexibility.
-- 

Seeya,
Paul


extract string from filename

2006-01-12 Thread Zhao Peng

Hi all,

I'm back, with another extract string question. //grin

I have almost 1k small files within one folder. The only pattern of the 
file names is:


string1_string2_string3_string4.sas7bdat

Note:
1, string2 often repeat itself across each file name
For example:
abc_st_nh_num.sas7bdat
abc_st_vt_num.sas7bdat
abc_st_ma_num.sas7bdat
abcd_region_NewEngland_num.sas7bdat
abcd_region_South_num.sas7bdat

2, All 4 strings contain no underscores.
3, 4 strings are separated by  3 underscores (as you can see)
4, The length of all 4 strings are not fixed.

My goal is to :
1, extract string2 from each file name
2, then sort them and keep only unique ones
3, then output them to a .txt file. (one unique string2 per line)

I tried to use cut commands, but can't even figure out how to use the 
filenames as input. Anyone care to offer me a hint?


I also downloaded an e-book called Learning Perl (O'Reilly,
4th Edition), and had a quick look through its Table of Contents, but did
not find any chapter that looks likely to address any issue related to
my question.


Thank you very much!

Zhao


Re: extract string from filename

2006-01-12 Thread Ben Scott
On 1/12/06, Zhao Peng [EMAIL PROTECTED] wrote:
> I'm back, with another extract string question. //grin

  It sounds like you could use a tutorial on Unix text processing and
command line tools, specifically, one which addresses pipes and
redirection, as well as the standard text tools (grep, cut, sed, awk,
etc.).  While Paul's recommendation about the O'Reilly regular
expressions book is valid, I suspect it might be a little too focused
on regex's and not cover some of the *other* elements you seem to be
needing.

  It's been forever for me, but I seem to recall that _Unix Power
Tools_, also published by O'Reilly, covers all of the above and much,
much more.  If others on this list second my suggestion, you might
want to obtain a copy.  Alternatively, maybe list members can suggest
alternatives?

  There are also a number of free guides at the Linux Documentation
Project.  See:

http://www.tldp.org/guides.html

  Look for anything mentioning bash (the Bourne-again shell) or
scripting.  I can't speak as to how good they are, but you can't beat
the price.

  Anyway, on to your question...

> I tried to use cut commands, but can't even figure out how to use the
> filenames as input. Anyone care to offer me a hint?

  You'll want to pipe the output of ls to cut.  This should get you started:

  ls -1 | cut -d _ -f 2

  The -1 switch to ls(1) tells it to output a single column of file
names.  Some versions of ls do this automagically when using
redirection, but it is best to be sure.  The -d _ switch to cut(1)
tells cut to split fields on the underscore.  The -f 2 selects the
second field.
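
To see what cut does with a single name, here is a small sketch using one of the sample file names from the question (the variable names are illustrative). Note that the extension stays attached to the last field, since cut only splits on the underscore:

```shell
#!/bin/sh
name='abc_st_nh_num.sas7bdat'

# -d _ sets the delimiter; -f N selects the Nth field, counting from 1.
second=$(printf '%s\n' "$name" | cut -d _ -f 2)
fourth=$(printf '%s\n' "$name" | cut -d _ -f 4)

printf 'field 2: %s\nfield 4: %s\n' "$second" "$fourth"
```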

  See also: sort(1), uniq(1)

  Hope this helps!

-- Ben "Unix plumber" Scott


Re: extract string from filename

2006-01-12 Thread Python
On Thu, 2006-01-12 at 19:40 -0500, Zhao Peng wrote:
> For example:
> abc_st_nh_num.sas7bdat
> abc_st_vt_num.sas7bdat
> abc_st_ma_num.sas7bdat
> abcd_region_NewEngland_num.sas7bdat
> abcd_region_South_num.sas7bdat

You're not the only one learning here.  

I put these names into a file called str2-test-data

$ cut -d _ -f 2 str2-test-data | sort | uniq
region
st

I think that you could use:
ls | cut -d _ -f 2 | sort | uniq > str2-results.txt

-- 
Lloyd Kvam
Venix Corp



Re: extract string from filename

2006-01-12 Thread Kevin D. Clark

Zhao Peng writes:

> I'm back, with another extract string question. //grin


find FOLDERNAME -name \*sas7bdat -print | sed 's/.*\///' | cut -d _ -f 2 | sort -u > somefile.txt

or

perl -MFile::Find -e 'find(sub{ return unless /sas7bdat$/; $string2 = (split /_/)[1]; $seen{$string2}++; }, @ARGV); map { print "$_\n"; } sort keys %seen' FOLDERNAME

(which looks more readable as:

  perl -MFile::Find -e 'find(sub{ return unless /sas7bdat$/;
                                  $string2 = (split /_/)[1];
                                  $seen{$string2}++;
                                }, @ARGV);

                        map { print "$_\n"; } sort keys %seen' \
      FOLDERNAME > somefile.txt

)

Either of these solves the problem that you describe.  Actually, they
solve more than that: it wasn't apparent to me whether you had any
subdirectories here, but that case is handled too.

(substitute FOLDERNAME with your directory's name)


Honestly, the first solution I present is the way I would have solved
this problem myself.  Very fast this way.
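
A minimal sketch of the subdirectory case, assuming mktemp is available; the directory layout and variable names below are made up for illustration. The sed stage strips the leading path, so underscores in directory names cannot confuse cut:

```shell
#!/bin/sh
# Scratch tree: one sample file at the top level, one in a subdirectory
# whose name deliberately contains underscores.
workdir=$(mktemp -d)
mkdir -p "$workdir/old_data_2005"
: > "$workdir/abc_st_nh_num.sas7bdat"
: > "$workdir/old_data_2005/abcd_region_South_num.sas7bdat"

# find prints full paths; sed keeps only what follows the last slash.
found=$(find "$workdir" -name '*sas7bdat' -print |
    sed 's/.*\///' | cut -d _ -f 2 | sort -u)
printf '%s\n' "$found"

rm -rf "$workdir"
```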

Regards,

--kevin
-- 
(There are also 228 babies named Unique during the 1990s alone,
and 1 each of Uneek, Uneque, and Uneqqee.)

-- _Freakonomics_, Steven D. Levitt and Stephen J. Dubner


[but no Unix folks named their kids uniq, apparently.  --kevin]
