On 1/11/06, Zhao Peng <[EMAIL PROTECTED]> wrote:
Hi All,
First, I really cannot be more grateful for the answers to my question
from all of you. I appreciate your help and time. I'm especially touched
by the outpouring of response on this list, which I have never
experienced before anywhere else.
 
  I hope my little comment didn't seem mean. I was more poking fun at the fact that if someone posted a similar post and called themselves a Systems Administrator on a Windows network, comments similar to mine would have come forth.  ;-)
 
Secondly, I'm sorry for the big stir-up as to "homework problems" which
flooded the list, since I'm the origin of it.
 
  Nah, it wasn't a flood.  Trust me, once you see a flood, you'll know it.  Usually, it's because someone says something political in nature.

 
Kenny, regarding the missing column issue, let me try to explain it again.
Below is quoted from my original post:
============================================
Also, if one column is missing, and "," is used to indicate that missing
column, like the following (2nd column of 3rd line is missing):
"name","age","school"
"jerry" ,"21","univ of Vermont"
"jesse",,,"Dartmouth college"
"jack","18","univ of Penn"
"john","20","univ of south Florida"
===========================================
You said that "there is an extra column in the 3rd line". I disagree
with you from my perspective. As you can see, there are 3 commas in
between "jesse" and "Dartmouth college". For these 3 commas, again, if
we think of the 2nd one as merely an indication that the value for the age
column is missing, then the 3rd line will be read as ["jesse",
MISSING, "Dartmouth college"], not ["jesse",empty,empty, "Dartmouth
college"] as you suggested.
 
  This is unusual.  Typically, a comma-delimited set of values would simply have nothing between the commas, or a pair of quotes with no data.
 
  Typically the line would look like this:
 
"jesse",,"Dartmouth college"
 
  Or
 
"jesse","","Dartmouth college"

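  A rough sketch of how the two typical forms parse, assuming no field contains a comma of its own.  The helper name split_fields is made up for illustration.  Both forms come back as three fields with an empty age, while the three-comma line from the original post comes back as four:

```perl
use strict;
use warnings;

# Split one comma-delimited line and strip the surrounding quotes.
# Assumption: no field contains a comma of its own.
sub split_fields {
    my ($line) = @_;
    # The -1 limit keeps empty trailing fields instead of dropping them.
    my @fields = split /,/, $line, -1;
    # Remove a leading quote and a trailing quote (plus stray spaces).
    s/^\s*"|"\s*$//g for @fields;
    return @fields;
}

for my $line ('"jesse",,"Dartmouth college"',
              '"jesse","","Dartmouth college"',
              '"jesse",,,"Dartmouth college"') {
    my @f = split_fields($line);
    my $age = length($f[1]) ? $f[1] : 'MISSING';
    printf "%d fields, age is %s\n", scalar(@f), $age;
}
```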
 
Paul, as to your "simplest by what measurement" question. I was thinking
of both "easiest to remember" and "easiest to understand" when I was
posting my question. Now I want the "most efficient" approach. I know
that will be my homework.
 
  If this is something that you will be doing repeatedly for different file types, I'd highly suggest getting familiar with regular expressions.  You've seen a small snippet in Kenny's example 'sed s/\"//g'.  The 's/\"//g' says to globally replace all quotes with nothing: s = substitute, and /1/2/ says 'replace everything matching 1 with 2' (in this case, a quote with nothing).  g means globally, i.e. replace every match on the line rather than just the first.  Regular expressions are a powerful way to parse text files based on a given pattern, to get at the data you want.
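
  The same substitution works in Perl.  A small sketch on a made-up sample line, showing what the g flag changes:

```perl
use strict;
use warnings;

my $line = '"jack","18","univ of Penn"';

# Copy the line, then substitute on the copy so $line stays intact.
(my $first_only = $line) =~ s/"//;     # no g: only the first quote is removed
(my $all_quotes = $line) =~ s/"//g;    # g: every quote on the line is removed

print "$first_only\n";    # jack","18","univ of Penn"
print "$all_quotes\n";    # jack,18,univ of Penn
```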

 
Part of my primary job responsibilities is to convert raw data into SAS
data sets. My "extract string" question comes from processing a raw data
file in .txt format, which doesn't have any documentation, except the
variable list. By looking at the raw data, I know that each variable is
separated by a comma. For one particular variable (column) called
"school", some of its values are quite long (like: Univ of
Wisconsin at Madison, Health Sci Ctr), but I don't know the definite
length. I need to know it, because if the length I specify is not
enough, only partial values will be read. Many of its values contain
"univ", so I just thought if I could extract all strings containing
"univ" from that variable (column), I would have a better chance to figure
out the length of "school". That's why I had this question.
 
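  Haven't even run it, but something Perl-like:
 
use strict;
use warnings;

my $maxlen = 0;
while (<>) {
  chomp;
  # Capture three comma-separated fields; skip lines that don't match.
  next unless /^(.*),(.*),(.*)$/;
  # Keep the length of the longest third field, not the field itself.
  if (length($3) > $maxlen) {
    $maxlen = length($3);
  }
}
print "Longest string in third column is $maxlen characters\n";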
 
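  This would read from STDIN until there is nothing left to read.  For each line, it would split based on the commas and check the length of the third field against the max length.  If it's longer, the max length is updated.  At the end, it prints the result.  (FYI: if the third column contains commas, this won't work, because the (.*) groups are greedy, so $1 would gobble everything up to the last two commas and $3 would only hold the text after the final comma.  The surrounding quotes are also counted in the length.)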
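
  If values like "Univ of Wisconsin at Madison, Health Sci Ctr" are quoted in the file, one possible way around the comma problem is parse_line from the core Text::ParseWords module, which keeps quoted commas inside their field.  This is only a sketch: the helper name longest_in_column and the sample rows are made up, and parse_line also treats backslashes and unquoted apostrophes specially, so it is worth checking against the real data.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Return the length of the longest value in column $col (0-based),
# with the surrounding quotes removed.
sub longest_in_column {
    my ($col, @lines) = @_;
    my $maxlen = 0;
    for my $line (@lines) {
        chomp $line;
        # Split on commas; 0 means "do not keep the quotes".
        my @fields = parse_line(',', 0, $line);
        next unless defined $fields[$col];
        my $len = length $fields[$col];
        $maxlen = $len if $len > $maxlen;
    }
    return $maxlen;
}

# Hypothetical sample rows; the second has a comma inside the quotes.
my @sample = (
    '"jack","18","univ of Penn"',
    '"pat","30","Univ of Wisconsin at Madison, Health Sci Ctr"',
);
print "Longest value in third column: ", longest_in_column(2, @sample), "\n";
```

  To run it against a file instead of the sample rows, the lines could be read with <> and passed in the same way.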
 
  This regular expression isn't great, but it's the 20-second typing version.
 
  Thomas
