William Martell <[EMAIL PROTECTED]> wrote: ============================================================ NOTE: This response is long. If you want to ask a question be responsible and delete anything that is not relevant to your question. Do NOT top post and repeat the entire message. Do NOT use the Outlook Excuse(TM). ============================================================
: Hello Mr. Clarkson, That's my father's name. Please call me Charles. :) : When I run my code on the attached test data file, I get : this. : : [RESULT] : " : price: $825","beds: 3","baths: 1"," : price: $825","phone: 214-821-2794","address: 2731 San ... : View ","paper: The Dallas Morning News", ... [snipped some] : price: $975","beds: 4","baths: 2"," : price: $975","phone: 469-569-0099","location: ... : Morning News","date: 04/04/2004", : : [/RESULT] : : My desired result is this, (I want to be able to import : this data into excel) : : [DESIRED] : price: beds: bath: phone: address: location: paper: date: : data, data, data, data, data, data, data, data : [/DESIRED] There are a lot of ways to approach this. Two come to mind immediately. We could print each line as you complete an advertisement or we could collect all the information and print the lines at once. We'll look at first of these approaches below. : The problem is that I am getting duplicate matches with my : regular expressions Yes, I noticed that. I suggest a different way to view your data below. : and I am not getting the same fields on each record. We'll handle this with default fields. : I would like to be able to import this data in excel from : a csv text file. For our other readers: A csv text file is a file where each field is in double quotes and separated by a comma. Our program could be broken down this way: 1. extract fields from text file for one ad. 2. print that ad as a csv line. 3. rinse and repeat. ============================================================ NOTE: This response is long. If you want to ask a question be responsible and delete anything that is not relevant to your question. Do NOT top post and quote the entire message. ============================================================ I'm going to approach this from the csv file end first. Let's pretend that we already have extracted the fields from an ad. To make everything easier (on me), I'm going to use a hash to hold an ad. my %ad = ( location => 'Mesquite', price => 725, bedrooms => 3, bathrooms => 1, living_areas => '', phone => '972-289-2098', arrangement => '3/2/2', address => '', paper => 'The Dallas Morning News', date => '04/08/2004' ); Now we will create a subroutine that takes %ad and converts it to a csv line. Feel free to ask questions about anything you don't understand. (Be certain to trim this response to save repeating the whole thing.) print csv( \%ad ), "\n"; sub csv { my $ad = shift; # if you need to change field order # do it here my @field_order = qw| price bedrooms bathrooms phone address arrangement location paper date |; # add double quotes (") foreach my $value ( values %$ad ) { $value = qq("$value"); } # use a hash slice to order the fields return join ',', @$ad{ @field_order }; } One advantage to csv() is the ease of adding column names to the csv file: # column names print csv( { date => 'Date', price => 'Price', bedrooms => 'Bedrooms', bathrooms => 'Bathrooms', living_areas => 'Living_areas', phone => 'Phone', address => 'Address', arrangement => 'Arrangement', location => 'Location', paper => 'Paper', } ), "\n"; ============================================================ NOTE: This response is long. If you want to ask a question be responsible and delete anything that is not relevant to your question. Do NOT top post and repeat the entire message. ============================================================ Now let's look at how to get the ads into a hash. Here's two ads for those just joining us: ---------------------------------------------------------------------------- ---- Ads basket. Check to add. Uncheck to remove. Price: $725 BD: 3 BA: 1 LA: Home 3bd-1ba-1,ch/a new carpet & paint fence stove $725+dep. 972-289-2098 Mesquite - The Dallas Morning News - 04/08/2004 ---------------------------------------------------------------------------- ---- Ads basket. Check to add. Uncheck to remove. Price: $950 BD: 3 BA: 1.5 LA: Home 3bd/1.5ba/2 1620 Summitt $950+dep Sec 8ok 469-525-0204; 214-957-8176 Mesquite - The Dallas Morning News - 04/08/2004 At first glance the 80 '-' dashes look like the start to a new advertisement. But someone might enter that into the description field. So let's use the Price line as the first line of each ad. For this solution I will place the data below the __END__ tag in the script. This allows us to retrieve the data with the special file handle DATA. I won't add the data and the sub to each rewrite of the script to save space. In fact, let's just rewrite the loop only. We start with a simple loop. (It doesn't do anything yet!): # column names print csv( { date => 'Date', price => 'Price', bedrooms => 'Bedrooms', bathrooms => 'Bathrooms', living_areas => 'Living_areas', phone => 'Phone', address => 'Address', arrangement => 'Arrangement', location => 'Location', paper => 'Paper', } ), "\n"; while ( <DATA> ) { } sub csv { my $ad = shift; # if you need to change field order # do it here my @field_order = qw| price bedrooms bathrooms phone address location paper date |; # add double quotes (") foreach my $value ( values %$ad ) { $value = qq("$value"); } # use a hash slice to order the fields return join ',', @$ad{ @field_order }; } __END__ ---------------------------------------------------------------------------- ---- Ads basket. Check to add. Uncheck to remove. Price: $725 BD: 3 BA: 1 LA: Home 3bd-1ba-1,ch/a new carpet & paint fence stove $725+dep. 972-289-2098 Mesquite - The Dallas Morning News - 04/08/2004 ---------------------------------------------------------------------------- ---- Ads basket. Check to add. Uncheck to remove. Price: $950 BD: 3 BA: 1.5 LA: Home 3bd/1.5ba/2 1620 Summitt $950+dep Sec 8ok 469-525-0204; 214-957-8176 Mesquite - The Dallas Morning News - 04/08/2004 ============================================================ NOTE: This response is long. If you want to ask a question be responsible and delete anything that is not relevant to your question. Do NOT top post and repeat the entire message. ============================================================ Now let's rewrite the loop to check for a new ad: while ( <DATA> ) { print "new ad\n" if /^Price/; } I get this (the first line may wrap): "Price","Bedrooms","Bathrooms","Phone","Address","Location","Paper","Date" new ad new ad That seems to be the correct number of ads. So now we know when we are starting a new ad. Let's assume that each in each ad we can't count on any field existing. So it is possible to get an ad that looks like this: { date => '', price => '', bedrooms => '', bathrooms => '', living_areas => '', phone => '', address => '', arrangement => '', location => '', paper => '', } Let's assume that we don't want to include such an ad in our csv file. We'll create a function that tells us if %ad has any values that are filled in. We call it is_valid_ad() and it will return true if the ad has at least one field filled in and false otherwise. sub is_valid_ad { my $ad = shift; foreach $value ( values @$ad ) { return 1 unless $value eq ''; } return 0; } There are a lot of other ways to write this, but we'll use this for now. This part of our program might look like this now: my %ad; while ( <DATA> ) { if ( /^Price/ ) { # add the last ad if it exists print csv( \%ad ), "\n" if is_valid_ad( \%ad ); # reset ad %ad = ( date => '', price => '', bedrooms => '', bathrooms => '', living_areas => '', phone => '', address => '', arrangement => '', location => '', paper => '', ); } } sub is_valid_ad { my $ad = shift; foreach my $value ( values %$ad ) { # return true if one field is valid return 1 unless $value eq ''; } # return false if we get here return; } Notice that this neatly bypasses printing an ad on the first pass. There are now a whole bunch smarty pants readers with there hand in the air shouting "Oooh, Oooh, Oooh!". The problem with our loop is the last ad won't print. We need to add a final after the while loop: print csv( \%ad ), "\n" if is_valid_ad( \%ad ); [Please put your hands down!] In the original data provided, the Home section can sometimes be multiple lines. We are going to skip extacting the address because it would probably a depend ona routine bordering on Artificial Intelligence to be 99% accurate. The same goes for 'arrangement' actually. Being a real estate investor as well as a perl programmer, I can tell you that each person has a different idea about what 3/2/2/3 means. It also changes by locale. For this response, let's skip those two fields completely, it is up to the reader to find a reliable method of extracting them. We will include those fields in the script, we just won't fill them in. Finally, we are assuming that we have permission from the newspaper to do this or that we are doing it for personal use only. Let's look at the regex needed to get the price and number of rooms first. Remember that any field could be blank and that each field and name are separated by white space. I am going to take a longer approach to this regex. Readers might find it inefficient. Feel free to play with it and post it to the list. (Remember not to quote this entire message though.) Here's a script I wrote to test the "price" regex: #!/usr/bin/perl use strict; use warnings; #use diagnostics; use Data::Dumper 'Dumper'; my @test = ( 'Price: $725 BD: 3 BA: 1 LA:', 'Price: BD: 3 BA: 1 LA: 2', 'Price: $725 BD: 3 BA: 1 LA: 2', 'Price: $725 BD: 3 BA: LA:', ); foreach my $sample ( @test ) { my @fields = split /Price:|BD:|BA:|LA:/, $sample; print Dumper [EMAIL PROTECTED]; } __END__ It came up with this output (I edited it to save space): $VAR1 = [ '', ' $725 ', ' 3 ', ' 1 ' ]; $VAR1 = [ '', ' ', ' 3 ', ' 1 ', ' 2' ]; $VAR1 = [ '', ' $725 ', ' 3 ', ' 1 ', ' 2' ]; $VAR1 = [ '', ' $725 ', ' 3 ' ]; That's close but I found this better: foreach my $sample ( @test ) { $sample =~ s/ //g; my @fields = split /Price:|BD:|BA:|LA:/, $sample; shift @fields; push @fields, ( '' ) x ( 4 - @fields ); print Dumper [EMAIL PROTECTED]; } $VAR1 = [ '$725', '3', '1', '' ]; $VAR1 = [ '', '3', '1', '2' ]; $VAR1 = [ '$725', '3', '1', '2' ]; $VAR1 = [ '$725', '3', '', '' ]; Let's make this a subroutine and test it: foreach my $sample ( @test ) { print Dumper [ price_fields( $sample ) ]; } sub price_fields { my $line = shift; # collapse all spaces $line =~ s/\s//g; my @fields = split /Price:|BD:|BA:|LA:/, $line; # delete extra first field shift @fields; # pad missing fields push @fields, ( '' ) x ( 4 - @fields ); return @fields; } __END__ I get the same result. Now let's add that to our script: my %ad; while ( <DATA> ) { if ( /^Price/ ) { # add the last ad if it exists print csv( \%ad ), "\n" if is_valid_ad( \%ad ); # Set ad defaults %ad = ( date => '', price => '', bedrooms => '', bathrooms => '', living_areas => '', phone => '', address => '', arrangement => '', location => '', paper => '', ); # update %ad for this line @ad{ qw| price bedrooms bathrooms living_areas | } = price_fields( $_ ); } } sub price_fields { my $line = shift; # collapse all spaces $line =~ s/\s//g; my @fields = split /Price:|BD:|BA:|LA:/, $line; # delete extra first field shift @fields; # pad missing fields push @fields, ( '' ) x ( 4 - @fields ); return @fields; } As I mentioned earlier we will be skipping the 'Home' line so I can cheat. :) So let's add that: my %ad; while ( <DATA> ) { if ( /^Price/ ) { # add the last ad if it exists print csv( \%ad ), "\n" if is_valid_ad( \%ad ); # Set ad defaults %ad = ( date => '', price => '', bedrooms => '', bathrooms => '', living_areas => '', phone => '', address => '', arrangement => '', location => '', paper => '', ); # update %ad for this line @ad{ qw| price bedrooms bathrooms living_areas | } = price_fields( $_ ); } if ( /^Home/ ) { @ad{ qw| address arrangement | } = home_fields( $_ ); } } sub home_fields { # This is a dummy function return ( '', '' ); } And now for the location line. I used this regex. The x option allows us to add white space: m/ (\S+) # location ($1) \s+-\s+ ([^-]+) # paper with trailing spaces ($2) \s+ (\d\d\/\d\d\/\d{4}) # date ($3) /x So we add the following to our loop: if ( m/ (\S+) # location ($1) \s+-\s+ ([^-]+) # paper and spaces ($2) -\s+ (\d\d\/\d\d\/\d{4}) # date ($3) /x ) { # use hash slice to set fields @ad{ qw| location paper date | } = ( $1, $2, $3 ); # trim trailing spaces in paper field $ad{paper} =~ s/\s+$//; } Boy, that looks messy! But it works! :) Let's take a final look at the script and let's look at some ways to improve future script maintenance. #!/usr/bin/perl use strict; use warnings; #use diagnostics; use Data::Dumper 'Dumper'; # column names print csv( { date => 'Date', price => 'Price', bedrooms => 'Bedrooms', bathrooms => 'Bathrooms', living_areas => 'Living_areas', phone => 'Phone', address => 'Address', location => 'Location', arrangement => 'Arrangemant', paper => 'Paper', } ), "\n"; my %ad; while ( <DATA> ) { if ( /^Price/ ) { # add the last ad if it exists print csv( \%ad ), "\n" if is_valid_ad( \%ad ); # Set field defaults %ad = ( date => '', price => '', bedrooms => '', bathrooms => '', living_areas => '', phone => '', address => '', arrangement => '', location => '', paper => '', ); # update %ad for this line @ad{ qw| price bedrooms bathrooms living_areas | } = price_fields( $_ ); } if ( /^Home/ ) { @ad{ qw| address arrangement | } = home_fields( $_ ); } if ( m/ (\S+) # location ($1) \s+-\s+ ([^-]+) # paper with spaces ($2) -\s+ (\d\d\/\d\d\/\d{4}) # date ($3) /x ) { # use hash slice to set fields @ad{ qw| location paper date | } = ( $1, $2, $3 ); # trim trailing spaces in paper field $ad{paper} =~ s/\s+$//; } } print csv( \%ad ), "\n" if is_valid_ad( \%ad ); sub home_fields { # This is a dummy function return ( '', '' ); } sub price_fields { my $line = shift; # collapse all spaces $line =~ s/\s//g; my @fields = split /Price:|BD:|BA:|LA:/, $line; # delete extra first field shift @fields; # pad missing fields push @fields, ( '' ) x ( 4 - @fields ); return @fields; } sub is_valid_ad { my $ad = shift; foreach my $value ( values %$ad ) { # return true if one field is valid return 1 unless $value eq ''; } # return false if we get here return; } sub csv { my $ad = shift; # if you need to change field order # do it here my @field_order = qw| price bedrooms bathrooms phone address arrangement location paper date |; # add double quotes (") foreach my $value ( values %$ad ) { $value = qq("$value"); } # use a hash slice to order the fields return join ',', @$ad{ @field_order }; } __END__ ---------------------------------------------------------------------------- ---- Ads basket. Check to add. Uncheck to remove. Price: $725 BD: 3 BA: 1 LA: Home 3bd-1ba-1,ch/a new carpet & paint fence stove $725+dep. 972-289-2098 Mesquite - The Dallas Morning News - 04/08/2004 ---------------------------------------------------------------------------- ---- Ads basket. Check to add. Uncheck to remove. Price: $950 BD: 3 BA: 1.5 LA: Home 3bd/1.5ba/2 1620 Summitt $950+dep Sec 8ok 469-525-0204; 214-957-8176 Mesquite - The Dallas Morning News - 04/08/2004 Here's the output I got: "Price","Bedrooms","Bathrooms","Phone","Address","Arrangemant","Location","P aper","Date" "$725","3","1","","","","Mesquite","The Dallas Morning News","04/08/2004" "$950","3","1.5","","","","Mesquite","The Dallas Morning News","04/08/2004" Right off we can see your intrepid instructor forgot the 'phone' field. I'll leave that up to you. The other thing I notice is that we need to update too many areas to add or change fields. That could really suck down the line, but this reply is pushing 750 lines, so I'll give everyone a break and shut up. :) HTH, Charles K. Clarkson -- Mobile Homes Specialist 254 968-8328 ============================================================ NOTE: This response is long. If you want to ask a question be responsible and delete anything that is not relevant to your question. Do NOT top post and repeat the entire message. Do NOT use the Outlook Excuse(TM). ============================================================ -- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] <http://learn.perl.org/> <http://learn.perl.org/first-response>