stas 02/03/03 03:27:22
Modified: src/search .swishcgi.conf README SwishSpiderConfig.pl
search.tt spider.pl swish.cgi swish.conf
Log:
- updating search utility and configs
Submitted by: Bill Moseley <[EMAIL PROTECTED]>
Reviewed by: stas
Revision Changes Path
1.2 +16 -0 modperl-docs/src/search/.swishcgi.conf
Index: .swishcgi.conf
===================================================================
RCS file: /home/cvs/modperl-docs/src/search/.swishcgi.conf,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -r1.1 -r1.2
--- .swishcgi.conf 30 Jan 2002 06:35:00 -0000 1.1
+++ .swishcgi.conf 3 Mar 2002 11:27:21 -0000 1.2
@@ -6,5 +6,21 @@
options => {
INCLUDE_PATH => '.',
},
+ },
+ select_by_meta => {
+ #method => 'radio_group', # pick: radio_group, popup_menu, or
checkbox_group
+ method => 'checkbox_group',
+ #method => 'popup_menu',
+ columns => 6,
+ metaname => 'section', # Can't be a metaname used elsewhere!
+ values => [qw/about contribute docs download maillist products
stats stories support/],
+ labels => {
+ about => 'About mod_perl',
+ doc => 'Documentation',
+ stories => 'Sucess Stories',
+ support => 'Support',
},
+ description => 'Limit search to these areas: ',
+ },
+
};
1.2 +39 -6 modperl-docs/src/search/README
Index: README
===================================================================
RCS file: /home/cvs/modperl-docs/src/search/README,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -r1.1 -r1.2
--- README 4 Feb 2002 09:22:27 -0000 1.1
+++ README 3 Mar 2002 11:27:22 -0000 1.2
@@ -12,20 +12,28 @@
Indexing:
---------
-1. normally build the site:
+1. Set an environment variable to the path of the site:
- % bin/build -f (-d to build pdfs)
+ export MODPERL_SITE='http://perl.org'
-which among other things creates the dir: dst_html/search
+or
+
+ export MODPERL_SITE='http://localhost:4000/dst_html'
+
+This is used as the base for spidering, plus is used to determine
+the sections of the site (for limiting the site to those sections.
+
-2. check that swish.conf points to the right base URL, e.g.:
+2. normally build the site:
- SwishProgParameters default http://localhost/modperl-site/
+ % bin/build -f (-d to build pdfs)
+
+which among other things creates the dir: dst_html/search
3. Index the site
% cd dst_html/search
- % swish-e -S prog -c swish.conf
+ % ./swish-e -S prog -c swish.conf
You should see something like:
@@ -81,3 +89,28 @@
+How does indexing work?
+-----------------------
+
+Swish is run with a config file, and is run in a mode that says
+to use an external program to fetch documents. That external program
+is called spider.pl (part of the swish-e distribution).
+
+spider.pl uses a config file (by default) of SwishSpiderConfig.pl. This file
+builds an array of hashes (in this case a sinlge hash in the array). This
hash
+is the config.
+
+Part of the config are call-back functions that spider.pl will call while
spidering.
+One says to skip image files. Another one is a bit more tricky. It splits
a document into
+sections, creates new "sub-pages" that are complete HTML pages, and calls
the function in spider.pl
+that sends those off to swish for indexing. (That function then returns
false to tell swish not to
+index that document since the sections have already been indexed.)
+
+That's about it.
+
+One trick. For debugging you can run the spider without indexing.
+
+ ./spider.pl > bigfile.out
+
+Another trick, you can send SIGHUP to spider.pl while indexing and
+it will stop spidering, but let swish index what's been read so far.
1.3 +20 -8 modperl-docs/src/search/SwishSpiderConfig.pl
Index: SwishSpiderConfig.pl
===================================================================
RCS file: /home/cvs/modperl-docs/src/search/SwishSpiderConfig.pl,v
retrieving revision 1.2
retrieving revision 1.3
diff -u -r1.2 -r1.3
--- SwishSpiderConfig.pl 7 Feb 2002 07:26:15 -0000 1.2
+++ SwishSpiderConfig.pl 3 Mar 2002 11:27:22 -0000 1.3
@@ -2,13 +2,19 @@
#
# a few custom callbacks are located after the @servers definition section.
+
+
+my $base_path = $ENV{MODPERL_SITE} || die "must set \$ENV{MODPERL_SITE}";
+
+die "Don't use trailing slash in MODPERL_SITE" if $base_path =~ m!/$!;
+
+
@servers = (
{
- base_url => 'http://mardy:40994/dst_html/index.html',
+ base_url => "$base_path/index.html",
# Debugging -- see perldoc spider.pl
- #base_url =>
'http://mardy.hank.org:40994/dst_html/docs/guide/index.html',
#max_depth => 1,
#debug => DEBUG_HEADERS,
#debug => DEBUG_URL|DEBUG_SKIPPED|DEBUG_INFO,
@@ -21,12 +27,9 @@
delay_min => .0001,
+
# Ignore images files
- test_url => sub {
- return if $_[0]->path =~ /\.(?:gif|jpeg|.png|.gz)$/i;
- return unless $_[0]->path =~ m!^/preview/modperl-site!;
- return 1;
- },
+ test_url => sub { return $_[0]->path !~ /\.(?:gif|jpeg|.png|.gz)$/i
},
# Only index text/html
test_response => sub { return $_[2]->content_type =~ m[text/html]
},
@@ -35,7 +38,7 @@
filter_content => \&split_page,
# optionally validate external links
- validate_links => 1,
+ validate_links => $ENV{VALIDATE_LINKS} || 0,
},
);
@@ -92,11 +95,20 @@
$head->push_content( $title );
}
+ # Extract out part of the path to use for limiting searches to parts of
the document tree.
+
+ if ( $uri =~ m!$base_path/([^/]+)/.+$! ) {
+ my $meta = HTML::Element->new('meta', name=> 'section', content =>
$1);
+ $head->push_content( $meta );
+ }
+
+
my $body = HTML::Element->new('body');
my $doc = HTML::Element->new('html');
$body->push_content( $section );
$doc->push_content( $head, $body );
+
my $new_content = $doc->as_HTML(undef,"\t");
output_content( $params->{server}, \$new_content,
1.4 +1 -0 modperl-docs/src/search/search.tt
Index: search.tt
===================================================================
RCS file: /home/cvs/modperl-docs/src/search/search.tt,v
retrieving revision 1.3
retrieving revision 1.4
diff -u -r1.3 -r1.4
--- search.tt 4 Feb 2002 07:16:43 -0000 1.3
+++ search.tt 3 Mar 2002 11:27:22 -0000 1.4
@@ -15,6 +15,7 @@
[% PROCESS search_form %]
[% PROCESS nav_bar %]
[% PROCESS results_list %]
+ [% IF search.navigation('hits') > search.config('page_size');
PROCESS nav_bar; END %]
[% END %]
[% END %]
1.3 +222 -30 modperl-docs/src/search/spider.pl
Index: spider.pl
===================================================================
RCS file: /home/cvs/modperl-docs/src/search/spider.pl,v
retrieving revision 1.2
retrieving revision 1.3
diff -u -r1.2 -r1.3
--- spider.pl 31 Jan 2002 01:51:50 -0000 1.2
+++ spider.pl 3 Mar 2002 11:27:22 -0000 1.3
@@ -2,7 +2,7 @@
use strict;
-# $Id: spider.pl,v 1.2 2002/01/31 01:51:50 stas Exp $
+# $Id: spider.pl,v 1.3 2002/03/03 11:27:22 stas Exp $
#
# "prog" document source for spidering web servers
#
@@ -23,7 +23,7 @@
use HTML::Tagset;
use vars '$VERSION';
-$VERSION = sprintf '%d.%02d', q$Revision: 1.2 $ =~ /: (\d+)\.(\d+)/;
+$VERSION = sprintf '%d.%02d', q$Revision: 1.3 $ =~ /: (\d+)\.(\d+)/;
use vars '$bit';
use constant DEBUG_ERRORS => $bit = 1; # program errors
@@ -36,10 +36,13 @@
use constant MAX_SIZE => 5_000_000; # Max size of document to fetch
+use constant MAX_WAIT_TIME => 30; # request time.
#Can't locate object method "host" via package "URI::mailto" at
../prog-bin/spider.pl line 473.
#sub URI::mailto::host { return '' };
+
+# This is not the right way to do this.
sub UNIVERSAL::host { '' };
sub UNIVERSAL::port { '' };
sub UNIVERSAL::host_port { '' };
@@ -62,7 +65,7 @@
print STDERR "$0: Reading parameters from '$config'\n";
my $abort;
- local $SIG{HUP} = sub { $abort++ };
+ local $SIG{HUP} = sub { warn "Caught SIGHUP\n"; $abort++ } unless $^O =~
/Win32/i;
my %visited; # global -- I suppose would be smarter to localize it per
server.
@@ -74,8 +77,9 @@
die "You must specify 'base_url' in your spider config
settings\n";
}
- for (ref $s->{base_url} eq 'ARRAY' ? @{$s->{base_url}} :
$s->{base_url} ) {
- $s->{base_url} = $_;
+ my @urls = ref $s->{base_url} eq 'ARRAY' ? @{$s->{base_url}} :(
$s->{base_url});
+ for my $url ( @urls ) {
+ $s->{base_url} = $url;
process_server( $s );
}
}
@@ -100,12 +104,18 @@
# set defaults
$server->{debug} ||= 0;
- $server->{debug} = 0 unless $server->{debug} =~ /^\d+$/;
+ die "debug parameter '$server->{debug}' must be a number\n" unless
$server->{debug} =~ /^\d+$/;
$server->{max_size} ||= MAX_SIZE;
die "max_size parameter '$server->{max_size}' must be a number\n" unless
$server->{max_size} =~ /^\d+$/;
+
+ $server->{max_wait_time} ||= MAX_WAIT_TIME;
+ die "max_wait_time parameter '$server->{max_wait_time}' must be a
number\n" if $server->{max_wait_time} !~ /^\d+/;
+
+
+
$server->{link_tags} = ['a'] unless ref $server->{link_tags} eq 'ARRAY';
$server->{link_tags_lookup} = { map { lc, 1 } @{$server->{link_tags}} };
@@ -139,14 +149,32 @@
my $uri = URI->new( $server->{base_url} );
$uri->fragment(undef);
+ if ( $uri->userinfo ) {
+ die "Can't specify parameter 'credentials' because base_url defines
them\n"
+ if $server->{credentials};
+ $server->{credentials} = $uri->userinfo;
+ $uri->userinfo( undef );
+ }
+
+
print STDERR "\n -- Starting to spider: $uri --\n" if $server->{debug};
# set the starting server name (including port) -- will only spider on
server:port
- $server->{authority} = $uri->authority;
- $server->{same} = [ $uri->authority ];
+
+ # All URLs will end up with this host:port
+ $server->{authority} = $uri->canonical->authority;
+
+ # All URLs must match this scheme ( Jan 22, 2002 - spot by Darryl
Friesen )
+ $server->{scheme} = $uri->scheme;
+
+
+
+ # Now, set the OK host:port names
+ $server->{same} = [ $uri->canonical->authority ];
+
push @{$server->{same}}, @{$server->{same_hosts}} if ref
$server->{same_hosts};
$server->{same_host_lookup} = { map { $_, 1 } @{$server->{same}} };
@@ -169,8 +197,9 @@
my $ua;
+
if ( $server->{ignore_robots_file} ) {
- $ua = LWP::UserAgent->new( );
+ $ua = LWP::UserAgent->new;
return unless $ua;
$ua->agent( $server->{agent} );
$ua->from( $server->{email} );
@@ -181,6 +210,9 @@
$ua->delay( $server->{delay_min} || 0.1 );
}
+ # Set the timeout on the server and using Windows.
+ $ua->timeout( $server->{max_wait_time} ) if $^O =~ /Win32/i;
+
$server->{ua} = $ua; # save it for fun.
# $ua->parse_head(0); # Don't parse the content
@@ -224,6 +256,56 @@
}
}
+
+#-----------------------------------------------------------------------
+# Deal with Basic Authen
+
+
+
+# Thanks Gisle!
+sub get_basic_credentials {
+ my($uri, $server, $realm ) = @_;
+ my $netloc = $uri->canonical->host_port;
+
+ my ($user, $password);
+
+ eval {
+ local $SIG{ALRM} = sub { die "timed out\n" };
+ alarm( $server->{credential_timeout} || 30 ) unless $^O =~ /Win32/i;
+
+ if ( $uri->userinfo ) {
+ print STDERR "\nSorry: invalid username/password\n";
+ $uri->userinfo( undef );
+ }
+
+
+ print STDERR "Need Authentication for $uri at realm
'$realm'\n(<Enter> skips)\nUsername: ";
+ $user = <STDIN>;
+ chomp($user);
+ die "No Username specified\n" unless length $user;
+
+ alarm( $server->{credential_timeout} || 30 ) unless $^O =~ /Win32/i;
+
+ print STDERR "Password: ";
+ system("stty -echo");
+ $password = <STDIN>;
+ system("stty echo");
+ print STDERR "\n"; # because we disabled echo
+ chomp($password);
+
+ alarm( 0 ) unless $^O =~ /Win32/i;
+ };
+
+ return if $@;
+
+ return join ':', $user, $password;
+
+
+}
+
+
+
+
#----------- Non recursive spidering ---------------------------
sub spider {
@@ -275,9 +357,30 @@
$server->{no_index} = 0;
$server->{no_spider} = 0;
+
+ # Set basic auth if defined - use URI specific first, then credentials
+ if ( my ( $user, $pass ) = split /:/, ( $uri->userinfo ||
$server->{credentials} || '' ) ) {
+ $request->authorization_basic( $user, $pass );
+ }
+
+
+
+
my $been_here;
my $callback = sub {
+ # Reset alarm;
+ alarm( $server->{max_wait_time} ) unless $^O =~ /Win32/i;
+
+
+ # Cache user/pass
+ if ( $server->{cur_realm} && $uri->userinfo ) {
+ my $key = $uri->canonical->host_port . ':' .
$server->{cur_realm};
+ $server->{auth_cache}{$key} = $uri->userinfo;
+ }
+
+ $uri->userinfo( undef ) unless $been_here;
+
die "test_response" if !$been_here++ && !check_user_function(
'test_response', $uri, $server, $_[1], \$_[0] );
@@ -290,12 +393,55 @@
};
- my $response = $ua->simple_request( $request, $callback, 4096 );
+ my $response;
+
+ eval {
+ local $SIG{ALRM} = sub { die "timed out\n" };
+ alarm( $server->{max_wait_time} ) unless $^O =~ /Win32/i;
+ $response = $ua->simple_request( $request, $callback, 4096 );
+ alarm( 0 ) unless $^O =~ /Win32/i;
+ };
return if $server->{abort};
+ if ( $response && $response->code == 401 &&
$response->header('WWW-Authenticate') && $response->header('WWW-Authenticate')
=~ /realm="([^"]+)"/i ) {
+ my $realm = $1;
+
+ my $user_pass;
+
+ # Do we have a cached user/pass for this realm?
+ my $key = $uri->canonical->host_port . ':' . $realm;
+
+ if ( $user_pass = $server->{auth_cache}{$key} ) {
+
+ # If we didn't just try it, try again
+ unless( $uri->userinfo && $user_pass eq $uri->userinfo ) {
+ $uri->userinfo( $user_pass );
+ return process_link( $server, $uri, $parent, $depth );
+ }
+ }
+
+ # otherwise, prompt:
+
+
+ if ( $user_pass = get_basic_credentials( $uri, $server, $realm ) ) {
+ $uri->userinfo( $user_pass );
+
+ $server->{cur_realm} = $realm; # save so we can cache
+ my $links = process_link( $server, $uri, $parent, $depth );
+ delete $server->{cur_realm};
+
+ return $links;
+ }
+ print STDERR "Skipping $uri\n";
+ }
+
+ $uri->userinfo( undef );
+
+
+
# Log the response
if ( ( $server->{debug} & DEBUG_URL ) || ( $server->{debug} &
DEBUG_FAILED && !$response->is_success) ) {
@@ -322,6 +468,7 @@
return;
}
+ $response->request->uri->userinfo( undef );
# skip excluded by robots.txt
@@ -501,13 +648,15 @@
# which tags to use ( not reported in debug )
- print STDERR " ?? Looking at extracted tag '$tag'\n" if
$server->{debug} & DEBUG_LINKS;
+ my $attr = join ' ', map { qq[$_="$attr{$_}"] } keys %attr;
+
+ print STDERR "\nLooking at extracted tag '<$tag $attr>'\n" if
$server->{debug} & DEBUG_LINKS;
unless ( $server->{link_tags_lookup}{$tag} ) {
# each tag is reported only once per page
print STDERR
- " ?? <$tag> skipped because not one of (",
+ " <$tag> skipped because not one of (",
join( ',', @{$server->{link_tags}} ),
")\n" if $server->{debug} & DEBUG_LINKS &&
!$skipped_tags{$tag}++;
@@ -539,18 +688,14 @@
next unless check_link( $u, $server, $base, $tag, $attribute
);
push @links, $u;
- print STDERR qq[ ++ <$tag $attribute="$u"> Added to list of
links to follow\n] if $server->{debug} & DEBUG_LINKS;
+ print STDERR qq[ $attribute="$u" Added to list of links to
follow\n] if $server->{debug} & DEBUG_LINKS;
$found++;
}
}
if ( !$found && $server->{debug} & DEBUG_LINKS ) {
- my $s = "<$tag";
- $s .= ' ' . qq[$_="$attr{$_}"] for sort keys %attr;
- $s .= '>';
-
- print STDERR " ?? tag $s did not include any links to follow\n";
+ print STDERR " tag did not include any links to follow or is a
duplicate\n";
}
}
@@ -599,15 +744,15 @@
# Here we make sure we are looking at a link pointing to the correct (or
equivalent) host
- unless ( $server->{same_host_lookup}{$u->authority} ) {
+ unless ( $server->{scheme} eq $u->scheme &&
$server->{same_host_lookup}{$u->canonical->authority} ) {
- print STDERR qq[ ?? <$tag $attribute="$u"> skipped because different
authority (server:port)\n] if $server->{debug} & DEBUG_LINKS;
+ print STDERR qq[ ?? <$tag $attribute="$u"> skipped because different
host\n] if $server->{debug} & DEBUG_LINKS;
$server->{counts}{'Off-site links'}++;
validate_link( $server, $u, $base ) if $server->{validate_links};
return;
}
- $u->authority( $server->{authority} ); # Force all the same host name
+ $u->host_port( $server->{authority} ); # Force all the same host name
# Allow rejection of this URL by user function
@@ -661,10 +806,10 @@
my $request = HTTP::Request->new('HEAD', $uri->canonical );
eval {
- $SIG{ALRM} = sub { die "timed out\n" };
- alarm 5;
+ local $SIG{ALRM} = sub { die "timed out\n" };
+ alarm( $server->{max_wait_time} ) unless $^O =~ /Win32/i;
$response = $ua->simple_request( $request );
- alarm 0;
+ alarm( 0 ) unless $^O =~ /Win32/i;
};
if ( $@ ) {
@@ -729,7 +874,6 @@
}
sub default_urls {
- die "$0: Must list URLs when using 'default'\n" unless @ARGV;
my $validate = 0;
if ( $ARGV[0] eq 'validate' ) {
@@ -737,6 +881,9 @@
$validate = 1;
}
+ die "$0: Must list URLs when using 'default'\n" unless @ARGV;
+
+
my @content_types = qw{ text/html text/plain };
return map {
@@ -786,9 +933,18 @@
},
);
- begin indexing:
+Begin indexing:
+
swish-e -S prog -c swish.config
+Note: When running on some versions of Windows (e.g. Win ME and Win 98 SE)
+you may need to index using the command:
+
+ perl spider.pl | swish-e -S prog -c swish.conf -i stdin
+
+This pipes the output of the spider directly into swish.
+
+
=head1 DESCRIPTION
This is a swish-e "prog" document source program for spidering
@@ -1013,6 +1169,19 @@
base_url => [qw! http://swish-e.org/
http://othersite.org/other/index.html !],
+You may specify a username and password:
+
+ base_url => 'http://user:[EMAIL PROTECTED]/index.html',
+
+but you may find that to be a security issue. If a URL is protected by
Basic Authentication
+you will be prompted for a username and password. This might be a slighly
safer way to go.
+
+The parameter C<max_wait_time> controls how long to wait for user entry
before skipping the
+current URL.
+
+See also C<credentials> below.
+
+
=item same_hosts
This optional key sets equivalent B<authority> name(s) for the site you are
spidering.
@@ -1034,9 +1203,9 @@
http://www.mysite.edu/path/to/file.html
-Note: This should probably be called B<same_authority> because it compares
the URI C<authority>
+Note: This should probably be called B<same_host_port> because it compares
the URI C<host:port>
against the list of host names in C<same_hosts>. So, if you specify a port
name in you will
-probably want to specify the port name in the the list of hosts in
C<same_hosts>:
+want to specify the port name in the the list of hosts in C<same_hosts>:
my %serverA = (
base_url => 'http://sunsite.berkeley.edu:4444/',
@@ -1076,6 +1245,16 @@
but in general you will probably want it much smaller. But, check with
the webmaster before using too small a number.
+=item max_wait_time
+
+This setting is the number of seconds to wait for data to be returned from
+the request. Data is returned in chunks to the spider, and the timer is
reset each time
+a new chunk is reported. Therefore, documents (requests) that take longer
than this setting
+should not be aborted as long as some data is received every max_wait_time
seconds.
+The default it 30 seconds.
+
+NOTE: This option has no effect on Windows.
+
=item max_time
This optional key will set the max minutes to spider. Spidering
@@ -1204,6 +1383,19 @@
Just a hack. If you set this true the spider will do HEAD requests all
links (e.g. off-site links), just
to make sure that all your links work.
+=item credentials
+
+You may specify a username and password to be used automatically when
spidering:
+
+ credentials => 'username:password',
+
+A username and password supplied in a URL will override this setting.
+
+=item credential_timeout
+
+Sets the number of seconds to wait for user input when prompted for a
username or password.
+The default is 30 seconds.
+
=back
=head1 CALLBACK FUNCTIONS
@@ -1445,7 +1637,7 @@
files to index only the document titles.
As shown above, you can turn this feature on for specific documents by
setting a flag in
-the server hash passed into the C<test_response> or C<filter_contents>
subroutines.
+the server hash passed into the C<test_response> or C<filter_content>
subroutines.
For example, in your configuration file you might have the C<test_response>
callback set
as:
@@ -1466,7 +1658,7 @@
HTML I<and> a title is found in the html document.
Note: In most cases you probably would not want to send a large binary file
to swish, just
-to be ignored. Therefore, it would be smart to use a C<filter_contents>
callback routine to
+to be ignored. Therefore, it would be smart to use a C<filter_content>
callback routine to
replace the contents with single character (you cannot use the empty string
at this time).
A similar flag may be set to prevent indexing a document at all, but still
allow spidering.
1.3 +978 -314 modperl-docs/src/search/swish.cgi
Index: swish.cgi
===================================================================
RCS file: /home/cvs/modperl-docs/src/search/swish.cgi,v
retrieving revision 1.2
retrieving revision 1.3
diff -u -r1.2 -r1.3
--- swish.cgi 4 Feb 2002 09:19:39 -0000 1.2
+++ swish.cgi 3 Mar 2002 11:27:22 -0000 1.3
@@ -2,17 +2,20 @@
package SwishSearch;
use strict;
-use lib qw( modules ); ### This must be adjusted!
+use lib qw( modules ); ### This may need to be adjusted!
+ ### It should point to the location of the
+ ### associated script modules directory
-####################################################################################
+
+###################################################################################
#
# If this text is displayed on your browser then your web server
# is not configured to run .cgi programs. Contact your web server
administrator.
#
# To display documentation for this program type "perldoc swish.cgi"
#
-# swish.cgi $Revision: 1.2 $ Copyright (C) 2001 Bill Moseley [EMAIL
PROTECTED]
+# swish.cgi $Revision: 1.3 $ Copyright (C) 2001 Bill Moseley [EMAIL
PROTECTED]
# Example CGI program for searching with SWISH-E
#
# This example program will only run under an OS that supports fork().
@@ -31,14 +34,13 @@
#
# The above lines must remain at the top of this program
#
-# $Id: swish.cgi,v 1.2 2002/02/04 09:19:39 stas Exp $
+# $Id: swish.cgi,v 1.3 2002/03/03 11:27:22 stas Exp $
#
####################################################################################
# This is written this way so the script can be used as a CGI script or a
mod_perl
# module without any code changes.
-
# use CGI (); # might not be needed if using Apache::Request
#=================================================================================
@@ -59,50 +61,11 @@
}
-#=================================================================================
-# mod_perl entry point
-#
-# As an example, you might use a PerlSetVar to point to paths to different
-# config files, and then cache the different configurations by path.
-#
-#=================================================================================
-
-my %cached_configs;
-
-sub handler {
- my $r = shift;
-
- if ( my $config_path = $r->dir_config( 'Swish_Conf_File' ) ) {
-
- # Already cached?
- if ( $cached_configs{ $config_path } ) {
- process_request( $cached_configs{ $config_path } );
- return Apache::Constants::OK();
- }
-
- # Else, load config
- my $config = default_config();
- $config->{config_file} = $config_path;
-
- # Merge with disk config file.
- $cached_configs{ $config_path } = merge_read_config( $config );
-
- process_request( $cached_configs{ $config_path } );
- return Apache::Constants::OK();
- }
-
-
- # Otherwise, use hard-coded config
- process_request( default_config() );
-
- return Apache::Constants::OK();
-
-}
-
#==================================================================================
-# This sets the default configuration
+# This sets the default configuration parameters
+#
# Any configuration read from disk is merged with these settings.
#
# Only a few settings are actually required. Some reasonable defaults are
used
@@ -140,18 +103,13 @@
sub default_config {
- # make the search of the swish-e executable more flexible. First
- # search in the PATH, then in the current dir.
- my $exec = `which swish-e`;
-#warn "found exec: $exec";
- chomp $exec;
- $exec ||= './swish-e';
- die "Cannot find swish-e" unless -x $exec;
+
##### Configuration Parameters #########
#---- This lists all the options, with many commented out ---
# By default, this config is used -- see the process_request() call
below.
+
# You should adjust for your site, and how your swish index was created.
##>>
@@ -159,11 +117,24 @@
##>>
##>> Send a small example, without all the comments.
- # Items beginning with an "x" or "#" are commented out
-
+ #======================================================================
+ # NOTE: Items beginning with an "x" or "#" are commented out
+ # the "x" form simply renames (hides) that setting. It's used
+ # to make it easy to disable a mult-line configuation setting.
+ #
+ # If you do not understand a setting then best to leave the default.
+ #
+ # Please follow the documentation (perldoc swish.cgi) and set up
+ # a test using the defaults before making changes. It's much easier
+ # to modify a working example than to try to get a modified example to
work ;)
+ #
+ # Again, this is a Perl hash structure. Commas are important.
+ #======================================================================
+
return {
- title => 'Search our site', # Title of your choice.
- swish_binary => $exec, # Location of swish-e binary
+ title => 'Search our site', # Title of your choice.
Displays on the search page
+ swish_binary => './swish-e', # Location of swish-e binary
+
# By default, this script tries to read a config file. You should
probably
# comment this out if not used save a disk stat
@@ -175,7 +146,7 @@
# If you have more than one index to search then specify an array
# reference. e.g. swish_index =>[ qw/ index1 index2 index3 /],
- swish_index => 'index.swish-e', # Location of your index file
+ swish_index => 'index.swish-e', # Location of your index file
# See "select_indexes" below
for how to
# select more than one index.
@@ -188,6 +159,8 @@
# But you can specify any PropertyName defined in your document.
# By default, swish will return the pathname for documents that do
not
# have a title.
+ # In other words, this is used for the text of the links of the
search results.
+ # <a href="prepend_path/swishdocpath">title_property</a>
title_property => 'swishtitle',
@@ -283,6 +256,9 @@
timeout => 10, # limit time used by swish when fetching
results - DoS protection.
+ max_query_length => 100, # limit length of query string. Swish
also has a limit (default is 40)
+ # You might want to set swish-e's limit
higher, and use this to get a
+ # somewhat more friendly message.
# These settings will use some crude highlighting code to highlight
search terms in the
@@ -337,7 +313,7 @@
#swish_index => [ qw/ index.swish-e index.other index2.other
index3.other index4.other / ],
Xselect_indexes => {
- #method => 'radio_group', # pico radio_group, popup_menu, or
checkbox_group
+ #method => 'radio_group', # pick radio_group, popup_menu, or
checkbox_group
method => 'checkbox_group',
#method => 'popup_menu',
columns => 3,
@@ -357,7 +333,7 @@
Xselect_by_meta => {
- #method => 'radio_group', # pico radio_group, popup_menu,
or checkbox_group
+ #method => 'radio_group', # pick: radio_group, popup_menu,
or checkbox_group
method => 'checkbox_group',
#method => 'popup_menu',
columns => 3,
@@ -409,12 +385,42 @@
},
+
+ # The "on_intranet" setting is just a flag that can be used to say
you do
+ # not have an external internet connection. It's here because the
default
+ # page generation includes links to images on swish-e.or and on
www.w3.org.
+ # If this is set to one then those images will not be shown.
+ # (This only effects the default ouput module TemplateDefault)
+
+ on_intranet => 0,
+
+
+
+ # Here you can hard-code debugging options. The will help you find
+ # where you made your mistake ;)
+ # Using all at once will generate a lot of messages to STDERR
+ # Please see the documentation before using these.
+ # Typically, you will set these from the command line instead of in
the configuration.
+
+ # debug_options => 'basic, command, headers, output, summary, dump',
+
+
+
# This defines the package object for reading CGI parameters
# Defaults to CGI. Might be useful with mod_perl.
# request_package => 'CGI',
# request_package => 'Apache::Request',
+
+ # Minor adjustment to page display. The page navigation normally
looks like:
+ # Page: 1 5 6 7 8 9 24
+ # where the first page and last page are always displayed. These
can be disabled by
+ # by setting to true values ( 1 )
+
+ no_first_page_navigation => 0,
+ no_last_page_navigation => 0,
+
@@ -458,6 +464,52 @@
}
+#^^^^^^^^^^^^^^^^^^^^^^^^^ end of user config
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+#========================================================================================
+
+
+
+#=================================================================================
+# mod_perl entry point
+#
+# As an example, you might use a PerlSetVar to point to paths to different
+# config files, and then cache the different configurations by path.
+#
+#=================================================================================
+
+my %cached_configs;
+
+sub handler {
+ my $r = shift;
+
+ if ( my $config_path = $r->dir_config( 'Swish_Conf_File' ) ) {
+
+ # Already cached?
+ if ( $cached_configs{ $config_path } ) {
+ process_request( $cached_configs{ $config_path } );
+ return Apache::Constants::OK();
+ }
+
+ # Else, load config
+ my $config = default_config();
+ $config->{config_file} = $config_path;
+
+ # Merge with disk config file.
+ $cached_configs{ $config_path } = merge_read_config( $config );
+
+ process_request( $cached_configs{ $config_path } );
+ return Apache::Constants::OK();
+ }
+
+
+ # Otherwise, use hard-coded config
+ process_request( default_config() );
+
+ return Apache::Constants::OK();
+
+}
+
+
#============================================================================
# Read config settings from disk, and merge
# Note, all errors are ignored since by default this script looks for a
@@ -467,16 +519,82 @@
sub merge_read_config {
my $config = shift;
+ set_default_debug_flags();
+
+ set_debug($config); # get from config or from %ENV
+
return $config unless $config->{config_file};
my $return = do $config->{config_file};
return $config unless ref $return eq 'HASH';
+ if ( $config->{debug} || $return->{debug} ) {
+ require Data::Dumper;
+ print STDERR "\n---------- Read config parameters from
'$config->{config_file}' ------\n",
+ Data::Dumper::Dumper($return),
+ "-------------------------\n";
+ }
+
+ set_debug( $return );
+
+
# Merge settings
return { %$config, %$return };
}
+#--------------------------------------------------------------------------------------------------
+sub set_default_debug_flags {
+ # Debug flags defined
+
+ $SwishSearch::DEBUG_BASIC = 1; # Show command used to run swish
+ $SwishSearch::DEBUG_COMMAND = 2; # Show command used to run swish
+ $SwishSearch::DEBUG_HEADERS = 4; # Swish output headers
+ $SwishSearch::DEBUG_OUTPUT = 8; # Swish output besides headers
+ $SwishSearch::DEBUG_SUMMARY = 16; # Summary of results parsed
+ $SwishSearch::DEBUG_DUMP_DATA = 32; # dump data that is sent to
templating modules
+}
+
+
+
+
+#---------------------------------------------------------------------------------------------------
+sub set_debug {
+ my $conf = shift;
+
+ unless ( $ENV{SWISH_DEBUG} ||$conf->{debug_options} ) {
+ $conf->{debug} = 0;
+ return;
+ }
+
+ my %debug = (
+ basic => [$SwishSearch::DEBUG_BASIC, 'Basic debugging'],
+ command => [$SwishSearch::DEBUG_COMMAND, 'Show command used to
run swish'],
+ headers => [$SwishSearch::DEBUG_HEADERS, 'Show headers returned
from swish'],
+ output => [$SwishSearch::DEBUG_OUTPUT, 'Show output from
swish'],
+ summary => [$SwishSearch::DEBUG_SUMMARY, 'Show summary of
results'],
+ dump => [$SwishSearch::DEBUG_DUMP_DATA, 'Show all data
available to templates'],
+ );
+
+
+ $conf->{debug} = 1;
+
+ for ( split /\s*,\s*/, $ENV{SWISH_DEBUG} ) {
+ if ( exists $debug{ lc $_ } ) {
+ $conf->{debug} |= $debug{ lc $_ }->[0];
+ next;
+ }
+
+ print STDERR "Unknown debug option '$_'. Must be one of:\n",
+ join( "\n", map { sprintf(' %10s: %10s', $_, $debug{$_}->[1])
} sort { $debug{$a}->[0] <=> $debug{$b}->[0] }keys %debug),
+ "\n\n";
+ exit;
+ }
+
+ print STDERR "Debug level set to: $conf->{debug}\n";
+}
+
+
#============================================================================
#
# This is the main entry point, where a config hash is passed in.
@@ -491,17 +609,64 @@
$request_package =~ s[::][/]g;
require "$request_package.pm";
+ my $request_object = $conf->{request_package} ?
$conf->{request_package}->new : CGI->new;
+
+ if ( $conf->{debug} ) {
+ print STDERR 'Enter a query [all]: ';
+ my $query = <STDIN>;
+ $query =~ tr/\r//d;
+ chomp $query;
+ unless ( $query ) {
+ print STDERR "Using 'not asdfghjklzxcv' to match all records\n";
+ $query = 'not asdfghjklzxcv';
+ }
+
+ $request_object->param('query', $query );
+
+ print STDERR 'Enter max results to display [1]: ';
+ my $max = <STDIN>;
+ chomp $max;
+ $max = 1 unless $max && $max =~/^\d+$/;
+
+ $conf->{page_size} = $max;
+ }
+
+
# create search object
my $search = SwishQuery->new(
config => $conf,
- request => ($conf->{request_package} ?
$conf->{request_package}->new : CGI->new),
+ request => $request_object,
);
# run the query
my $results = $search->run_query; # currently, results is the just the
$search object
+ if ( $conf->{debug} ) {
+ if ( $conf->{debug} & $SwishSearch::DEBUG_DUMP_DATA ) {
+ require Data::Dumper;
+ print STDERR "\n------------- Results structure passed to
template ------------\n",
+ Data::Dumper::Dumper( $results ),
+ "--------------------------\n";
+ } elsif ( $conf->{debug} & $SwishSearch::DEBUG_SUMMARY ) {
+ print STDERR "\n------------- Results Summary ------------\n";
+ if ( $results->{hits} ) {
+ require Data::Dumper;
+ print STDERR "Showing $results->{navigation}{showing} of
$results->{navigation}{hits}\n",
+ Data::Dumper::Dumper( $results->{_results} );
+ } else {
+ print STDERR "** NO RESULTS **\n";
+ }
+
+ print STDERR "--------------------------\n";
+ } else {
+ print STDERR ( ($results->{hits} ? "Found $results->{hits}
results\n" : "Failed to find any results\n" . $results->errstr . "\n" ),"\n" );
+ }
+ }
+
+
+
my $template = $conf->{template} || { package => 'TemplateDefault' };
my $package = $template->{package};
@@ -509,7 +674,21 @@
my $file = "$package.pm";
$file =~ s[::][/]g;
- require $file;
+ eval { require $file };
+ if ( $@ ) {
+ warn "$0 [EMAIL PROTECTED]";
+ print <<EOF;
+Content-Type: text/html
+
+<html>
+<head><title>Software Error</title></head>
+<body><h2>Software Error<h2><p>Please check error log</p></body>
+</html>
+EOF
+
+ exit;
+}
+
$package->show_template( $template, $results );
}
@@ -522,6 +701,10 @@
#==================================================================================================
use Carp;
+# Or use this instead -- PLEASE see perldoc CGI::Carp for details
+# <opinion>CGI::Carp doesn't help that much</opinion>
+#use CGI::Carp; # qw(fatalsToBrowser);
+
#--------------------------------------------------------------------------------
# new() doesn't do much, just create the object
@@ -626,7 +809,6 @@
my $conf = $self->{config};
-
# Sets the query string, and any -L limits.
return $self unless $self->build_query;
@@ -656,24 +838,20 @@
- # Trap the call - not portable.
-
- my $timeout = $self->config('timeout');
-
- if ( $timeout ) {
- eval {
- local $SIG{ALRM} = sub { die "Timed out\n" };
- alarm ( $self->config('timeout') || 5 );
- $self->run_swish;
- alarm 0;
- };
+ my $timeout = $self->config('timeout') || 0;
- if ( $@ ) {
- $self->errstr( $@ );
- return $self;
- }
- } else {
+ eval {
+ local $SIG{ALRM} = sub { die "Timed out\n" };
+ alarm $timeout if $timeout && $^O !~ /Win32/i;
$self->run_swish;
+ alarm 0 unless $^O =~ /Win32/i;
+ waitpid $self->{pid}, 0 if $self->{pid}; # for IPC::Open2
+ };
+
+ if ( $@ ) {
+ warn "$0 $@"; # if $conf->{debug};
+ $self->errstr( "Service currently unavailable" );
+ return $self;
}
@@ -764,7 +942,9 @@
$self->errstr('Please enter a query string') if $q->param('submit');
return;
}
- if ( length( $query ) > 100 ) {
+
+
+ if ( length( $query ) > $self->{config}{max_query_length} ) {
$self->errstr('Please enter a shorter query');
return;
}
@@ -871,9 +1051,13 @@
eval { require DateRanges };
if ( $@ ) {
- $self->errstr( $@ );
+ print STDERR "\n------ Can't use DateRanges feature ------------\n",
+ "\nScript will run, but you can't use the date range
feature\n",
+ $@,
+ "\n--------------\n" if $conf->{debug};
+
delete $conf->{date_ranges};
- return;
+ return 1;
}
my $q = $self->{q};
@@ -931,15 +1115,17 @@
# Now set sort option - if a valid option submitted (or you could let
swish-e return the error).
my %sorts = map { $_, 1 } @$sorts_array;
- if ( $q->param('sort') && $sorts{ $q->param('sort') } ) {
+ my $sortby = $q->param('sort') || 'swishrank';
+
+ if ( $sortby && $sorts{ $sortby } ) {
- my $direction = $q->param('sort') eq 'swishrank'
+ my $direction = $sortby eq 'swishrank'
? $q->param('reverse') ? 'asc' : 'desc'
: $q->param('reverse') ? 'desc' : 'asc';
- $self->swish_command( '-s', $q->param('sort'), $direction );
+ $self->swish_command( '-s', $sortby, $direction );
- if ( $conf->{secondary_sort} && $q->param('sort') ne
$conf->{secondary_sort}[0] ) {
+ if ( $conf->{secondary_sort} && $sortby ne
$conf->{secondary_sort}[0] ) {
$self->swish_command(ref $conf->{secondary_sort} ? @{
$conf->{secondary_sort} } : $conf->{secondary_sort} );
}
@@ -1017,8 +1203,8 @@
}
@pages = $current_page..$current_page + $max_pages - 1;
- unshift @pages, 0 if $current_page;
- push @pages, $pages unless $current_page + $max_pages - 1 ==
$pages;
+ unshift @pages, 0 if $current_page &&
!$self->{config}{no_first_page_navigation};
+ push @pages, $pages unless $current_page + $max_pages - 1 ==
$pages || $self->{config}{no_last_page_navigation}
}
@@ -1080,7 +1266,6 @@
# or possibly a scalar with an error message.
#
-use Symbol;
sub run_swish {
@@ -1091,8 +1276,6 @@
my $conf = $self->{config};
my $q = $self->{q};
-
-
my @properties;
my %seen;
@@ -1116,15 +1299,10 @@
$self->swish_command( -x => join( '\t', map { "<$_>" } @properties ) .
'\n' );
$self->swish_command( -H => 9 );
- # Run swish
- my $fh = gensym;
- my $pid = open( $fh, '-|' );
+ my $fh = $^O =~ /Win32/i
+ ? windows_fork( $conf, $self )
+ : real_fork( $conf, $self );
- die "Failed to fork: $!\n" unless defined $pid;
-
- if ( !$pid ) { # in child
- exec $self->{prog}, $self->swish_command or die "Failed to exec
'$self->{prog}' Error:$!";
- }
$self->{COMMAND} = join ' ', $self->{prog}, $self->swish_command;
@@ -1142,13 +1320,20 @@
# Loop through values returned from swish.
my %stops_removed;
-
+
+ my $unknown_output = '';
+
+
while (<$fh>) {
chomp;
+ tr/\r//d;
# This will not work correctly with multiple indexes when different
values are used.
if ( /^# ([^:]+):\s+(.+)$/ ) {
+
+ print STDERR "$_\n" if $conf->{debug} &
$SwishSearch::DEBUG_HEADERS;
+
my $h = lc $1;
my $value = $2;
$self->{_headers}{$h} = $value;
@@ -1156,12 +1341,18 @@
push @{$self->{_headers}{'removed stopwords'}}, $value if $h eq
'removed stopword' && !$stops_removed{$value}++;
next;
+ } elsif ( $conf->{debug} & $SwishSearch::DEBUG_OUTPUT ) {
+ print STDERR "$_\n";
}
+
- # return errors as text
+ # return swish errors as a mesage to the script
$self->errstr($1), return if /^err:\s*(.+)/;
+ # Or, if you want to log the errors and just say "Service
Unavailable" use this:
+ #die "$1\n" if /^err:\s*(.+)/;
+
# Found a result
if ( /^\d/ ) {
@@ -1189,7 +1380,8 @@
eval { require "$package.pm" };
if ( $@ ) {
- $self->errstr( $@ );
+ $self->errstr( "Failed to load Highlighting Module -
check error log" );
+ warn "$0: $@";
$highlight = '';
next;
} else {
@@ -1216,19 +1408,93 @@
$h{$trim_prop} = substr( $h{$trim_prop}, 0, $max) . '
<b>...</b>';
}
}
+
+ next;
+ } elsif ( /^\.$/ ) {
+ last;
+
+ } else {
+ next if /^#/;
}
- # Might check for "\n." for end of results.
+ $unknown_output .= "'$_'\n";
+
+
}
+ die "Swish returned unknown output: $unknown_output\n" if
$unknown_output;
+
$self->{hits} = @results;
$self->{_results} = [EMAIL PROTECTED] if @results;
}
+#==================================================================
+# Run swish-e by forking
+#
+
+use Symbol;
+
+sub real_fork {
+ my ( $conf, $self ) = @_;
+
+
+ # Run swish
+ my $fh = gensym;
+ my $pid = open( $fh, '-|' );
+
+ die "Failed to fork: $!\n" unless defined $pid;
+
+
+
+ if ( !$pid ) { # in child
+ if ( $conf->{debug} & $SwishSearch::DEBUG_COMMAND ) {
+ print STDERR "---- Running swish with the following command and
parameters ----\n";
+ print STDERR join( " \\\n", map { /[^\/.\-\w\d]/ ? qq['$_'] :
$_ } $self->{prog}, $self->swish_command );
+ print STDERR
"\n-----------------------------------------------\n";
+ }
+
+
+ unless ( exec $self->{prog}, $self->swish_command ) {
+ warn "Child process Failed to exec '$self->{prog}' Error: $!";
+ print "Failed to exec Swish"; # send this message to parent.
+ exit;
+ }
+ }
+
+ return $fh;
+}
+
+
+#=====================================================================================
+# Windows work around
+# from perldoc perlfok -- na, that doesn't work. Try IPC::Open2
+#
+sub windows_fork {
+ my ( $conf, $self ) = @_;
+
+ if ( $conf->{debug} & $SwishSearch::DEBUG_COMMAND ) {
+ print STDERR "---- Running swish with the following command and
parameters ----\n";
+ print STDERR join( ' ', map { /[^.\-\w\d]/ ? qq["$_"] : $_ } map {
s/"/\\"/g; $_ } $self->{prog}, $self->swish_command );
+ print STDERR "\n-----------------------------------------------\n";
+ }
+
+
+ require IPC::Open2;
+ my ( $rdrfh, $wtrfh );
+
+ # Ok, I'll say it. Windows sucks.
+ my @command = map { s/"/\\"/g; $_ } $self->{prog}, $self->swish_command;
+ my $pid = IPC::Open2::open2($rdrfh, $wtrfh, @command );
+
+
+ $self->{pid} = $pid;
+
+ return $rdrfh;
+}
#=====================================================================================
# This method parses out the query from the "Parsed words" returned by swish
@@ -1347,138 +1613,309 @@
=head1 DESCRIPTION
-C<swish.cgi> is an example CGI script for searching with the SWISH-E search
engine version 2.1-dev and above.
+C<swish.cgi> is a CGI script for searching with the SWISH-E search engine
version 2.1-dev and above.
It returns results a page at a time, with matching words from the source
document highlighted, showing a
few words of content on either side of the highlighted word.
-The standard configuration should work with most swish index files.
Customization of the parameters will be
+The script is highly configurable; you can search multiple (or selectable)
indexes, limit searches to
+part of the index, allow sorting by a number of different properties, limit
results to a date range, and so on.
+
+The standard configuration (i.e. not using a config file) should work with
most swish index files.
+Customization of the parameters will be
needed if you are indexing special meta data and want to search and/or
display the meta data. The
configuration can be modified by editing this script directly, or by using a
configuration file (.swishcgi.conf
by default).
+You are strongly encouraged to get the default configuration working before
making changes. Most problems
+using this script are the result of configuration modifications.
+
The script is modular in design. Both the highlighting code and output
generation is handled by modules, which
are included in the F<example/modules> directory. This allows for easy
customization of the output without
changing the main CGI script. A module exists to generate standard HTML
output. There's also modules and
-template examples to use with the popular templating systems HTML::Template
and Template-Toolkit. This allows
+template examples to use with the popular Perl templating systems
HTML::Template and Template-Toolkit. This allows
you to tightly integrate this script with the look of an existing
template-driven web site.
+HTML::Template and Template-Toolkit are available from the CPAN
(http://search.cpan.org).
This scipt can also run basically unmodified as a mod_perl handler,
providing much better performance than
running as a CGI script.
-Due to the forking nature of this program and its use of signals,
-this script probably will not run under Windows without some modifications.
-There's plan to change this soon.
+Please read the rest of the documentation. There's a C<DEBUGGING> section,
and a C<FAQ> section.
+
+This script should work on Windows, but security may be an issue.
+
+=head1 REQUIREMENTS
+
+You should be running a reasonably current version of Perl. 5.00503 or
above is recommended (anything older
+will not be supported).
+
+If you wish to use the date range feature you will need to install the
Date::Calc module. This is available
+from http://search.cpan.org.
=head1 INSTALLATION
-Installing a CGI application is dependent on your specific web server's
configuration.
-For this discussion we will assume you are using Apache in a typical
configuration. For example,
-a common location for the DocumentRoot is C</usr/local/apache/htdocs>. If
you are installing this
-on your shell account, your DocumentRoot might be C<~yourname/public_html>.
+Here's an example installation session. Please get a simple installation
working before modifying the
+configuration file. Most problems reported for using this script have been
due to improper configuration.
-For the sake of this example we will assume the following:
+The script's default settings are setup for initial testing. By default the
settings expect to find
+most files and the swish-e binary in the same directory as the script.
- /usr/local/apache/htdocs - Document root
- /usr/local/apache/cgi-bin - CGI directory
+For I<security> reasons, once you have tested the script you will want to
change settings to limit access
+to some of these files by the web server
+(either by moving them out of web space, or using access control such as
F<.htaccess>).
+An example of using F<.htaccess> on Apache is given below.
-=head2 Move the files to their locations
+It's expected that you have already unpacked the swish-e distribution
+and built the swish-e binary (if using a source distribution).
+
+Below is a (unix) session where we create a directory, move required files
into this directory, adjust
+permissions, index some documents, and symlink into the web server.
=over 4
-=item Copy the swish.cgi file to your CGI directory
+=item 1 Move required files into their own directory.
-Most web servers have a directory where CGI programs are kept.
-Copy the C<swish.cgi> perl script into that directory if this is the case on
your
-server. You will need to provide read
-and execute permisssions to the file. Exactly what permissions are needed
again depends on
-your specific configuration. For example, under Unix:
+This assumes that swish-e was unpacked and build in the ~/swish-e directory.
- chmod 0755 swish.cgi
+ ~ >mkdir swishdir
+ ~ >cd swishdir
+ ~/swishdir >cp ~/swish-e/example/swish.cgi .
+ ~/swishdir >cp -rp ~/swish-e/example/modules .
+ ~/swishdir >cp ~/swish-e/src/swish-e .
+ ~/swishdir >chmod 755 swish.cgi
+ ~/swishdir >chmod 644 modules/*
-This gives the file owner (that's you) write access, and everyone read and
execute access.
-Note that you are not required to use a cgi-bin directory with Apache. You
may place the
-CGI script in any directory accessible via the web server and
-enable it as a CGI script with something like the following
-(place either in httpd.conf or in .htaccess):
+=item 2 Create an index
- <Files swish.cgi>
- Allow from all
- SetHandler cgi-script
- Options +ExecCGI
- </Files>
+This step you will create a simple configuration file. In this example the
Apache documentation
+is indexed. Last we run a simple query to test swish.
-Using this method you don't even need to use the C<.cgi> extension. For
example, rename
-the script to "search" and then use that in the C<Files> directive. Take to
your web
-administrator for further information.
-
-=item Copy the modules directory
-
-Copying the modules directory is optional, but the script needs to find
additional modules so you will
-need to edit the script to point to the modules directory. Unlike CPAN
modules that need to
-be uncompressed, built, and installed, all you need to do is make sure the
modules are some place where
-the web server can read them. You may decide to leave them where you
uncompressed the swish-e distribution,
-or you may wish to move them to your perl library.
+ ~/swishdir >cat swish.conf
+ IndexDir /usr/local/apache/htdocs
+ IndexOnly .html .htm
+ DefaultContents HTML
+ StoreDescription HTML <body> 200000
+ MetaNames swishdocpath swishtitle
-=head1 CONFIGURATION
+ ~/swishdir >./swish-e -c swish.conf
+ Indexing Data Source: "File-System"
+ Indexing "/usr/local/apache/htdocs"
+ Removing very common words...
+ no words removed.
+ Writing main index...
+ Sorting words ...
+ Sorting 7005 words alphabetically
+ Writing header ...
+ Writing index entries ...
+ Writing word text: Complete
+ Writing word hash: Complete
+ Writing word data: Complete
+ 7005 unique words indexed.
+ 5 properties sorted.
+ 124 files indexed. 1485844 total bytes. 171704 total words.
+ Elapsed time: 00:00:02 CPU time: 00:00:02
+ Indexing done!
+
+Now, verify that the index can be searched:
+
+ ~/swishdir >./swish-e -w install -m 1
+ # SWISH format: 2.1-dev-25
+ # Search words: install
+ # Number of hits: 14
+ # Search time: 0.001 seconds
+ # Run time: 0.040 seconds
+ 1000 /usr/local/apache/htdocs/manual/dso.html "Apache 1.3 Dynamic Shared
Object (DSO) support" 17341
+ .
+
+Let's see what files we have in our directory now:
+
+ ~/swishdir >ls -1 -F
+ index.swish-e
+ index.swish-e.prop
+ modules/
+ swish-e*
+ swish.cgi*
+ swish.conf
-=head2 Configure the swish.cgi program
+=item 3 Test the CGI script
-Use a text editor and open the C<swish.cgi> program.
+This is a simple step, but often overlooked. You should test from the
command line instead of jumping
+ahead and testing with the web server. See the C<DEBUGGING> section below
for more information.
-=over 4
+ ~/swishdir >./swish.cgi | head
+ Content-Type: text/html; charset=ISO-8859-1
+
+ <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
+ <html>
+ <head>
+ <title>
+ Search our site
+ </title>
+ </head>
+ <body>
+
+The above shows that the script can be run directly, and generates a correct
HTTP header and HTML.
-=item 1 Check the C<shebang> line
+If you run the above and see something like this:
-The first line of the program must point to the location of your perl
program. Typical
-examples are:
+ ~/swishdir >./swish.cgi
+ bash: ./swish.cgi: No such file or directory
+then you probably need to edit the script to point to the correct location
of your perl program.
+Here's one way to find out where perl is located (again, on unix):
+
+ ~/swishdir >which perl
+ /usr/local/bin/perl
+
+ ~/swishdir >/usr/local/bin/perl -v
+ This is perl, v5.6.0 built for i586-linux
+ ...
+
+Good! We are using a reasonably current version of perl. You should be
running
+at least perl 5.005 (5.00503 really). You will may have problems otherwise.
+
+Now that we know perl is at F</usr/local/bin/perl> we can adjust the
"shebang" line
+in the perl script (e.g. the first line of the script):
+
+ ~/swishdir >pico swish.cgi
+ (edit the #! line)
+ ~/swishdir >head -1 swish.cgi
#!/usr/local/bin/perl -w
- #!/usr/bin/perl -w
- #!/opt/perl/bin/perl -w
-=item 2 Set the perl library path
+=item 4 Test with your web server
-The script must find the modules that the script is distributed with. These
modules handle
-the highlighting of the search terms, and the output generation. Again,
where you place the
-modules is up to you, and the only requirement is that the web server can
access those files.
+How you do this is completely dependent on your web server, and you may need
to talk to your web
+server admin to get this working. Often files with the .cgi extension are
automatically set up to
+run as CGI scripts, but not always. In other words, this step is really up
to you to figure out!
-You tell perl the location of the modules with the "use lib" directive. The
default for this script is:
+First, I create a symlink in Apache's document root to point to my test
directory "swishdir". This will work
+because I know my Apache server is configured to follow symbolic links.
- use lib qw( modules );
+ ~/swishdir >su -c 'ln -s /home/bill/swishdir
/usr/local/apache/htdocs/swishdir'
+ Password: *********
-This says to look for the modules in the F<modules> directory of the current
directory.
+If your account is on an ISP and your web directory is F<~/public_html> the
you might just move the entire
+directory:
-For example, say you want to leave the modules where you unpacked the
swish-e distribution. If
-you unpacked in your home directory of F</home/yourname/swish-e> then you
must add this to the
-script:
+ mv ~/swishdir ~/public_html
- use lib qw( /home/yourname/swish-e/example/modules );
+Now, let's make a real HTTP request. I happen to have Apache setup on a
local port:
-
+ ~/swishdir >GET http://localhost:8000/swishdir/swish.cgi | head -3
+ #!/usr/local/bin/perl -w
+ package SwishSearch;
+ use strict;
-=item 3 Set the configuration parameters
+Oh, darn. It looks like Apache is not running the script and instead
returning it as a
+static page. I need to tell Apache that swish.cgi is a CGI script.
-To make things somewhat simple, the configuration parameters are included at
the top of the program.
-The parameters are all part of a perl C<hash> structure, and the comments at
the top of the program should
-get you going.
+In my case F<.htaccess> comes to the rescue:
+
+ ~/swishdir >cat .htaccess
+
+ # Deny everything by default
+ Deny From All
+
+ # But allow just CGI script
+ <files swish.cgi>
+ Options ExecCGI
+ Allow From All
+ SetHandler cgi-script
+ </files>
+
+Let's try the request one more time:
+
+ ~/swishdir >GET http://localhost:8000/swishdir/swish.cgi | head
+ <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
+ <html>
+ <head>
+ <title>
+ Search our site
+ </title>
+ </head>
+ <body>
+ <h2>
+ <a href="http://swish-e.org">
+
+That looks better! Now use your web browser to test.
+
+Make sure you look at your web server's error log file while testing the
script.
+
+BTW - "GET" is a program included with Perl's LWP library. If you do no
have this you might
+try something like:
+
+ wget -O - http://localhost:8000/swishdir/swish.cgi | head
+
+and if nothing else, you can always telnet to the web server and make a
basic request.
+
+ ~/swishtest > telnet localhost 8000
+ Trying 127.0.0.1...
+ Connected to localhost.
+ Escape character is '^]'.
+ GET /swishtest/swish.cgi http/1.0
+
+ HTTP/1.1 200 OK
+ Date: Wed, 13 Feb 2002 20:14:31 GMT
+ Server: Apache/1.3.20 (Unix) mod_perl/1.25_01
+ Connection: close
+ Content-Type: text/html; charset=ISO-8859-1
+
+ <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
+ <html>
+ <head>
+ <title>
+ Search our site
+ </title>
+ </head>
+ <body>
+
+This may seem like a lot of work compared to using a browser, but browsers
+are a poor tool for basic CGI debugging.
+
+
+=back
+
+If you have problems check the C<DEBUGGING> section below.
-You will probably need to specify at least the location of the swish-e
binary, your index file or files,
-and a title.
+=head1 CONFIGURATION
+
+If you want to change the location of the swish-e binary or the index file,
use multiple indexes, add additional metanames and properties,
+change the default highlighting behavior, etc., you will need to adjust the
script's configuration settings.
+
+Please get a test setup working with the default parameters before making
changes to any configuration settings.
+Better to debug one thing at a time...
+
+In general, you will need to adjust the script's settings to match the index
file you are searching. For example,
+if you are indexing a hypermail list archive you may want to make the script
+use metanames/properties of Subject, Author, and, Email address. Or you may
wish to provide a way to limit
+searches to parts of your index file (e.g. parts of your directory tree).
+
+To make things somewhat "simple", the configuration parameters are included
near the top of the swish.cgi program.
+That is the only place that the individual parameters are defined and
explained, so you will need to open up
+the swish.cgi script in an editor to view the options. Further questions
about individual settings should
+be referred to the swish-e discussion list.
-You have two options for changing the configuration settings from their
default:
+The parameters are all part of a perl C<hash> structure, and the comments at
the top of the program should
+get you going. The perl hash structure may seem a bit confusing, but it
makes it easy to create nested and complex
+parameters.
+
+You have two options for changing the configuration settings from their
default values:
you may edit the script directly, or you may use a configuration file. In
either case, the configuration
settings are a basic perl hash reference.
-Using a configuration file is described below.
+Using a configuration file is described below, but contains the same hash
structure.
-The configuration settings might look like:
+There are many configuration settings, and some of them are commented out
either by using
+a "#" symbol, or by simply renaming the configuration directive (e.g. by
adding an "x" to the parameter
+name).
+
+A very basic configuration setup might look like:
return {
title => 'Search the Swish-e list', # Title of your
choice.
swish_binary => './swish-e', # Location of
swish-e binary
- swish_index => '../index.swish-e', # Location of your
index file
+ swish_index => 'index.swish-e', # Location of your
index file
};
Or if searching more than one index:
@@ -1486,31 +1923,25 @@
return {
title => 'Search the Swish-e list',
swish_binary => './swish-e',
- swish_index => ['../index.swish-e', '../index2'],
+ swish_index => ['index.swish-e', 'index2'],
};
-Both of these examples return a reference to a perl hash ( C<return {...}>
). Again, this same format is
-used either at the top of this program, or in a configuration file.
-
-The examples above place the swish index file(s)
-in the directory above the C<swish.cgi> CGI script. If using the example
paths above
-of C</usr/local/apache/cgi-bin> for the CGI bin directory, that means that
the index file
-is in C</usr/local/apache>. That places the index out of web space (e.g.
cannot be accessed
-via the web server), yet relative to where the C<swish.cgi> script is
located.
+Both of these examples return a reference to a perl hash ( C<return {...}>
). In the second example,
+the multiple index files are set as an array reference.
-(If running under mod_perl you will most likely specify absolute paths for
your index files.)
+Note that in the example above the swish-e binary file is relative to the
current directory.
+If running under mod_perl you will typically need to use absolute paths.
-There's more than one way to do it, of course.
-One option is to place the index in the same directory as the <swish.cgi>
script, but
-then be sure to use your web server's configuration to prohibit access to
the index directly.
+B<Using A Configuration File>
-Another common option is to maintain a separate directory of the all your
swish index files. This decision is
-up to you.
-
-As mentioned above, you can either edit this script directly and modify the
configuration settings, or
+As mentioned above, you can either edit the F<swish.cgi> script directly and
modify the configuration settings, or
use an external configuration file. The settings in the configuration file
are merged with (override)
the settings defined in the script.
+The advantage of using a configuration script is that you are not editing
the swish.cgi script directly, and
+downloading a new version won't mean re-editing the cgi script. Also, if
running under mod_perl you can use the same
+script loaded into Apache to manage many different search pages.
+
By default, the script will attempt to read from the file F<.swishcgi.conf>.
For example, you might only wish to change the title used
in the script. Simply create a file called F<.swishcgi.conf> in the same
directory as the CGI script:
@@ -1521,62 +1952,19 @@
title => 'Search Our Mailing List Archive',
};
-Look at the default configuration settings at the top of this program for
information on the available settings.
-
-=item 4 Create your index
-
-You must index your web site before you can begin to use the C<swish.cgi>
script.
-Create a configuration file called C<swish.conf> in the directory where you
will store
-the index file.
-
-This next example uses the file system to index your web documents.
-In general, you will probably wish to I<spider> your web site if your web
pages do not
-map exactly to your file system, and to only index files available from
links on you web
-site.
-
-See B<Spidering> below for more information.
-
-Example C<swish.conf> file:
-
- # Define what to index
- IndexDir /usr/local/apache/htdocs
- IndexOnly .html .htm
-
- # Tell swish how to parse .html and .html documents
- IndexContents HTML .html .htm
- # And just in case we have files without an extension
- DefaultContents HTML
-
- # Replace the path name with a URL
- ReplaceRules replace /usr/local/apache/htdocs/ http://www.myserver.name/
-
- # Allow limiting search to titles and URLs.
- MetaNames swishdocpath swishtitle
-
- # Optionally use stemming for "fuzzy" searches
- #UseStemming yes
-
-Now to index you simply run:
-
- swish-e -c swish.conf
-
-The default index file C<index.swish-e> will be placed in the current
directory.
-
-Note that the above swish-e configuration defines two MetaNames
"swishdocpath" and "swishtitle".
-This allows searching just the document path or the title instead of the
document's content.
+The settings you use will depend on the index you create with swish. Here's
a basic configuration:
-Here's an expanded C<swish.cgi> configuration to make use of the above
settings used while indexing:
-
- return {
+ return {
title => 'Search the Apache documentation',
swish_binary => './swish-e',
swish_index => 'index.swish-e',
metanames => [qw/swishdefault swishdocpath swishtitle/],
- display_props => [qw/swishlastmodified swishdocsize swishdocpath/],
- title_property => 'swishtitle', # Not required, but recommended
+ display_props => [qw/swishtitle swishlastmodified swishdocsize
swishdocpath/],
+ title_property => 'swishdocpath',
+ prepend_path => 'http://myhost/apachedocs',
name_labels => {
- swishdefault => 'Body & Title',
+ swishdefault => 'Search All',
swishtitle => 'Title',
swishrank => 'Rank',
swishlastmodified => 'Last Modified Date',
@@ -1595,56 +1983,365 @@
The parameter "name_labels" is a hash (reference)
that is used to give friendly names to the metanames.
-Swish-e can store part of all of the contents of the documents as they are
indexed, and this
-"document description" can be returned with search results.
+Here's another example. Say you want to search either (or both) the Apache
1.3 documentation or the
+Apache 2.0 documentation:
+
+ return {
+ title => 'Search the Apache Documentation',
+ date_ranges => 0,
+ swish_index => [ qw/ index.apache index.apache2 / ],
+ select_indexes => {
+ method => 'checkbox_group',
+ labels => [ '1.3.23 docs', '2.0 docs' ], # Must match up
one-to-one to swish_index
+ description => 'Select: ',
+ },
+
+ };
+
+Now you can select either or both sets of documentation while searching.
+
+
+Please refer to the default configuration settings near the top of the
script for details on
+the available settings.
+
+=head1 DEBUGGING
+
+Most problems with using this script have been a result of improper
configuration. Please
+get the script working with default settings before adjusting the
configuration settings.
+
+The key to debugging CGI scripts is to run them from the command line, not
with a browser.
+
+First, make sure the program compiles correctly:
+
+ > perl -c swish.cgi
+ swish.cgi syntax OK
+
+Next, simply try running the program:
+
+ > ./swish.cgi | head
+ Content-Type: text/html; charset=ISO-8859-1
+
+ <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
+ <html>
+ <head>
+ <title>
+ Search our site
+ </title>
+ </head>
+ <body>
+
+Now, you know that the program compiles and will run from the command line.
+Next, try accessing the script from a web browser.
+
+If you see the contents of the CGI script instead of its output then your
web server is
+not configured to run the script. You will need to look at settings like
ScriptAlias, SetHandler,
+and Options.
+
+If an error is reported (such as Internal Server Error or Forbidden)
+you need to locate your web server's error_log file
+and carefully read what the problem is. Contact your web administrator for
help.
+
+If you don't have access to the web server's error_log file, you can modify
the script to report
+errors to the browser screen. Open the script and search for "CGI::Carp".
(Author's suggestion is
+to debug from the command line -- adding the browser and web server into the
equation only complicates
+debugging.)
+
+The script does offer some basic debugging options that allow debugging from
the command line.
+The debugging options are enabled by setting
+an environment variable "SWISH_DEBUG". How that is set depends on your
operating system and the
+shell you are using. These examples are using the "bash" shell syntax.
+
+Note: You can also use the "debug_options" configuration setting, but the
recommended method
+is to set the environment variable.
+
+You can list the available debugging options like this:
+
+ >SWISH_DEBUG=help ./swish.cgi >outfile
+ Unknown debug option 'help'. Must be one of:
+ basic: Basic debugging
+ command: Show command used to run swish
+ headers: Show headers returned from swish
+ output: Show output from swish
+ summary: Show summary of results
+ dump: Show all data available to templates
+
+As you work yourself down the list you will get more detail output. You can
combine
+options like:
+
+ >SWISH_DEBUG=command,headers,summary ./swish.cgi >outfile
+
+You will be asked for an input query and the max number of results to
return. You can use the defaults
+in most cases. It's a good idea to redirect output to a file. Any error
messages are sent to stderr, so
+those will still be displayed (unless you redirect stderr, too).
+
+Here are some examples:
+
+ ~/swishtest >SWISH_DEBUG=basic ./swish.cgi >outfile
+ Debug level set to: 1
+ Enter a query [all]:
+ Using 'not asdfghjklzxcv' to match all records
+ Enter max results to display [1]:
+
+ ------ Can't use DateRanges feature ------------
+
+ Script will run, but you can't use the date range feature
+ Can't locate Date/Calc.pm in @INC (@INC contains: modules
/usr/local/lib/perl5/5.6.0/i586-linux /usr/local/lib/perl5/5.6.0
/usr/local/lib/perl5/site_perl/5.6.0/i586-linux
/usr/local/lib/perl5/site_perl/5.6.0
/usr/local/lib/perl5/site_perl/5.005/i586-linux
/usr/local/lib/perl5/site_perl/5.005 /usr/local/lib/perl5/site_perl .) at
modules/DateRanges.pm line 107, <STDIN> line 2.
+ BEGIN failed--compilation aborted at modules/DateRanges.pm line 107,
<STDIN> line 2.
+ Compilation failed in require at ./swish.cgi line 971, <STDIN> line 2.
+
+ --------------
+ Can't exec "./swish-e": No such file or directory at ./swish.cgi line
1245, <STDIN> line 2.
+ Child process Failed to exec './swish-e' Error: No such file or
directory at ./swish.cgi line 1246, <STDIN> line 2.
+ Failed to find any results
+
+The above told me about two problems. First, it's telling me that the
Date::Calc module is not installed.
+The Date::Calc module is needed to use the date limiting feature of the
script.
+
+The second problem is a bit more serious. It's saying that the script can't
find the swish-e binary file.
+I simply forgot to copy it.
+
+ ~/swishtest >cp ~/swish-e/src/swish-e .
+ ~/swishtest >cat .swishcgi.conf
+ return {
+ title => 'Search the Apache Documentation',
+ date_ranges => 0,
+ };
+
+Now, let's try again:
- # Store the text of the documents within the swish index file
- StoreDescription HTML <body> 100000
+ ~/swishtest >SWISH_DEBUG=basic ./swish.cgi >outfile
+ Debug level set to: 1
-Adding the above to your C<swish.conf> file tells swish-e to store up to
100,000 characters from the body of each document within the
-swish-e index. To display this information in search results, highlighting
search terms,
-use the follow configuration in C<swish.cgi>:
+ ---------- Read config parameters from '.swishcgi.conf' ------
+ $VAR1 = {
+ 'date_ranges' => 0,
+ 'title' => 'Search the Apache Documentation'
+ };
+ -------------------------
+ Enter a query [all]:
+ Using 'not asdfghjklzxcv' to match all records
+ Enter max results to display [1]:
+ Found 1 results
+
+ Can't locate TemplateDefault.pm in @INC (@INC contains: modules
/usr/local/lib/perl5/5.6.0/i586-linux /usr/local/lib/perl5/5.6.0
/usr/local/lib/perl5/site_perl/5.6.0/i586-linux
/usr/local/lib/perl5/site_perl/5.6.0
/usr/local/lib/perl5/site_perl/5.005/i586-linux
/usr/local/lib/perl5/site_perl/5.005 /usr/local/lib/perl5/site_perl .) at
./swish.cgi line 608.
+
+Bother. I fixed the first two problems, but now there's this new error.
Oh, I somehow forgot to
+copy the modules directory. The obvious way to fix that is to copy the
directory. But, there may
+be times where you want to put the module directory in another location.
So, let's modify the
+F<.swishcgi.conf> file and add a "use lib" setting:
+
+ ~/swishtest >cat .swishcgi.conf
+ use lib '/home/bill/swish-e/example/modules';
return {
- title => 'Search the Apache documentation',
- swish_binary => './swish-e',
- swish_index => 'index.swish-e',
- metanames => [qw/swishdefault swishdocpath swishtitle/],
- display_props => [qw/swishlastmodified swishdocsize swishdocpath/],
- title_property => 'swishtitle', # Not required, but recommended
- description_prop=> 'swishdescription',
+ title => 'Search the Apache Documentation',
+ date_ranges => 0,
+ };
- name_labels => {
- swishdefault => 'Body & Title',
- swishtitle => 'Title',
- swishrank => 'Rank',
- swishlastmodified => 'Last Modified Date',
- swishdocpath => 'Document Path',
- swishdocsize => 'Document Size',
+ ~/swishtest >SWISH_DEBUG=basic ./swish.cgi >outfile
+ Debug level set to: 1
+
+ ---------- Read config parameters from '.swishcgi.conf' ------
+ $VAR1 = {
+ 'date_ranges' => 0,
+ 'title' => 'Search the Apache Documentation'
+ };
+ -------------------------
+ Enter a query [all]:
+ Using 'not asdfghjklzxcv' to match all records
+ Enter max results to display [1]:
+ Found 1 results
+
+Now were talking.
+
+Here's a common problem. Everything checks out, but when you run the script
you see the message:
+
+ Swish returned unknown output
+
+Ok, let's find out what output it is returning:
+
+ ~/swishtest >SWISH_DEBUG=headers,output ./swish.cgi >outfile
+ Debug level set to: 13
+
+ ---------- Read config parameters from '.swishcgi.conf' ------
+ $VAR1 = {
+ 'swish_binary' => '/usr/local/bin/swish-e',
+ 'date_ranges' => 0,
+ 'title' => 'Search the Apache Documentation'
+ };
+ -------------------------
+ Enter a query [all]:
+ Using 'not asdfghjklzxcv' to match all records
+ Enter max results to display [1]:
+ usage: swish [-i dir file ... ] [-S system] [-c file] [-f file] [-l]
[-v (num)]
+ ...
+ version: 2.0
+ docs: http://sunsite.berkeley.edu/SWISH-E/
+
+ *** 9872 Failed to run swish: 'Swish returned unknown output' ***
+ Failed to find any results
+
+Oh, looks like /usr/local/bin/swish-e is version 2.0 of swish. We need
2.1-dev and above!
+
+=head1 Frequently Asked Questions
+
+Here's some common questions and answers.
+
+=head2 How do I change the way the output looks?
+
+The script uses a module to generate output. By default it uses the
TemplateDefault.pm module.
+The module used can be selected in the configuration file.
+
+If you want to make simple changes you can edit the TemplatDefault.pm module
directly. If you want to
+copy a module, you must also change the "package" statement at the top of
the module. For example:
+
+ cp TempateDefault.pm MyTemplateDefault.pm
+
+Then at the top of the module adjust the "package" line to:
+
+ package MyTemplateDefault;
+
+To use this modules you need to adjust the configuration settings (either at
the top of F<swish.cgi> or in
+a configuration file:
+
+
+ template => {
+ package => 'MyTemplateDefault',
},
- highlight => {
- package => 'PhraseHighlight',
- meta_to_prop_map => { # this maps search metatags to display
properties
- swishdefault => [ qw/swishtitle swishdescription/ ],
- swishtitle => [ qw/swishtitle/ ],
- swishdocpath => [ qw/swishdocpath/ ],
+
+
+=head2 How do I use a templating system with swish.cgi?
+
+In addition to the TemplateDefault.pm module, the swish-e distribution
includes two other Perl modules for
+generating output using the templating systems HTML::Template and
Template-Toolkit.
+
+Templating systems use template files to generate the HTML, and make
maintaining the look of a large (or small) site
+much easier. HTML::Template and Template-Toolkit are separate packages and
can be downloaded from the CPAN.
+See http://search.cpan.org.
+
+Two basic templates are provided as examples for generating output using
these templating systems.
+The example templates are located in the F<example> directory.
+The module F<TemplateHTMLTemplate.pm> uses the file F<swish.tmpl> to
generate its output, while the
+module F<TemplateToolkit.pm> uses the F<search.tt> file.
+
+To use either of these modules you will need to adjust the "template"
configuration setting. Examples for
+both templating systems are provided in the configuration settings near the
top of the F<swish.cgi> program.
+
+Use of these modules is an advanced usage of F<swish.cgi> and are provided
as examples only.
+
+All of the output generation modules are passed a hash with the results from
the search, plus other data use to create the
+output page. You can see this hash by using the debugging option "dump" or
by using the TemplateDumper.pm
+module:
+
+ ~/swishtest >cat .swishcgi.conf
+ return {
+ title => 'Search the Apache Documentation',
+ template => {
+ package => 'TemplateDumper',
},
- }
+ };
- };
+And run a query. For example:
+ http://localhost:8000/swishtest/swish.cgi?query=install
-Other C<swish.cgi> configuration settings are available, and are listed at
the top of the F<swish.cgi>
-script.
+=head2 Why are there three different highlighting modules?
+Three are three highlighting modules included with the swish-e distribution.
+Each is a trade-off of speed vs. accuracy:
-=back
+ DefaultHighlight.pm - reasonably fast, but does not highlight phrases
+ PhraseHighlight.pm - reasonably slow, but is reasonably accurate
+ SimpleHighlight.pm - fast, some phrases, but least accurate
+
+Eh, the default is actually "PhraseHighlight.pm". Oh well.
+
+Optimizations to these modules are welcome!
+
+=head2 My ISP doesn't provide access to the web server logs
+
+There are a number of options. One way it to use the CGI::Carp module.
Search in the
+swish.cgi script for:
+
+ use Carp;
+ # Or use this instead -- PLEASE see perldoc CGI::Carp for details
+ # use CGI::Carp qw(fatalsToBrowser warningsToBrowser);
+
+And change it to look like:
+
+ #use Carp;
+ # Or use this instead -- PLEASE see perldoc CGI::Carp for details
+ use CGI::Carp qw(fatalsToBrowser warningsToBrowser);
-You should now be ready to run your search engine. Point your browser to:
+This should be only for debugging purposes, as if used in production you may
end up sending
+quite ugly and confusing messages to your browsers.
- http://www.myserver.name/cgi-bin/swish.cgi
+=head2 Why does the output show (NULL)?
+
+The most common reason is that you did not use StoreDescription in your
config file while indexing.
+
+ StoreDescription HTML <body> 200000
+
+That tells swish to store the first 200,000 characters of text extracted
from the body of each document parsed
+by the HTML parser. The text is stored as property "swishdescription".
Running:
+
+ ~/swishtest > ./swish-e -T index_metanames
+
+will display the properties defined in your index file.
+
+This can happen with other properties, too.
+For example, this will happen when you are asking for a property to display
that is not defined in swish.
+
+ ~/swishtest > ./swish-e -w install -m 1 -p foo
+ # SWISH format: 2.1-dev-25
+ # Search words: install
+ err: Unknown Display property name "foo"
+ .
+
+ ~/swishtest > ./swish-e -w install -m 1 -x 'Property foo=<foo>\n'
+ # SWISH format: 2.1-dev-25
+ # Search words: install
+ # Number of hits: 14
+ # Search time: 0.000 seconds
+ # Run time: 0.038 seconds
+ Property foo=(NULL)
+ .
+
+To check that a property exists in your index you can run:
+
+ ~/swishtest > ./swish-e -w not dkdk -T index_metanames | grep foo
+ foo : id=10 type=70 META_PROP:STRING(case:ignore) *presorted*
+
+Ok, in this case we see that "foo" is really defined as a property. Now
let's make sure F<swish.cgi>
+is asking for "foo" (sorry for the long lines):
+
+ ~/swishtest > SWISH_DEBUG=command ./swish.cgi > /dev/null
+ Debug level set to: 3
+ Enter a query [all]:
+ Using 'not asdfghjklzxcv' to match all records
+ Enter max results to display [1]:
+ ---- Running swish with the following command and parameters ----
+ ./swish-e \
+ -w \
+ 'swishdefault=(not asdfghjklzxcv)' \
+ -b \
+ 1 \
+ -m \
+ 1 \
+ -f \
+ index.swish-e \
+ -s \
+ swishrank \
+ desc \
+ swishlastmodified \
+ desc \
+ -x \
+
'<swishreccount>\t<swishtitle>\t<swishdescription>\t<swishlastmodified>\t<swishdocsize>\t<swishdocpath>\t<fos>\t<swishrank>\t<swishdocpath>\n'
\
+ -H \
+ 9
+
+If you look carefully you will see that the -x parameter has "fos" instead
of "foo", so there's our problem.
-adjusting the server and URL to match your system, of course.
=head1 MOD_PERL
@@ -1684,41 +2381,6 @@
Please post to the swish-e discussion list if you have any questions about
running this
script under mod_perl.
-
-=head1 DEBUGGING
-
-The key to debugging CGI scripts is to run them from the command line, not
with a browser.
-
-First, make sure the program compiles correctly:
-
- > perl -c swish.cgi
- swish.cgi syntax OK
-
-Next, simply try running the program:
-
- > ./swish.cgi | head
- Content-Type: text/html; charset=ISO-8859-1
-
- <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
- <html>
- <head>
- <title>
- Search our site
- </title>
- </head>
- <body>
-
-Now, you know that the program compiles and will run from the command line.
-Next, try accessing the script from a web browser.
-
-If you see the contents of the CGI script instead of its output then your
web server is
-not configured to run the script. You will need to look at settings like
ScriptAlias, SetHandler,
-and Options.
-
-If an error is reported (such as Internal Server Error or Forbidden)
-you need to locate your web server's error_log file
-and carefully read what the problem is. Contact your web administrator for
help.
-
=head1 Spidering
@@ -1835,6 +2497,8 @@
See http://www.w3.org/Security/Faq/www-security-faq.html
+Security on Windows questionable.
+
=head1 SUPPORT
The SWISH-E discussion list is the place to ask for any help regarding
SWISH-E or this example
@@ -1844,11 +2508,11 @@
http://swish-e.org/2.2/docs/INSTALL.html#When_posting_please_provide_the_
-Please do not contact the author directly.
+Please do not contact the author or any of the swish-e developers directly.
=head1 LICENSE
-swish.cgi $Revision: 1.2 $ Copyright (C) 2001 Bill Moseley [EMAIL PROTECTED]
+swish.cgi $Revision: 1.3 $ Copyright (C) 2001 Bill Moseley [EMAIL PROTECTED]
Example CGI program for searching with SWISH-E
1.4 +16 -2 modperl-docs/src/search/swish.conf
Index: swish.conf
===================================================================
RCS file: /home/cvs/modperl-docs/src/search/swish.conf,v
retrieving revision 1.3
retrieving revision 1.4
diff -u -r1.3 -r1.4
--- swish.conf 4 Feb 2002 09:22:27 -0000 1.3
+++ swish.conf 3 Mar 2002 11:27:22 -0000 1.4
@@ -1,5 +1,19 @@
IndexDir ./spider.pl
DefaultContents HTML2
StoreDescription HTML2 <body> 100000
-MetaNames swishtitle swishdocpath
-SwishProgParameters default http://localhost/modperl-site/
+MetaNames swishtitle swishdocpath section
+
+# This is to make the URLs shorter in the display.
+ReplaceRules remove http://perl.apache.org
+
+# For example, on my test setup I might do something like:
+# Need ".." since search is on level down
+
+ReplaceRules replace http://mardy:40994/dst_html ..
+
+
+UndefinedMetaTags ignore
+
+#BuzzWords in highlighting --
+#How about counting highlighted terms individually in the highlight module
+#so every term is highlighted at least once, with a total of say five.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]