That's the XML side of things over with.

I've got a few more random things about searching in general, mostly
theoretical stuff:

How do people index truly dynamic sites?

Here comes a mostly made-up example, but it illustrates my point quite well.
WARNING: Perl and Bourne shell are used in fair quantity in this e-mail.


First off, a really really simple CGI script.

<snip>
#!/bin/sh
# print the CGI header (a blank line ends it), then a random fortune cookie
printf "Content-type: text/plain\n\n"
exec fortune
</snip>

And that's it. It prints out a random fortune cookie. How do you add this to
the search engine? You'd only index the words in whichever cookie happened to
be served at crawl time. Useless. You'd be _very_ unlikely to be able to
search for the word "cookie". There's obviously the option of something like
my XML interface from the last e-mail, but that doesn't use the spider, so it
isn't quite as unified as it could be.
You could instead serve it as an HTML document with meta tags, something more
like:

<snip>
#!/opt/bin/perl
use CGI;
my $q = new CGI;

print $q->header;
print $q->start_html;
print "<META NAME=\"DESCRIPTION\" CONTENT=\"This produces nothing but fortune cookies\">\n";
print "<META NAME=\"KEYWORDS\" CONTENT=\"fortune,cookie\">\n";
open FORTUNE, "fortune |" or die "can't run fortune: $!";
while (<FORTUNE>)
{
        # escape HTML metacharacters
        s/&/&amp;/go;
        s/</&lt;/go;
        s/>/&gt;/go;
        # strip trailing whitespace; render leading indentation as &nbsp;
        s/\s+$//go;
        s{^(\s+)}{'&nbsp;' x length($1)}gem;
        print $_."<BR>\n";
}
close FORTUNE;
print $q->end_html;
exit 0;
</snip>

Right. But surely that produces _far_ more heavyweight HTML than is
necessary? You're doubling the size of the page for a lot of the cookies you
might see.

And the spider still sees the fortune cookie itself and indexes it. If there
were offensive words in there, that'd be _really_ bad. I know, put it
between <!-- udmcomment --> tags or whatever it is, but that's just making
the HTML bigger again.
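
For the record, I believe the markers look something like this, though check
the mnogosearch docs for the exact spelling:

<snip>
<P>This paragraph gets indexed as normal.</P>
<!--UdmComment-->
<P>The spider ignores everything in here.</P>
<!--/UdmComment-->
</snip>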

And I know we're only talking about a small amount of data in the grand
scheme of things, but the same issue scales up to bigger problems.

So, this is actually what my fortune cookie program currently is:

<snip>
#!/opt/bin/perl
use CGI;

my $q = new CGI;

print $q->header;
print $q->start_html(-title => "Second Spider Cloaking Test",
                     -BGCOLOR => '#FFFFCC');

# the spider identifies itself as UdmSearch; it gets keywords and nothing else
if ( $ENV{"HTTP_USER_AGENT"} =~ /UdmSearch/ )
{
        print "<META NAME=\"DESCRIPTION\" CONTENT=\"This produces nothing "
            . "but fortune cookies, and UDM has indexed on some keywords, "
            . "but they're not in the page if you go to it\">";
        print "<META NAME=\"KEYWORDS\" "
            . "CONTENT=\"chunky,kibbles,fortune,magic,cookie,machine\">";
}
else
{
        # everyone else gets the actual cookie, escaped as before
        open FORTUNE, "/home/gbriggs/bin/fortune |" or die "can't run fortune: $!";
        while (<FORTUNE>)
        {
                s/&/&amp;/go;
                s/</&lt;/go;
                s/>/&gt;/go;
                s/\s+$//go;
                s{^(\s+)}{'&nbsp;' x length($1)}gem;
                print $_."<BR>\n";
        }
}
print $q->end_html;
exit 0;
</snip>
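
You can check both faces of the cloak from the command line; wget's -U flag
fakes the user agent. The URL here is hypothetical, and the exact version
string doesn't matter since the script only matches /UdmSearch/:

<snip>
$ wget -q -U "UdmSearch/3.1" -O - http://myhost/cgi-bin/fortune.cgi
$ wget -q -O - http://myhost/cgi-bin/fortune.cgi
</snip>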

It detects the user agent and only shows the spider the things I want it to
see. Interestingly, this has other uses; you can show it scraps of HTML of
the form
<A HREF="somewhere">Crossword</A>
and it'll know to go there. It's a handy way of feeding it things; see the
sketch below.
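
Here's a minimal sketch of that in Bourne shell; the URLs are hypothetical,
and the web server hands the user agent to CGI scripts as HTTP_USER_AGENT:

<snip>
#!/bin/sh
# sketch: serve a link farm to the spider, a normal page to humans
printf "Content-type: text/html\n\n"
case "$HTTP_USER_AGENT" in
*UdmSearch*)
        # the spider queues and crawls these; no human ever sees the links
        echo '<HTML><BODY>'
        echo '<A HREF="/cgi-bin/crossword.cgi">Crossword</A>'
        echo '</BODY></HTML>'
        ;;
*)
        echo '<HTML><BODY>Nothing to see here.</BODY></HTML>'
        ;;
esac
</snip>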

And we're onto another topic. Seed pages.

How do you index, for example, all the Unix manpages? This is something else
that I find to be of practical use.

I have a short CGI script, man.cgi, from here:
http://www.oac.uci.edu/indiv/ehood/man2html.html
It's nice and configurable. Looks fairly reasonable, does what it's meant to
do.

But how would you tell mnogosearch to index every manpage? It'd get mighty
boring mighty quickly if you had to run
"./indexer -u long-url"
for every manpage; there are approximately 9,600 of them on the host I run
mnogo on.
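
That is, something like this once per manpage, with the URL purely
illustrative:

<snip>
./indexer -u "http://myhost/cgi-bin/manpages/man.cgi?section=all&topic=ls"
</snip>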

So, instead, I needed a "seed page" to point the indexer at.

Well.


<snip>
#!/bin/sh
echo "<HTML><BODY>" > ./manpages.html
echo "<meta name=\"robots\" content=\"noindex,follow\">" >> ./manpages.html

(for d in /home/gbriggs/man /opt/man /usr/man /usr/openwin/man /usr/dt/man
/opt/SUNWspro/man /opt/gnu/man ;do find $d -type f ;done) | sed 's/^.*\///g'
| sort -u | awk '{print "<A
HREF=\"/cgi-bin/manpages/man.cgi?section=all&topic="$1"\">"$1"</A>";}' >>
./manpages.html

echo "</BODY></HTML>" >> ./manpages.html
</snip>


And then it'll index all the manpages, but will leave that one page out of
the results, thanks to the robots meta tag.
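
The indexer then just needs pointing at the seed page; something along these
lines in indexer.conf, with the hostname hypothetical (check your version's
docs for the exact directive):

<snip>
# crawling starts here; the robots meta keeps this page itself out of the index
Server http://myhost/manpages.html
</snip>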

And yes, I know my code could be improved greatly, but I needed something
and I needed it fast [at the time].


Anyone else have experience with this or other similar things?

Thank-you very much,
Gary (-;