[htdig] Re: indexing Flash (was: excluding page section...)

2001-01-17 Thread nets

Theoretically, Flash is supposed to put links and text into the HTML 
file if you check those options.  Unfortunately, it sticks them in 
comment fields.  I've had inconsistent behavior with getting it to do 
even that!
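
As a rough illustration of working around the comment problem, the sketch below pulls link targets out of HTML comments using Python's standard html.parser module. The exact comment layout Flash emits is not shown in this thread, so the sample markup, the class name CommentLinkExtractor and the helper links_in_comments are all assumptions made for the example.

```python
import re
from html.parser import HTMLParser

# Matches href="..." / href='...' values, or bare http(s) URLs.
LINK_RE = re.compile(
    r"""href\s*=\s*["']([^"']+)["']|(https?://[^\s"'<>]+)""",
    re.IGNORECASE,
)


class CommentLinkExtractor(HTMLParser):
    """Collects link targets that appear inside HTML comments."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_comment(self, data):
        # Called by HTMLParser with the text between <!-- and -->.
        for match in LINK_RE.finditer(data):
            link = match.group(1) or match.group(2)
            if link not in self.links:  # keep first occurrence only
                self.links.append(link)


def links_in_comments(html_text):
    """Return the links found in the comments of html_text, in order."""
    parser = CommentLinkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Invented sample: a comment holding one anchor tag and one bare URL.
sample = (
    "<html><body><object></object>"
    '<!-- <A HREF="products.html">Products</A> '
    "see http://example.com/contact.html -->"
    "</body></html>"
)
print(links_in_comments(sample))
```

The extracted list could then be written to a plain link page for the indexer to start from.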

Macromedia did publish a Flash file access API or something, but it's 
not open source as far as I know.

I'm working on a report on indexing Flash, so if anyone has a 
text-heavy example, I'd love to see it!

Avi

At 1:04 PM +0100 1/17/01, Torsten Neuer wrote:
 Is there a possibility to index Shockwave Flash files?
   This is a bit harder.  I searched the web for an existing parser but
   only found some more-or-less useful docs and one generic parser.
  
   This generic parser (see attachment) can easily be used within a wrapper
   script to at least extract links from a flash menu, which in my opinion
   is the most requested feature.

  Thanks for this one, but I'll need a bit more time to check it.
  Anyway, extracting links is not enough, I think.  Keywords or a
  full-text index are needed.

Well, a full-text index should also be possible, but it requires some more
work on the parser.  The attached one is just a very generic one which
dumps all the different record entries of a Flash file.  It is not
designed to be an external parser for ht://Dig, but it works well with
the shell wrapper to extract links from Flash menus.  With some
additional work it should be possible to produce a fully fledged external
parser out of it (so far I haven't found the time, nor have I had any
projects depending on that).


-- 
_
Complete Guide to Search Engines for Web Sites, Intranets, 
   and Portals: http://www.searchtools.com


To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ:  http://www.htdig.org/FAQ.html




Re: AW: [htdig] Going for the big dig

2000-12-20 Thread nets

At 9:35 AM +0100 12/20/00, Reich, Stefan wrote:
Not quite sure if this helps, but maybe ;-)

If I'm right, Lotus.com is running on Notes Domino something servers.

We've experienced lots of problems with these Notes servers, because of their
meshed link structure.

I did some testing on this a while back and recommend that robots 
ignore all URLs that contain but do not end in "OpenDocument" or 
"OpenViewCollapseView", and ignore all URLs that contain 
"ExpandView" or "OpenViewStart".
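
A minimal sketch of the rule above as a URL filter a robot could apply before fetching. The marker strings are copied verbatim from this message; the function name should_skip, the case-sensitive matching and the example.com URLs are assumptions for illustration, not part of any ht://Dig configuration.

```python
# Skip URLs that contain these markers anywhere except at the very end.
SKIP_UNLESS_FINAL = ("OpenDocument", "OpenViewCollapseView")
# Skip URLs that contain these markers anywhere.
ALWAYS_SKIP = ("ExpandView", "OpenViewStart")


def should_skip(url):
    """Return True if a robot following the rule above should ignore url."""
    if any(marker in url for marker in ALWAYS_SKIP):
        return True
    for marker in SKIP_UNLESS_FINAL:
        if marker in url and not url.endswith(marker):
            return True
    return False


# Hypothetical Domino-style URLs, only to exercise the filter.
for candidate in (
    "http://example.com/db.nsf/view/doc1?OpenDocument",
    "http://example.com/db.nsf/view/doc1?OpenDocument&Click=abc",
    "http://example.com/db.nsf/view?OpenView&ExpandView",
    "http://example.com/index.html",
):
    print(candidate, "skip" if should_skip(candidate) else "fetch")
```

The idea is that only the canonical form of each document gets fetched, while the many expand/collapse variants of the same view are ignored.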

Hope it helps,

Avi





Re: [htdig] Question about search engine

2000-11-20 Thread nets

At 11:41 AM -0800 11/20/2000, Doug Barton wrote:
Dmitry Lesov wrote:

  Dear Sir/Madam
  I have a specific question about the search engine. I am looking for a
  search engine smart enough to target HTML pages back into their original
  frame set. Does your search engine have this capability? Please
  reply ASAP. Thank you.

   No search engine I know of does; it's actually a very difficult problem
to solve. However, there are some JavaScript tricks you can use. Take a look
at http://www.zdnet.com/devhead/stories/articles/0,4413,2438662,00.html

Actually, MondoSearch does this rather nicely 
http://www.mondosearch.com.  The drawback is that they have to 
reindex the entire site to do an update.

Avi






Re: [htdig] ssl patch

2000-11-16 Thread nets

Dear Jeremy,

I'm doing some research on people indexing data using SSL.  Could you 
tell me a little more about your data?  Is it just forms for entry, 
or is it personal or business records of some kind?  Are you doing 
something interesting with security for the resulting index and 
search results?  I'll be happy to let you know when I finish my 
report.

Thanks,

Avi

At 3:15 PM -0700 11/16/2000, Jeremy Lyon wrote:
Hi

I just tried to patch htdig 3.1.5 with the ssl patch
ftp://ftp.ccsf.org/htdig-patches/3.1.5/ssl.2 to a clean htdig.  I got
these errors






[htdig] what SSL pages are people indexing?

2000-11-02 Thread nets

I'm interested in what kinds of SSL pages are appropriate for 
indexing -- are they order records, customer information, private 
logs?  Do you index non-SSL and SSL data in the same pass?  Also, 
what, if any, precautions are you taking for securing the resulting 
indexes and search forms?

Please reply to me privately and I'll summarize for the list.

Thanks,

Avi
-- 

The Complete Guide to Site, Intranet and Portal Search Engines
mailto:[EMAIL PROTECTED]  http://www.searchtools.com






Re: [htdig] Including Pull-Down Menu Pages

2000-10-20 Thread nets

I don't think you quite understand the magnitude of your request. 
Have you looked at all the other search engines?  Can you name me 
*any* which can indeed do that?  I have been testing search engines 
for a year and have yet to find one which can deal with JavaScript 
links, including the ones which cost thousands of dollars.  One 
search service (atomz.com) will deal with raw Flash files.  I do keep 
mentioning this issue to my colleagues who write search engines, but 
without luck so far.

Blaming the ht://Dig Group for that is just silly.  It's open source -- 
if you think it should be done, you can always do it yourself.

Avi

PS generating a list of the

At 4:24 PM -0400 10/20/2000, Douglas Kline wrote:
I think that the inability of the search engine to find pages referenced
through the menu bar and not by hyper-links is a significant disadvantage.
Web programmers who employ these menu bars may not know that they won't be
traversed by search engines and may not think about it and use LINK tags
even if they know.  Whoever maintains the search engine may not know either.
Even if they know or find out eventually, actually installing LINK tags for
all pages referenced through menu bars and not through hyperlinks might be
quite difficult and even making sure that all Web programmers are aware of
the need for LINK tags for future use would be problematic.





RE: [htdig] searching in flash movies

2000-09-07 Thread nets

At 4:21 PM +0100 9/7/2000, Srini Sathya. wrote:
I am having exactly the same problem.  I have a Flash file (swf) which
contains all the hyperlinks to the main site.  I want to dig all the
pages which that Flash file links to.  Is this achievable?
Currently I am creating a link manually for each of those links and then
digging.  Considering the long-term requirements, this will not help, as a
lot of sub-directories will be added.  Is there any workaround?

If you have any control over the content, Flash has several ways to 
add a visible version of the text in the movie.  I believe there's an 
option when you generate the movie, and I know the AfterShock utility 
will build both text and HTML links automatically.

There must be a way to get into the Flash movie because the Atomz.com 
search engine can index them.  I think they made a deal with 
Macromedia for the file format though.

In the long run, I sure hope SVG wins over Flash -- it's a proper 
XML-based markup language for generating movies and effects, but you 
can read it without stress.

Avi




Re: [htdig] Not indexing a word

2000-02-23 Thread nets

At 8:17 AM -0600 2/23/2000, Geoff Hutchison wrote:
At 12:31 PM + 2/23/00, Malcolm Austen wrote:
+ I'm afraid you can't say "index this word," though that's not a bad
+ idea (a "good words" list?)

OK, let's ask for the sky ... how about (at some far distant point) a
"good phrases" list please?

The context of this request is that I don't want to index all instances of
"it" but I would like to index "IT" in the context of "IT Committee" 8-)

Yes, this would be nice, wouldn't it. Adding a "good words" list 
isn't so bad--you check it quickly before tossing the word. The 
difficulty of your request is that it would change the way documents 
are parsed--right now they're split up into words, so you'd have to 
say "wait, we just saw 'committee,' did we have 'IT' just then?" You 
could still do it, but it would be a bit more complex.

Most of the large search engines I've seen no longer ignore short 
words and stopwords -- they just index everything.  I realize it 
requires a lot more disk space (though there may be some clever ways 
around that), but it simplifies things both internally and for the 
end-users.  That way, they can search for "To Be Or Not To Be" and 
find something!

My rule for search engines is "no surprises", and I think there are 
enough legitimate instances of people needing to search two and even 
one-letter words that ht://Dig should allow that as an option.
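
To make the trade-off concrete, here is a small sketch of the word-splitting step with a stop-word list, a "good phrases" look-back of the kind Geoff describes, and the index-everything option. It is not ht://Dig's parser; the names STOP_WORDS, GOOD_PHRASES and index_terms, and the word lists themselves, are invented for the example.

```python
# Hypothetical lists; a real installation would load these from files.
STOP_WORDS = frozenset({"it", "is", "the", "to", "be", "or", "not"})
GOOD_PHRASES = frozenset({("it", "committee")})


def index_terms(text, stop_words=STOP_WORDS, good_phrases=GOOD_PHRASES):
    """Return the terms that would be indexed for text, in order."""
    words = [w.strip('.,;:!?"()').lower() for w in text.split()]
    words = [w for w in words if w]
    terms = []
    for i, word in enumerate(words):
        # Look back one word: "we just saw 'committee', did we have 'IT'?"
        if i > 0 and (words[i - 1], word) in good_phrases:
            terms.append(words[i - 1] + " " + word)
        if word not in stop_words:
            terms.append(word)
    return terms


print(index_terms("It is the IT Committee"))
# With stop words active, this famous query indexes to nothing at all.
print(index_terms("To Be Or Not To Be"))
# Indexing everything: pass an empty stop-word list.
print(index_terms("To Be Or Not To Be", stop_words=frozenset()))
```

Lowercasing means "IT" and "it" are indistinguishable as single words here, which is why the phrase list is what rescues "IT Committee".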

Avi





Re: [htdig] Alta Vista goes Open source

2000-02-01 Thread nets

Nope, not open-source, the reporter got confused with HTML code.  You 
know how it goes when they're in a hurry.

Avi

At 9:01 AM -0500 2/1/2000, Charlie Romero wrote:
This article is in the news this morning. I haven't read AV's press 
release yet. But this sounds like an unbelievable opportunity to 
combine the best of both worlds.

This is a link to the news article on Excite:

http://news.excite.com/news/zd/000201/05/altavista-opens-its

Charlie





Re: [htdig] Alta Vista goes Open source

2000-02-01 Thread nets

At 12:23 PM -0600 2/1/2000, Geoff Hutchison wrote:
At 10:13 AM -0800 2/1/00, [EMAIL PROTECTED] wrote:
Nope, not open-source, the reporter got confused with HTML code. 
You know how it goes when they're in a hurry.

Actually, they use the term "open source" in their press release, so 
it's not an issue with ZDNet or the reporter. It's an issue with 
AltaVista.

Yeah, the release was stupid but they just meant HTML.

I could imagine that they'll give the source to members of their 
affiliate program, but many people already had the binaries. 
Besides, as many people point out, this is for their "intranet" 
program, which is not the software they run themselves.

Nope, they have no intention of giving away source to AltaVista 
Search (the one they sell).  Confirmed by the product manager by 
phone this morning.

Interestingly, the code is that which they run on the main servers. 
So you know it scales nicely (if you invest in a server farm).

Avi





[htdig] double-byte characters?

1999-12-22 Thread nets

Can ht://Dig handle Japanese, Korean or Chinese?  Unicode?  According 
to the archives, folks were talking about this last year, but there 
was no clear resolution.

Thanks,

Avi





[htdig] Re: Request for a CGI-based admin interface

1999-12-01 Thread nets

I think this is a *splendid* idea, as I never seem to find the time 
to decode htdig config files.

I'd very much like to help out with wording and organization, as I'm 
currently reviewing AltaVista Search, Ultraseek, Verity, Site Server 
and Excalibur, all of which have some form of browser admin. 
Ultraseek's is best, though they don't allow results page 
customization.  For that, the best options are the remote search 
services Atomz and PinPoint.

But please don't design the system so it's always limited to "a few 
minimal attributes" -- better to put the structures in place for 
providing complete control.

Avi

At 10:38 PM -0600 11/30/1999, Geoff Hutchison wrote:
Hi,

One of the complaints I hear periodically is that ht://Dig is a bit 
difficult to administer. I've heard more than one person say that 
they'd think about writing something to do this, probably web/CGI 
based. However, the closest thing that I've seen are modified 
"rundig" scripts, including my multidig scripts.

OK, time to put our money where our mouths are. :-)

I was poking around the various open-source development bazaars just 
now and I saw two offers for htdig on Cosource.com, one for a SQL 
backend and one for using ht://Dig to index KDevelop.

I'll make an additional entry. I'll pony up $75 to see someone write 
a useful admin interface for ht://Dig. Minimally, it would be 
web-based (preferably in perl), let you handle multiple databases, 
add URLs to a database, set a few minimal attributes like 
exclude_urls, and mail you when indexing is done with a nice summary 
message.

I'll be writing up a slightly more detailed description shortly. 
I'll also include links to all three offers in the "Recent News" 
section on http://www.htdig.org/

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/





Re: [htdig] Frame problems!

1999-10-19 Thread nets


At 2:46 PM +0100 10/19/99, Max wrote:
Hi Andy,
thanks for responding to me. The problem, though, is that I can't have any
links on the "content" page, only on the menus in the side frame.

So if someone finds your site from a search engine, they can't get 
anywhere from a found page?  This seems like a mistake.  Even if you 
put in JavaScript to always bring up the frame context, it won't work 
for a lot of browsers.

I recommend that you always have basic links on the content page, so 
people can get *somewhere* from there.  All the good web info 
architects and UI designers I know, including Jakob Nielsen, Jennifer 
Fleming and Lou Rosenfeld, say "don't let people get into dead ends".

Avi

PS I wrote a white paper for Ultraseek on robot indexing, all about 
the problems with frames, JavaScript, image maps, and so on.  You 
might find it useful -- it's at 
http://software.infoseek.com/products/ultraseek/docs/wp-spider/default.htm


To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED] containing the single word unsubscribe in
the SUBJECT of the message.



Re: [htdig] how it works, question

1999-09-15 Thread nets


If you're using a robot to index pages, there is no way for it to 
know about the contents of your directory.  All it can know about is 
the pages linked from somewhere else -- the local directory listing 
is not available.

You could make a link page that only has listings of pages, and 
include that with your starting point.  Be sure to set the META 
ROBOTS tag to NOINDEX,FOLLOW, so the page itself is not indexed.
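
One way to produce such a link page is sketched below: it walks a local directory tree and emits an HTML page of links carrying the NOINDEX,FOLLOW robots tag. The function name build_link_page, the restriction to *.html files and the example.com base URL are assumptions for the example; adjust them to the actual site layout.

```python
import html
from pathlib import Path
from urllib.parse import quote


def build_link_page(root_dir, base_url):
    """Return HTML listing every .html file under root_dir as a link."""
    root = Path(root_dir)
    items = []
    for path in sorted(root.rglob("*.html")):
        rel = path.relative_to(root).as_posix()
        url = base_url.rstrip("/") + "/" + quote(rel)
        items.append(
            '<li><a href="%s">%s</a></li>'
            % (html.escape(url, quote=True), html.escape(rel))
        )
    return (
        "<html>\n<head>\n"
        # Tell robots to follow the links but keep this page out of the index.
        '<meta name="robots" content="noindex,follow">\n'
        "<title>Link page for the indexer</title>\n"
        "</head>\n<body>\n<ul>\n"
        + "\n".join(items)
        + "\n</ul>\n</body>\n</html>\n"
    )


if __name__ == "__main__":
    # Hypothetical document root and site URL.
    print(build_link_page(".", "http://www.example.com/"))
```

The generated page would be saved somewhere the robot can reach and added to the starting URLs. If it is saved inside the same tree, it will list itself on the next run, which is harmless given the NOINDEX tag.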

Hope that helps,

Avi

At 7:31 AM -0700 9/15/99, Sadhunathan Nadesan wrote:
hm   i have been searching the web site, faq's etc, for how htdig
actually works, and, it doesn't tell, although gives a clue.  rather than
digging in the source, can someone confirm this?

htdig only follows links


is that so obvious that everyone assumes it?  wasn't obvious to me, if it
is true.  i expected it to recursively search every sub directory under the
start url looking for all .html or text files.  now i am beginning to think
that perhaps this is a false assumption. 

in other words, if i have an index.html page in the start url, and it
doesn't happen to have any links to many subdirectories beneath it which
also have html pages .. none of the other pages get indexed.  is that the
case?  if so, perhaps this info ought to be placed in the faq.  if not, i
am still stuck as to why it doesn't find everything under a start url.

the problem being, i have many directories with html pages which are not
pointed to by any html page on the site, the links are on other servers
(not necessarily being indexed).  so i guess i have to list each directory
explicitly then???

well, any comments appreciated,
thank you
sadhu







Re: [htdig] Indexing the Internet

1999-09-06 Thread nets


At 2:16 PM -0500 9/6/1999, Geoff Hutchison wrote:

I don't think you need to worry about it not being able to do it. 
IMHO (yes, I'm biased), there's very little separating ht://Dig from 
a commercial package technically. Granted, it doesn't have a nice 
clean admin interface, but if you just want to set something up and 
leave it alone, you'll likely never notice.

IMNSHO, ht://Dig is better than most of the commercial packages.   It 
can't compete at the very top level (millions of documents, thousands 
of concurrent requests), but is certainly an excellent search engine.

Now, if only someone would write a browser front end to the admin, I 
would be a happy camper.

Avi





Re: [htdig] htdig-ing Sites with complex framesets

1999-07-27 Thread nets


At 4:03 PM +0100 7/27/1999, Rzepa, Henry wrote:

Is anyone aware of any issues with the use of framesets that call other
framesets?   There was some discussion of nested framesets on June 11,
but I was not sure from that whether htdig had specific problems with such
framesets

You might also want to make a policy of including links in the 
NOFRAMES section of each frameset page.  Makes life a lot easier 
for robots, non-graphical HTTP clients, PDAs and appliances, etc.

Adding NOFRAMES links also makes it possible for blind and 
visually-impaired people to traverse your site with speaking 
browsers.  It's the Right Thing To Do, and in the US, government and 
other publicly-funded sites should always consider accessibility 
requirements.   Check out the Web Accessibility Initiative at 
http://www.w3.org/WAI/ and the Bobby accessibility checker at 
http://www.cast.org/bobby/ for more info.
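
As a sketch of that policy, the helper below generates a frameset page whose NOFRAMES section repeats every frame source as an ordinary link, so a robot or text browser always has something to follow. The function name frameset_page, the two-column layout and the file names are assumptions for the example.

```python
import html


def frameset_page(title, frames, cols="25%,75%"):
    """Build a frameset page with a NOFRAMES fallback list of links.

    frames is a list of (frame_name, src, link_label) tuples.
    """
    frame_tags = "\n".join(
        '  <frame name="%s" src="%s">'
        % (html.escape(name, quote=True), html.escape(src, quote=True))
        for name, src, _label in frames
    )
    links = "\n".join(
        '      <li><a href="%s">%s</a></li>'
        % (html.escape(src, quote=True), html.escape(label))
        for _name, src, label in frames
    )
    return (
        "<html>\n"
        "<head><title>%s</title></head>\n" % html.escape(title)
        + '<frameset cols="%s">\n' % html.escape(cols, quote=True)
        + frame_tags + "\n"
        # Fallback content for clients that do not render frames.
        + "  <noframes>\n    <body>\n      <ul>\n"
        + links + "\n"
        + "      </ul>\n    </body>\n  </noframes>\n"
        + "</frameset>\n</html>\n"
    )


print(frameset_page(
    "Demo site",
    [("menu", "menu.html", "Site menu"),
     ("content", "home.html", "Home page")],
))
```

For nested framesets, a src can itself point at another page built the same way, so each level carries its own fallback links.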

Avi


