[htdig] Re: indexing Flash (was: excluding page section...)
Theoretically, Flash is supposed to put links and text into the HTML file if you check those options. Unfortunately, it sticks them in comment fields. I've had inconsistent behavior with getting it to do even that! Macromedia did publish a Flash file access API or something, but it's not open source as far as I know.

I'm working on a report on indexing Flash, so if anyone has a text-heavy example, I'd love to see it!

Avi

At 1:04 PM +0100 1/17/01, Torsten Neuer wrote:

Is there a possibility to index Shockwave Flash files?

This is a bit harder. I searched the web for an existing parser but only found some more-or-less useful docs and one generic parser. This generic parser (see attachment) can easily be used within a wrapper script to at least extract links from a flash menu, which in my opinion is the most requested feature.

Thanx for this one, but I'll need a bit more time to check it. Anyway, extracting links is not enough, I think. Keywords or a full text index are needed.

Well, full text index should also be possible, but requires some more work on the parser. The attached one is just a very generic one which dumps all the different record entries of a flash file. It is not designed to be an external parser for ht://Dig, but it works well with the shell wrapper to extract links from flash menus. With some additional work it should be possible to produce a fully fledged external parser out of it (yet, I haven't found the time nor did I have some projects depending on that).

--
Complete Guide to Search Engines for Web Sites, Intranets, and Portals: http://www.searchtools.com

To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this.
List archives: http://www.htdig.org/mail/menu.html
FAQ: http://www.htdig.org/FAQ.html
Re: AW: [htdig] Going for the big dig
At 9:35 AM +0100 12/20/00, Reich, Stefan wrote:

Not quite sure if this helps, but maybe ;-) If I'm right, Lotus.com is running on Notes Domino something servers. We've experienced lots of problems with these Notes servers, because of their meshed link structure.

I did some testing on this a while back and recommend that robots should ignore all URLs that contain but do not end in "OpenDocument" and "OpenViewCollapseView", and ignore all URLs that contain "ExpandView" or "OpenViewStart".

Hope it helps,
Avi
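A minimal sketch of the URL rule recommended above, written in Python. The function name `should_skip` and the example URLs are hypothetical, and the command strings are copied exactly as they appear in the message; this only illustrates the rule and is not ht://Dig code.

```python
# Strings that are acceptable only at the very end of a URL
# (copied as written in the message above).
MUST_END_WITH = ("OpenDocument", "OpenViewCollapseView")

# Strings that rule a URL out wherever they appear.
ALWAYS_SKIP = ("ExpandView", "OpenViewStart")


def should_skip(url):
    """Return True if a robot following the rule above should ignore url."""
    if any(token in url for token in ALWAYS_SKIP):
        return True
    for token in MUST_END_WITH:
        # "contain but do not end in": the token is present, but
        # something else follows it in the URL.
        if token in url and not url.endswith(token):
            return True
    return False


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    for candidate in (
        "http://www.example.com/db.nsf/view/doc1?OpenDocument",
        "http://www.example.com/db.nsf/view/doc1?OpenDocument&ExpandSection=1",
        "http://www.example.com/db.nsf/view?ExpandView",
    ):
        print(should_skip(candidate), candidate)
```

The comparison here is case-sensitive, which is an assumption; a crawler facing mixed-case URLs might lower-case both sides first.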
Re: [htdig] Question about search engine
At 11:41 AM -0800 11/20/2000, Doug Barton wrote:

Dmitry Lesov wrote:

Dear Sir/Madam

I have a specific question about the search engine. I am looking for the search engine smart enough to target HTML pages back into original frame set. Does your search engine have this capability? Please reply ASAP. Thank You

No search engine I know of does, it's actually a very difficult problem to solve. However, there are some JavaScript tricks you can use. Take a look at http://www.zdnet.com/devhead/stories/articles/0,4413,2438662,00.html

Actually, MondoSearch does this rather nicely: http://www.mondosearch.com. The drawback is that they have to reindex the entire site to do an update.

Avi
Re: [htdig] ssl patch
Dear Jeremy,

I'm doing some research on people indexing data using SSL. Could you tell me a little more about your data? Is it just forms for entry, or is it personal or business records of some kind? Are you doing something interesting with security for the resulting index and search results?

I'll be happy to let you know when I finish my report.

Thanks,
Avi

At 3:15 PM -0700 11/16/2000, Jeremy Lyon wrote:

Hi

I just tried to patch htdig 3.1.5 with the ssl patch ftp://ftp.ccsf.org/htdig-patches/3.1.5/ssl.2 to a clean htdig. I got these errors
[htdig] what SSL pages are people indexing?
I'm interested in what kinds of SSL pages are appropriate for indexing -- are they order records, customer information, private logs? Do you index non-SSL and SSL data in the same pass? Also, what, if any, precautions are you taking for securing the resulting indexes and search forms?

Please reply to me privately and I'll summarize for the list.

Thanks,
Avi

--
The Complete Guide to Site, Intranet and Portal Search Engines
mailto:[EMAIL PROTECTED]
http://www.searchtools.com
Re: [htdig] Including Pull-Down Menu Pages
I don't think you quite understand the magnitude of your request. Have you looked at all the other search engines? Can you name me *any* which can indeed do that? I have been testing search engines for a year and have yet to find one which can deal with JavaScript links, including the ones which cost thousands of dollars. One search service (atomz.com) will deal with raw Flash files.

I do keep mentioning this issue to my colleagues who write search engines, but without luck so far. Blaming the ht://Dig Group for that is just silly. It's open source -- if you think it should be done, you can always do it yourself.

Avi

PS generating a list of the

At 4:24 PM -0400 10/20/2000, Douglas Kline wrote:

I think that the inability of the search engine to find pages referenced through the menu bar and not by hyper-links is a significant disadvantage. Web programmers who employ these menu bars may not know that they won't be traversed by search engines and may not think about it and use LINK tags even if they know. Whoever maintains the search engine may not know either. Even if they know or find out eventually, actually installing LINK tags for all pages referenced through menu bars and not through hyperlinks might be quite difficult and even making sure that all Web programmers are aware of the need for LINK tags for future use would be problematic.
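One possible do-it-yourself workaround for the menu problem in this thread, sketched in Python. It assumes the pull-down menu is an HTML SELECT element whose OPTION values are the target URLs, which the thread does not confirm; the names `menu_urls` and `plain_links` are hypothetical. The sketch collects those values and emits ordinary hyperlinks that a link-following robot can traverse.

```python
from html import escape
from html.parser import HTMLParser


class OptionUrlCollector(HTMLParser):
    """Collect the non-empty value attribute of every OPTION tag."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            value = dict(attrs).get("value")
            if value:  # skip placeholder entries such as "Choose..."
                self.urls.append(value)


def menu_urls(page_source):
    """Return the OPTION values found in page_source, in document order."""
    collector = OptionUrlCollector()
    collector.feed(page_source)
    return collector.urls


def plain_links(page_source):
    """Return one plain anchor tag per menu entry, ready to paste into a page."""
    return [
        '<a href="{0}">{0}</a>'.format(escape(url, quote=True))
        for url in menu_urls(page_source)
    ]


if __name__ == "__main__":
    sample = (
        '<select onchange="go(this)">'
        '<option value="">Choose...</option>'
        '<option value="products.html">Products</option>'
        '<option value="support/index.html">Support</option>'
        "</select>"
    )
    print("\n".join(plain_links(sample)))
```

Menus built entirely in JavaScript, with no URLs in the markup, would not be helped by this; the links would have to be listed by hand.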
RE: [htdig] searching in flash movies
At 4:21 PM +0100 9/7/2000, Srini Sathya. wrote:

I am having exactly the same problem. I have a Flash file (swf) which contains all the hyperlinks to the main site. I want to dig all the relative sites which that Flash file contains. Is this achievable? Currently I am creating a link manually for all those links and then digging. Considering the long-term requirements, this will not help, as a lot of sub-directories will be getting added. Is there any workaround?

If you have any control over the content, Flash has several ways to add a visible version of the text in the movie. I believe there's an option when you generate the movie, and I know the AfterShock utility will build both text and HTML links automatically.

There must be a way to get into the Flash movie because the Atomz.com search engine can index them. I think they made a deal with Macromedia for the file format though.

In the long run, I sure hope SVG wins over Flash -- it's a proper XML-based markup language for generating movies and effects, but you can read it without stress.

Avi
Re: [htdig] Not indexing a word
At 8:17 AM -0600 2/23/2000, Geoff Hutchison wrote:

At 12:31 PM + 2/23/00, Malcolm Austen wrote:

+ I'm afraid you can't say "index this word," though that's not a bad
+ idea (a "good words" list?)

OK, let's ask for the sky ... how about (at some far distant point) a "good phrases" list please? The context of this request is that I don't want to index all instances of "it" but I would like to index "IT" in the context of "IT Committee" 8-)

Yes, this would be nice, wouldn't it. Adding a "good words" list isn't so bad--you check it quickly before tossing the word. The difficulty of your request is that it would change the way documents are parsed--right now they're split up into words, so you'd have to say "wait, we just saw 'committee,' did we have 'IT' just then?" You could still do it, but it would be a bit more complex.

Most of the large search engines I've seen no longer ignore short words and stopwords -- they just index everything. I realize it requires a lot more disk space (though there may be some clever ways around that), but it simplifies things both internally and for the end-users. That way, they can search for "To Be Or Not To Be" and find something!

My rule for search engines is "no surprises", and I think there are enough legitimate instances of people needing to search two and even one-letter words that ht://Dig should allow that as an option.

Avi
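The "good words" and "good phrases" ideas discussed above can be sketched in a few lines of Python. This is not how ht://Dig parses documents; the word lists, the minimum length, and the function name `index_terms` are all made up for illustration. It shows the quick "good words" check before a word is tossed, and the one-word look-back that a phrase list would need.

```python
import re

# Hypothetical settings, chosen only for this illustration.
STOP_WORDS = {"it", "the", "a", "of", "to"}
MIN_LENGTH = 3                        # words shorter than this are tossed
GOOD_WORDS = {"UI"}                   # kept even though short (case-sensitive)
GOOD_PHRASES = {("IT", "Committee")}  # kept as a unit (case-sensitive)


def index_terms(text):
    """Split text into words and return the terms that would be indexed."""
    terms = []
    previous = None  # the raw word seen just before the current one
    for word in re.findall(r"[A-Za-z0-9]+", text):
        # "Wait, we just saw 'Committee'; did we have 'IT' just then?"
        if previous is not None and (previous, word) in GOOD_PHRASES:
            terms.append(previous + " " + word)
        tossed = word.lower() in STOP_WORDS or len(word) < MIN_LENGTH
        # The "good words" list is checked quickly before tossing the word.
        if not tossed or word in GOOD_WORDS:
            terms.append(word.lower())
        previous = word
    return terms


if __name__ == "__main__":
    print(index_terms("It was sent to the IT Committee"))
```

Indexing everything, as the reply suggests, would amount to emptying `STOP_WORDS` and setting `MIN_LENGTH` to 1, at the cost of a larger index.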
Re: [htdig] Alta Vista goes Open source
Nope, not open-source, the reporter got confused with HTML code. You know how it goes when they're in a hurry.

Avi

At 9:01 AM -0500 2/1/2000, Charlie Romero wrote:

This article is in the news this morning. I haven't read AV's press release yet. But this sounds like an unbelievable opportunity to combine the best of both worlds. This is a link to the news article on Excite: http://news.excite.com/news/zd/000201/05/altavista-opens-its

Charlie
Re: [htdig] Alta Vista goes Open source
At 12:23 PM -0600 2/1/2000, Geoff Hutchison wrote:

At 10:13 AM -0800 2/1/00, [EMAIL PROTECTED] wrote:

Nope, not open-source, the reporter got confused with HTML code. You know how it goes when they're in a hurry.

Actually, they use the term "open source" in their press release, so it's not an issue with ZDNet or the reporter. It's an issue with AltaVista.

Yeah, the release was stupid but they just meant HTML.

I could imagine that they'll give the source to members of their affiliate program, but many people already had the binaries. Besides, as many people point out, this is for their "intranet" program, which is not the software they run themselves.

Nope, they have no intention of giving away source to AltaVista Search (the one they sell). Confirmed by the product manager by phone this morning. Interestingly, the code is that which they run on the main servers. So you know it scales nicely (if you invest in a server farm).

Avi
[htdig] double-byte characters?
Can ht://Dig handle Japanese, Korean or Chinese? Unicode? According to the archives, folks were talking about this last year, but there was no clear resolution. Thanks, Avi
[htdig] Re: Request for a CGI-based admin interface
I think this is a *splendid* idea, as I never seem to find the time to decode htdig config files. I'd very much like to help out with wording and organization, as I'm currently reviewing AltaVista Search, Ultraseek, Verity, Site Server and Excalibur, all of which have some form of browser admin. Ultraseek's is best, though they don't allow results page customization. For that, the best options are the remote search services Atomz and PinPoint.

But please don't design the system so it's always limited to "a few minimal attributes" -- better to put the structures in place for providing complete control.

Avi

At 10:38 PM -0600 11/30/1999, Geoff Hutchison wrote:

Hi,

One of the complaints I hear periodically is that ht://Dig is a bit difficult to administer. I've heard more than one person say that they'd think about writing something to do this, probably web/CGI based. However, the closest thing that I've seen are modified "rundig" scripts, including my multidig scripts.

OK, time to put our money where our mouths are. :-) I was poking around the various open-source development bazaars just now and I saw two offers for htdig on Cosource.com, one for a SQL backend and one for using ht://Dig to index KDevelop.

I'll make an additional entry. I'll pony up $75 to see someone write a useful admin interface for ht://Dig. Minimally, it would be web-based (preferably in perl), let you handle multiple databases, add URLs to a database, set a few minimal attributes like exclude_urls, and mail you when indexing is done with a nice summary message.

I'll be writing up a slightly more detailed description shortly. I'll also include links to all three offers in the "Recent News" section on http://www.htdig.org/

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
Re: [htdig] Frame problems!
At 2:46 PM +0100 10/19/99, Max wrote:

hi Andy, thanks for responding to me. The prob is though that I can't have any links on the "content" page, only on the menus at the side frame.

So if someone finds your site from a search engine, they can't get anywhere from a found page? This seems like a mistake. Even if you put in JavaScript to always bring up the frame context, it won't work for a lot of browsers.

I recommend that you always have basic links on the content page, so people can get *somewhere* from there. All the good web info architects and UI designers I know, including Jakob Nielsen, Jennifer Fleming and Lou Rosenfeld, say "don't let people get into dead ends".

Avi

PS I wrote a white paper for Ultraseek on robot indexing, all about the problems with frames, JavaScript, image maps, and so on. You might find it useful -- it's at http://software.infoseek.com/products/ultraseek/docs/wp-spider/default.htm

To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] containing the single word unsubscribe in the SUBJECT of the message.
Re: [htdig] how it works, question
If you're using a robot to index pages, there is no way for it to know about the contents of your directory. All it can know about is the pages linked from somewhere else -- the local directory listing is not available.

You could make a link page that only has listings of pages, and include that with your starting point. Be sure to set the META ROBOTS tag to NOINDEX,FOLLOW, so the page itself is not indexed.

Hope that helps,
Avi

At 7:31 AM -0700 9/15/99, Sadhunathan Nadesan wrote:

hm

i have been searching the web site, faq's etc, for how htdig actually works, and, it doesn't tell, although gives a clue. rather than digging in the source, can someone confirm this?

htdig only follows links

is that so obvious that everyone assumes it? wasn't obvious to me, if it is true. i expected it to recursively search every sub directory under the start url looking for all .html or text files. now i am beginning to think that perhaps this is a false assumption.

in other words, if i have an index.html page in the start url, and it doesn't happen to have any links to many subdirectories beneath it which also have html pages .. none of the other pages get indexed. is that the case? if so, perhaps this info ought to be placed in the faq. if not, i am still stuck as to why it doesn't find everything under a start url.

the problem being, i have many directories with html pages which are not pointed to by any html page on the site, the links are on other servers (not necessarily being indexed). so i guess i have to list each directory explicitly then???

well, any comments appreciated, thank you

sadhu
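The link-page suggestion above can be automated. The following Python sketch walks a local directory tree and builds a page of links carrying the NOINDEX,FOLLOW robots tag. The function name `build_link_page`, the document root path, and the base URL are hypothetical; it also assumes the directory tree maps directly onto the URL space of the site.

```python
import html
from pathlib import Path
from urllib.parse import quote


def build_link_page(root, base_url):
    """Return an HTML page linking to every .html file found under root."""
    root = Path(root)
    items = []
    for path in sorted(root.rglob("*.html")):
        relative = path.relative_to(root).as_posix()
        # Percent-encode the path, then escape it for use in an attribute.
        href = html.escape(base_url.rstrip("/") + "/" + quote(relative), quote=True)
        items.append('<li><a href="{}">{}</a></li>'.format(href, html.escape(relative)))
    return (
        "<html><head>\n"
        # NOINDEX keeps this page out of the index; FOLLOW lets the
        # robot traverse the links it contains.
        '<meta name="robots" content="noindex,follow">\n'
        "<title>Link page for the indexing robot</title>\n"
        "</head><body><ul>\n" + "\n".join(items) + "\n</ul></body></html>\n"
    )


if __name__ == "__main__":
    # Hypothetical document root and site URL.
    print(build_link_page("/var/www/html", "http://www.example.com/"))
```

The output would be saved somewhere the robot can reach and added to the starting URLs; if it is saved inside the same tree, it will list itself on the next run.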
Re: [htdig] Indexing the Internet
At 2:16 PM -0500 9/6/1999, Geoff Hutchison wrote:

I don't think you need to worry about it not being able to do it. IMHO (yes, I'm biased), there's very little separating ht://Dig from a commercial package technically. Granted, it doesn't have a nice clean admin interface, but if you just want to set something up and leave it alone, you'll likely never notice.

IMNSHO, ht://Dig is better than most of the commercial packages. It can't compete at the very top level (millions of documents, thousands of concurrent requests), but is certainly an excellent search engine.

Now, if only someone would write a browser front end to the admin, I would be a happy camper.

Avi
Re: [htdig] htdig-ing Sites with complex framesets
At 4:03 PM +0100 7/27/1999, Rzepa, Henry wrote:

Is anyone aware of any issues with the use of framesets that call other framesets? There was some discussion of nested framesets on June 11, but I was not sure from that whether htdig had specific problems with such framesets

You might also want to make a policy of including links in the NOFRAMES section of each frameset page. Makes life a lot easier for robots, non-graphical HTTP clients, PDAs and appliances, etc.

Adding NOFRAMES links also makes it possible for blind and visually-impaired people to traverse your site with speaking browsers. It's the Right Thing To Do, and in the US, government and other publicly-funded sites should always consider accessibility requirements.

Check out the Web Accessibility Initiative at http://www.w3.org/WAI/ and the Bobby accessibility checker at http://www.cast.org/bobby/ for more info.

Avi
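To check whether a site follows the NOFRAMES policy suggested above, something like this Python sketch could be run over each frameset page. The function names are hypothetical, and the regular expressions are a deliberately simple approximation of HTML parsing that may miss unusual markup.

```python
import re

# Text between <noframes> and </noframes>, case-insensitive, across lines.
NOFRAMES_RE = re.compile(r"<noframes\b[^>]*>(.*?)</noframes\s*>", re.I | re.S)

# The href value of an anchor tag, quoted or unquoted.
HREF_RE = re.compile(r"""<a\b[^>]*\bhref\s*=\s*["']?([^"'\s>]+)""", re.I)


def noframes_links(page_source):
    """Return the hrefs of all links found inside NOFRAMES sections."""
    links = []
    for block in NOFRAMES_RE.findall(page_source):
        links.extend(HREF_RE.findall(block))
    return links


def needs_noframes_links(page_source):
    """True if the page defines a frameset but offers no NOFRAMES links."""
    has_frameset = re.search(r"<frameset\b", page_source, re.I) is not None
    return has_frameset and not noframes_links(page_source)


if __name__ == "__main__":
    sample = (
        '<frameset cols="20%,80%">'
        '<frame src="menu.html"><frame src="main.html">'
        '<noframes><a href="menu.html">Menu</a> <a href="main.html">Main</a></noframes>'
        "</frameset>"
    )
    print(noframes_links(sample))
    print(needs_noframes_links(sample))
```

A page flagged by `needs_noframes_links` is one where a robot or a speaking browser has nothing to follow.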