U can also use the html parser API available from Java 1.4.2. I tried it last week with a simple program which retrieve html files and displaying all HREF links in it. I done it within a day.
sebastian On Thu, 2004-04-22 at 12:18, Stephane James Vaucher wrote: > How big is the site? > > I mostly use an inhouse solution, but I've used HttpUnit for web scrapping > small sites (because of its high-level api). > > Here is a hello world example: > http://wiki.apache.org/jakarta-lucene/HttpUnitExample > > For a small/simple site, small modifications to this class could suffice. > IT WILL NOT function on large sites because of memory problems. > > For larger sites, there are questions like: > > - memory: > For example, spidering all links on every page can lead to visiting too > many links. Keeping all visited links in memory can be problematic > > - noise > If you get every page on your web site, you might be adding noise to the > search engine. Spider navigation rules can help out, like saying that you > should only follow links/index documents of a specific form like > www.mysite.com/news/article.jsp?articleid=xxx > > - speed: > Too much speed can be bad if you doing 100 hits/sec on a site could hurt > it (especially if it's not you who are the webmaster) > Too little speed can be bad if you want to make sure you quickly get new > pages. > > - categorisation: > You might want to separate information in your index. For example, you > might want a user to do a search in the documentation section or in the > press release section. This categorisation can be done by specifying > sections to the site, or a subsequent analysis of available docs. > > -up-to-date information > You'll want to think of your update schedule, so that if you add a new > page, it gets indexed quickly. This problem also occurs when you modify an > existing page, you might want the modification to be detected rapidly. > > HTH, > sv > > On Thu, 22 Apr 2004, Tuan Jean Tee wrote: > > > Have anyone implemented any open source web crawler with Lucene? I have > > a dynamic website and are looking at putting in a search tools. Your > > advice is very much appreciated. > > > > Thank you. > > > > > > IMPORTANT - > > > > This email and any attachments are confidential and may be privileged in > > which case neither is intended to be waived. If you have received this > > message in error, please notify us and remove it from your system. It is > > your responsibility to check any attachments for viruses and defects > > before opening or sending them on. Where applicable, liability is > > limited by the Solicitors Scheme approved under the Professional > > Standards Act 1994 (NSW). Minter Ellison collects personal information > > to provide and market our services. For more information about use, > > disclosure and access, see our privacy policy at www.minterellison.com. > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]