Hi,

Would it be possible to include in Nutch the ability to crawl & download
a page only if it has been updated since the last crawl? I read some time
back that there were plans to include such a feature, and it would be a
very useful one to have, IMO. This of course depends on the "Last-Modified"
timestamp being present in the HTTP response for the page being crawled,
which I believe is not mandatory. Still, the sites that do set it would
benefit.
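For illustration, here is a rough sketch of the kind of conditional fetch
I mean (plain java.net, not Nutch code; the server is asked via
If-Modified-Since to answer 304 when the page is unchanged):

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ConditionalFetch {
      /** Returns the page body, or null if unchanged since lastFetchTime. */
      public static byte[] fetchIfModified(String pageUrl, long lastFetchTime)
          throws Exception {
        HttpURLConnection conn =
            (HttpURLConnection) new URL(pageUrl).openConnection();
        // Ask the server to skip the body if the page is unchanged.
        conn.setIfModifiedSince(lastFetchTime);
        if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
          return null; // 304: nothing to download this cycle
        }
        InputStream in = conn.getInputStream();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        for (int n; (n = in.read(buf)) != -1; ) {
          out.write(buf, 0, n);
        }
        in.close();
        return out.toByteArray();
      }
    }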
Thanks,
-sroy

On Mon, Nov 9, 2009 at 9:54 PM, Andrzej Bialecki <a...@getopt.org> wrote:
> Hi all,
>
> The ApacheCon is over, and our release 1.0 has been out for some time,
> so I think it's a good moment to discuss what the next steps in Nutch
> development should be.
>
> Let me share the topics I identified and presented in the ApacheCon
> slides, plus some topics that are worth discussing based on various
> conversations I had there and the discussions we have had on our
> mailing list:
>
> 1. Avoid duplication of effort
> ------------------------------
> Currently we spend significant effort on implementing functionality that
> other projects are dedicated to. Instead of doing the same work, and
> sometimes doing it poorly, we should concentrate on delegating and
> reusing:
>
> * Use Tika for content parsing: this will require some effort and
> collaboration with the Tika project, to improve Tika's ability to handle
> more complex formats well (e.g. hierarchical compound documents such as
> archives, mailboxes, RSS), and to contribute any missing parsers (e.g.
> parse-swf). A small sketch of what a Tika-based parse could look like
> follows at the end of this section.
>
> * Use Solr for indexing & search: it is hard to justify the effort of
> developing and maintaining our own search server - Solr offers much more
> functionality, configurability, performance and ease of integration than
> our relatively primitive search server. Our integration with Solr needs
> to be improved so that it's easier to set up and operate. (See the
> indexing sketch below.)
>
> * Use a database-like storage abstraction: this may seem like a serious
> departure from the current architecture, but I don't mean that we should
> switch to an SQL DB ... what this means is that we should provide an
> option to use HBase, as well as the current plain MapFiles (and perhaps
> other types of DBs, such as Berkeley DB or SQL, if it makes sense) as
> our storage. There is a very promising initial port of Nutch to HBase,
> which is currently closely tied to the HBase API (which is both good and
> bad) - it provides several improvements over our current storage, so I
> think it's worth using as the new default, but let's see if we can make
> it more abstract. (A storage sketch follows below.)
>
> * Plugins: the initial OSGi port looks good, but I'm not yet sure
> whether the benefits of OSGi outweigh the cost of this change ...
>
> * Shard management: this is currently an Achilles' heel of Nutch, where
> users are left on their own ... If we switch to using HBase then at
> least on the crawling side shard management will become much easier.
> This still leaves the problem of deploying new content to the search
> server(s). The candidate framework for this side of shard management is
> Katta + the patches provided by Ted Dunning (see ???). If we switch to
> using Solr we would also have to use the Katta/Solr integration, and
> perhaps the Solr/Hadoop integration as well. This is a complex mix of
> half-ready components that needs to be thought through carefully ...
>
> * Crawler Commons: during our Crawler MeetUp all representatives agreed
> that we should collect a few components that are nearly the same across
> all projects, collaborate on their development, and use them as an
> external dependency. The candidate components are:
>
>  - robots.txt parsing
>  - URL filtering and normalization
>  - page signature (fingerprint) implementations (a sketch follows below)
>  - page template detection & removal (aka main content extraction)
>  - possibly others, like URL redirection tracking, PageRank calculation,
>    crawler trap detection, etc.
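> To make the Tika item above concrete, here is a minimal sketch of
> delegating parsing to Tika (a hypothetical helper assuming Tika's
> AutoDetectParser and BodyContentHandler; the actual Nutch integration
> would be more involved):
>
>     import java.io.InputStream;
>     import org.apache.tika.metadata.Metadata;
>     import org.apache.tika.parser.AutoDetectParser;
>     import org.apache.tika.sax.BodyContentHandler;
>
>     public class TikaParseSketch {
>       /** Extracts plain text from arbitrary content; metadata such as
>        *  content type and title is collected as a side effect. */
>       public static String parseToText(InputStream content)
>           throws Exception {
>         BodyContentHandler handler = new BodyContentHandler(-1); // no limit
>         Metadata metadata = new Metadata();
>         // AutoDetectParser picks a concrete parser from the MIME type.
>         new AutoDetectParser().parse(content, handler, metadata);
>         return handler.toString();
>       }
>     }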
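> Likewise, pushing parsed documents into Solr via SolrJ is only a few
> lines. A sketch (the field names and URL here are made-up examples, not
> Nutch's actual schema):
>
>     import org.apache.solr.client.solrj.SolrServer;
>     import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
>     import org.apache.solr.common.SolrInputDocument;
>
>     public class SolrIndexSketch {
>       public static void index(String url, String title, String text)
>           throws Exception {
>         SolrServer server =
>             new CommonsHttpSolrServer("http://localhost:8983/solr");
>         SolrInputDocument doc = new SolrInputDocument();
>         doc.addField("id", url);       // hypothetical field names
>         doc.addField("title", title);
>         doc.addField("content", text);
>         server.add(doc);
>         server.commit(); // make the document searchable
>       }
>     }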
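> And a sketch of what the HBase-backed storage could look like at its
> simplest (assuming the 0.20-era HBase client API; table and column names
> are made up for illustration):
>
>     import org.apache.hadoop.hbase.HBaseConfiguration;
>     import org.apache.hadoop.hbase.client.HTable;
>     import org.apache.hadoop.hbase.client.Put;
>     import org.apache.hadoop.hbase.util.Bytes;
>
>     public class HBaseStoreSketch {
>       public static void storePage(String url, byte[] content)
>           throws Exception {
>         HTable table = new HTable(new HBaseConfiguration(), "webpage");
>         Put put = new Put(Bytes.toBytes(url)); // row key = URL
>         put.add(Bytes.toBytes("p"), Bytes.toBytes("content"), content);
>         table.put(put);
>         table.close();
>       }
>     }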
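> As for page signatures: the simplest exact-duplicate fingerprint is just
> a digest of the raw content, roughly what an MD5-based signature does
> (near-duplicate signatures are of course smarter than this):
>
>     import java.math.BigInteger;
>     import java.security.MessageDigest;
>
>     public class PageSignatureSketch {
>       /** Naive exact-duplicate signature: the MD5 of the raw bytes. */
>       public static String signature(byte[] content) throws Exception {
>         byte[] digest = MessageDigest.getInstance("MD5").digest(content);
>         return String.format("%032x", new BigInteger(1, digest));
>       }
>     }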
> 2. Make Nutch easier to use
> ---------------------------
> This, as you may remember from our earlier discussions, raises the
> question: who is the target audience of Nutch?
>
> In my opinion, the main users of Nutch are vertical search engines, and
> this is the audience that we should cater to. There are many reasons for
> this:
>
> - Nutch is too complex and too heavy for those who need to crawl up to a
> few thousand pages. Now that the Droids project exists, it's probably
> not worth the effort to attempt a complete re-design of Nutch so that it
> fits the needs of this group - Nutch is based on map-reduce, and it's
> not likely we would want to change that, so there will always be
> significant overhead for small jobs. I'm not saying we should not make
> Nutch easier to use, but for small crawls Nutch is overkill. Also, in
> many cases these users don't realize that they don't do any frontier
> discovery and expansion, and that what they really need is Solr.
>
> - at the other end of the spectrum, there are very few companies that
> want to do wide, web-scale crawling - this is costly, and requires a
> solid business plan and serious funding. These users are in any case
> prepared to spend significant effort on customizations and
> problem-solving, or they may want to use only some parts of Nutch. Often
> they are also not too eager to contribute back to the project - either
> because of their proprietary nature or because their customizations are
> not useful to a general audience.
>
> The remaining group is interested in medium-size, high-quality crawling
> (focused, with good spam & junk controls) - that is, either enterprise
> search or vertical search. We should make Nutch an attractive platform
> for such users, and we should discuss what this entails. Also, if we
> refactor Nutch in the way I described above, it will be easier for such
> users to contribute back to Nutch and other related projects.
>
> 3. Provide a platform for solving the really interesting issues
> ---------------------------------------------------------------
> Nutch has many bits and pieces that implement really smart algorithms
> and heuristics to solve difficult issues that occur in crawling. The
> problem is that they are often well hidden and poorly documented, and
> their interaction with the rest of the system is far from obvious.
> Sometimes this is due to premature performance optimization; in other
> cases it is just poorly abstracted design. Examples include the OPIC
> scoring, meta-tag & metadata handling, deduplication, redirection
> handling, etc.
>
> Even though these components are usually implemented as plugins, this
> lack of transparency and poor design makes it difficult to experiment
> with Nutch. I believe that improving this area will result in many more
> users contributing back to the project, both from business and from
> academia.
>
> And there are quite a few interesting challenges to solve:
>
> * crawl scheduling, i.e. determining the order and composition of
> fetchlists to maximize crawling speed.
>
> * spam & junk detection (I won't go into detail here; there is a vast
> literature on the subject).
>
> * crawler trap handling (e.g. the classic calendar page that generates
> an infinite number of pages); a toy example follows after this list.
>
> * enterprise-specific ranking and scoring. This includes users' feedback
> (explicit and implicit, e.g. click-throughs).
>
> * pagelet-level crawling (e.g. portals, RSS feeds, discussion fora).
>
> * near-duplicate detection, and the closely related issue of extracting
> the main content from a templated page.
>
> * URL aliasing (e.g. www.a.com == a.com == a.com/index.html ==
> a.com/default.asp), and what happens with inlinks to such aliased pages
> (a sketch of trivial alias normalization follows below). Also related to
> this is the problem of temporary/permanent redirects and complete
> mirrors.
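> To give a flavour of how simple a first-cut crawler trap heuristic can
> be, here is a toy detector for the "infinite calendar" class of traps
> (the thresholds are arbitrary; real detection would need per-host
> statistics):
>
>     import java.net.URL;
>     import java.util.Arrays;
>     import java.util.HashSet;
>
>     public class TrapHeuristicSketch {
>       /** Flags URLs with suspiciously deep or repetitive paths. */
>       public static boolean looksLikeTrap(URL url) {
>         String[] segments = url.getPath().split("/");
>         if (segments.length > 10) return true; // suspiciously deep
>         HashSet<String> unique =
>             new HashSet<String>(Arrays.asList(segments));
>         // Many repeated segments suggest a self-referencing structure.
>         return segments.length - unique.size() > 3;
>       }
>     }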
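> Similarly, the trivial cases of URL aliasing can be collapsed with a
> couple of rewrite rules. A toy sketch only - the patterns below are
> illustrative, and whether "www." really is an alias must be verified per
> site:
>
>     import java.util.regex.Pattern;
>
>     public class AliasNormalizerSketch {
>       private static final Pattern INDEX_PAGE =
>           Pattern.compile("/(index\\.html?|default\\.asp)$");
>
>       /** Collapses a few trivial alias forms onto one canonical URL. */
>       public static String normalize(String url) {
>         String u = INDEX_PAGE.matcher(url).replaceFirst("/");
>         // Treat a "www." host prefix as an alias of the bare domain.
>         return u.replaceFirst("^(https?://)www\\.", "$1");
>       }
>     }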
> Etc., etc. ... I'm pretty sure there are many others. Let's make Nutch
> an attractive platform on which to develop and experiment with such
> components.
>
> -----------------
> Briefly ;) that's what comes to my mind when I think about the future of
> Nutch. I invite you all to share your thoughts and suggestions!
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  || |   Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>

--
Subhojit Roy
Profound Technologies
(Search Solutions based on Open Source)
email: s...@profound.in
http://www.profound.in