Hi,

Would it be possible to add to Nutch the ability to crawl and download a
page only if it has been updated since the last crawl? I read some time
back that there were plans to include such a feature, and IMO it would be
a very useful one to have. It does of course depend on the server sending
a "Last-Modified" timestamp for the page being crawled, which I believe is
not mandatory - still, sites that do set it would benefit.
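
For illustration, here is a minimal sketch of the kind of conditional
fetch I mean, using plain java.net rather than Nutch's actual protocol
plugins - the server replies "304 Not Modified" when our copy is current,
so the body is downloaded only when the page has changed:

  import java.io.InputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class ConditionalFetch {
    /** Returns the page body, or null if unchanged since lastCrawlTime. */
    static byte[] fetchIfModified(String url, long lastCrawlTime)
        throws Exception {
      HttpURLConnection conn =
          (HttpURLConnection) new URL(url).openConnection();
      // Ask the server to skip the body if our copy is still current.
      conn.setIfModifiedSince(lastCrawlTime);
      if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
        return null;                  // unchanged - nothing to download
      }
      try (InputStream in = conn.getInputStream()) {
        return in.readAllBytes();     // modified, or server ignores header
      }
    }
  }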

Thanks,
-sroy

On Mon, Nov 9, 2009 at 9:54 PM, Andrzej Bialecki <a...@getopt.org> wrote:

> Hi all,
>
> ApacheCon is over and our 1.0 release has been out for some time now, so I
> think it's a good moment to discuss what the next steps in Nutch
> development should be.
>
> Let me share the topics I identified and presented in my ApacheCon slides,
> plus some others worth discussing based on various conversations I had
> there and on our mailing list:
>
> 1. Avoid duplication of effort
> ------------------------------
> Currently we spend significant effort on implementing functionality that
> other projects are dedicated to. Instead of doing the same work, and
> sometimes poorly, we should concentrate on delegating and reusing:
>
> * Use Tika for content parsing: this will require some effort and
> collaboration with the Tika project, to improve Tika's ability to handle
> more complex formats well (e.g. hierarchical compound documents such as
> archives, mailboxes, RSS), and to contribute any missing parsers (e.g.
> parse-swf).
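>
> To make the delegation concrete, here is a minimal sketch of running
> arbitrary content through Tika's auto-detecting facade (standard Tika
> API; illustrative only, not our eventual integration code):
>
>   import java.io.InputStream;
>   import org.apache.tika.metadata.Metadata;
>   import org.apache.tika.parser.AutoDetectParser;
>   import org.apache.tika.parser.ParseContext;
>   import org.apache.tika.sax.BodyContentHandler;
>
>   public class TikaParseSketch {
>     /** Extracts plain text plus metadata from any format Tika knows. */
>     static String parse(InputStream content, Metadata metadata)
>         throws Exception {
>       BodyContentHandler handler = new BodyContentHandler(-1); // no limit
>       // MIME type is auto-detected: one call covers PDF, Office, HTML...
>       new AutoDetectParser().parse(content, handler, metadata,
>           new ParseContext());
>       return handler.toString();
>     }
>   }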
>
> * Use Solr for indexing & search: it is hard to justify the effort of
> developing and maintaining our own search server - Solr offers much more
> functionality, configurability, performance and ease of integration than our
> relatively primitive search server. Our integration with Solr needs to be
> improved so that it's easier to set up and operate.
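>
> As a taste, pushing a fetched page into Solr via SolrJ is only a few
> lines (a sketch using the Solr 1.x SolrJ classes; the field names here
> are hypothetical, not a proposed schema):
>
>   import org.apache.solr.client.solrj.SolrServer;
>   import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
>   import org.apache.solr.common.SolrInputDocument;
>
>   public class SolrIndexSketch {
>     public static void main(String[] args) throws Exception {
>       SolrServer solr =
>           new CommonsHttpSolrServer("http://localhost:8983/solr");
>       SolrInputDocument doc = new SolrInputDocument();
>       doc.addField("id", "http://example.com/");   // hypothetical fields
>       doc.addField("content", "extracted page text goes here");
>       solr.add(doc);
>       solr.commit();        // make the new document searchable
>     }
>   }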
>
> * Use database-like storage abstraction: this may seem like a serious
> departure from the current architecture, but I don't mean that we should
> switch to an SQL DB ... what I mean is that we should provide an option to
> use HBase, as well as the current plain MapFile-s (and perhaps other types
> of DBs, such as Berkeley DB or SQL, if it makes sense) as our storage.
> There is a very promising initial port of Nutch to HBase, currently tied
> closely to the HBase API (which is both good and bad) - it provides several
> improvements over our current storage, so I think it's worth using as the
> new default, but let's see if we can make it more abstract.
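>
> To illustrate what "more abstract" could look like, an entirely
> hypothetical interface (all names invented here, just to show the shape):
>
>   /** Stand-in for whatever record class we settle on. */
>   public class WebPage { /* fetch time, status, content, metadata... */ }
>
>   /** Hypothetical storage abstraction - names illustrative only. */
>   public interface WebPageStore {
>     WebPage get(String url) throws java.io.IOException;
>     void put(String url, WebPage page) throws java.io.IOException;
>     void delete(String url) throws java.io.IOException;
>     /** Ordered scan over a key range, e.g. to generate fetchlists. */
>     Iterable<WebPage> scan(String startUrl, String stopUrl)
>         throws java.io.IOException;
>   }
>
> An HBase-backed implementation would map this onto a table and column
> families, the current implementation onto sorted MapFile-s, and so on.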
>
> * Plugins: the initial OSGi port looks good, but I'm not yet sure whether
> the benefits of OSGi outweigh the cost of such a change ...
>
> * Shard management: this is currently an Achilles' heel of Nutch, where
> users are left on their own ... If we switch to using HBase then at least on
> the crawling side the shard management will become much easier. This still
> leaves the problem of deploying new content to search server(s). The
> candidate framework for this side of the shard management is Katta + patches
> provided by Ted Dunning (see ???). If we switch to using Solr we would
> also have to use the Katta / Solr integration, and perhaps the Solr /
> Hadoop integration as well. This is a complex mix of half-ready components
> that needs to be thought through carefully ...
>
> * Crawler Commons: during our Crawler MeetUp all representatives agreed
> that we should factor out a few components that are nearly the same across
> all crawler projects, collaborate on their development, and use them as an
> external dependency. The candidate components are:
>
>  - robots.txt parsing (see the sketch after this list)
>  - URL filtering and normalization
>  - page signature (fingerprint) implementations
>  - page template detection & removal (aka. main content extraction)
>  - possibly others, like URL redirection tracking, PageRank calculation,
> crawler trap detection etc.
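>
> To show how small some of these shared pieces could start, here is a
> deliberately simplified robots.txt check - plain prefix matching of
> Disallow rules for the "*" agent; a real shared component would also
> handle Allow, wildcards, Crawl-delay, multiple agent names etc.:
>
>   import java.util.ArrayList;
>   import java.util.List;
>
>   /** Toy robots.txt rules: Disallow prefixes from the "*" section. */
>   public class SimpleRobotRules {
>     private final List<String> disallowed = new ArrayList<String>();
>
>     public SimpleRobotRules(String robotsTxt) {
>       boolean starSection = false;
>       for (String line : robotsTxt.split("\n")) {
>         line = line.trim();
>         if (line.toLowerCase().startsWith("user-agent:")) {
>           starSection = line.substring(11).trim().equals("*");
>         } else if (starSection
>             && line.toLowerCase().startsWith("disallow:")) {
>           String path = line.substring(9).trim();
>           if (!path.isEmpty()) disallowed.add(path); // empty = allow all
>         }
>       }
>     }
>
>     /** True if the path is not covered by any Disallow prefix. */
>     public boolean isAllowed(String path) {
>       for (String prefix : disallowed) {
>         if (path.startsWith(prefix)) return false;
>       }
>       return true;
>     }
>   }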
>
> 2. Make Nutch easier to use
> ---------------------------
> This, as you may remember from our earlier discussions, raises the
> question: who is the target audience of Nutch?
>
> In my opinion, the main users of Nutch are vertical search engines, and
> this is the audience that we should cater to. There are many reasons for
> this:
>
> - Nutch is too complex and too heavy for those who need to crawl up to a
> few thousand pages. Now that the Droids project exists, it's probably not
> worth the effort to attempt a complete re-design of Nutch so that it fits
> the needs of this group - Nutch is based on map-reduce, we're not likely
> to change that, and so there will always be significant overhead for small
> jobs. I'm not saying we shouldn't make Nutch easier to use, but for small
> crawls Nutch is overkill. Also, in many cases these users do no frontier
> discovery and expansion at all without realizing it - what they really
> need is Solr.
>
> - at the other end of the spectrum, there are very few companies that want
> to do wide, web-scale crawling - this is costly, and requires a solid
> business plan and serious funding. These users are prepared to spend
> significant effort on customizations and problem-solving anyway, or they
> may want to use only some parts of Nutch. Often they are also not too
> eager to contribute back to the project - either because of their
> proprietary nature or because their customizations are not useful to a
> general audience.
>
> The remaining group is interested in medium-size, high-quality crawling
> (focused, with good spam & junk controls) - that is, either enterprise
> search or vertical search. We should make Nutch an attractive platform for
> such users, and we should discuss what this entails. Also, if we refactor
> Nutch in the way I described above, it will be easier for such users to
> contribute back to Nutch and other related projects.
>
> 3. Provide a platform for solving the really interesting issues
> ---------------------------------------------------------------
> Nutch has many bits and pieces that implement really smart algorithms and
> heuristics to solve difficult issues that occur in crawling. The problem
> is that they are often well hidden and poorly documented, and their
> interaction with the rest of the system is far from obvious. Sometimes
> this is due to premature performance optimization, in other cases simply
> to poorly abstracted design. Examples include the OPIC scoring, meta-tags
> & metadata handling, deduplication, redirection handling, etc.
>
> Even though these components are usually implemented as plugins, this lack
> of transparency and poor design makes it difficult to experiment with Nutch.
> I believe that improving this area will result in many more users
> contributing back to the project, both from business and from academia.
>
> And there are quite a few interesting challenges to solve:
>
> * crawl scheduling, i.e. determining the order and composition of
> fetchlists to maximize the crawling speed.
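>
> Even the basic politeness constraint shapes fetchlist composition. A toy
> sketch of interleaving candidate URLs round-robin by host, so fetcher
> threads stay busy without hammering any single server (illustrative only,
> not how our Generator actually works):
>
>   import java.net.URL;
>   import java.util.*;
>
>   public class FetchlistSketch {
>     static List<String> compose(List<String> candidates) throws Exception {
>       // Bucket the candidates by host, preserving discovery order.
>       Map<String, Queue<String>> byHost =
>           new LinkedHashMap<String, Queue<String>>();
>       for (String u : candidates) {
>         String host = new URL(u).getHost();
>         Queue<String> q = byHost.get(host);
>         if (q == null) byHost.put(host, q = new ArrayDeque<String>());
>         q.add(u);
>       }
>       // Emit one URL per host per round until all buckets drain.
>       List<String> fetchlist = new ArrayList<String>();
>       while (!byHost.isEmpty()) {
>         Iterator<Queue<String>> it = byHost.values().iterator();
>         while (it.hasNext()) {
>           Queue<String> q = it.next();
>           fetchlist.add(q.poll());
>           if (q.isEmpty()) it.remove();
>         }
>       }
>       return fetchlist;
>     }
>   }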
>
> * spam & junk detection (I won't go into details on this; there is a ton
> of literature on the subject)
>
> * crawler trap handling (e.g. the classic calendar page that generates an
> infinite number of pages).
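>
> The usual first line of defense is a regex-based URL filter. For
> instance, a rule against a path segment repeating three times in a row
> catches many runaway calendar and navigation loops (a toy sketch, not our
> actual regex-urlfilter configuration):
>
>   import java.util.regex.Pattern;
>
>   public class TrapFilter {
>     // Reject URLs where the same path segment occurs 3+ times in a row,
>     // e.g. http://site/cal/2009/2009/2009/... from a looping link.
>     private static final Pattern REPEATED =
>         Pattern.compile("(/[^/]+)\\1\\1");
>     private static final int MAX_URL_LENGTH = 512;
>
>     static boolean accept(String url) {
>       return url.length() <= MAX_URL_LENGTH
>           && !REPEATED.matcher(url).find();
>     }
>   }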
>
> * enterprise-specific ranking and scoring. This includes users' feedback
> (explicit and implicit, e.g. click-throughs)
>
> * pagelet-level crawling (e.g. portals, RSS feeds, discussion fora)
>
> * near-duplicate detection, and the closely related issue of extracting
> the main content from a templated page.
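>
> A classic starting point is comparing sets of word shingles, so two
> copies of the same article wrapped in different boilerplate still score
> as near-duplicates (a toy sketch; at scale one would use minhash or
> simhash instead of pairwise comparison):
>
>   import java.util.HashSet;
>   import java.util.Set;
>
>   public class ShingleSketch {
>     /** Jaccard similarity of the 4-word shingle sets of two texts. */
>     static double similarity(String a, String b) {
>       Set<String> sa = shingles(a), sb = shingles(b);
>       Set<String> union = new HashSet<String>(sa);
>       union.addAll(sb);
>       sa.retainAll(sb);               // sa is now the intersection
>       return union.isEmpty() ? 1.0 : (double) sa.size() / union.size();
>     }
>
>     static Set<String> shingles(String text) {
>       String[] w = text.toLowerCase().split("\\s+");
>       Set<String> out = new HashSet<String>();
>       for (int i = 0; i + 4 <= w.length; i++) {
>         out.add(w[i] + " " + w[i+1] + " " + w[i+2] + " " + w[i+3]);
>       }
>       return out;
>     }
>   }
>
> Pages scoring above some threshold (say 0.9) would be treated as
> near-duplicates.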
>
> * URL aliasing (e.g. www.a.com == a.com == a.com/index.html ==
> a.com/default.asp), and what happens with inlinks to such aliased pages.
> Also related to this is the problem of temporary/permanent redirects and
> complete mirrors.
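>
> Some of these aliases can be collapsed by an aggressive normalizer before
> they ever reach the crawldb. A sketch - note that every rule here is a
> policy decision, e.g. stripping "index.html" is not always correct, and
> query strings are ignored entirely:
>
>   import java.net.URL;
>
>   public class AliasNormalizer {
>     static String normalize(String url) throws Exception {
>       URL u = new URL(url);
>       String host = u.getHost().toLowerCase();
>       if (host.startsWith("www.")) host = host.substring(4);
>       String path = u.getPath().isEmpty() ? "/" : u.getPath();
>       // Treat common default documents as the site root - risky!
>       if (path.equals("/index.html") || path.equals("/default.asp")) {
>         path = "/";
>       }
>       return u.getProtocol() + "://" + host + path;
>     }
>   }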
>
> Etc, etc ... I'm pretty sure there are many others. Let's make Nutch an
> attractive platform to develop and experiment with such components.
>
> -----------------
> Briefly ;) that's what comes to my mind when I think about the future of
> Nutch. I invite you all to share your thoughts and suggestions!
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


-- 
Subhojit Roy
Profound Technologies
(Search Solutions based on Open Source)
email: s...@profound.in
http://www.profound.in
