Fetcher for constrained crawls

2005-08-22 Thread Kelvin Tan
I've been working on some changes to crawling to facilitate its use as a 
non-whole-web crawler, and would like to gauge interest on this list in 
including it somewhere in the Nutch repo, hopefully before the map-red branch 
gets merged in.

It is basically a partial re-write of the whole fetching mechanism, borrowing 
large chunks of code here and there.

Features include:
- Customizable seed inputs, e.g. seed a crawl from a file, a database, a Nutch 
FetchList, etc.
- Customizable crawl scopes, e.g. crawl the seed URLs and only the URLs within 
their domains (this can already be accomplished manually with RegexURLFilter, 
but what if there are 200,000 seed URLs?), or crawl the seed URL domains plus 1 
external link (not possible with the current filter mechanism)
- Online fetchlist building (as opposed to Nutch's offline method), with 
customizable strategies for building a fetchlist. The default implementation 
gives priority to hosts with a larger number of pages to crawl. Note that 
offline fetchlist building is ok too.
- Runs continuously until all links are crawled
- Customizable fetch output mechanisms, like output to a file, to the WebDB, or 
even no output at all (if we're just implementing a link-checker, for example; 
see the sketch after this list)
- Fully utilizes HTTP 1.1 connection persistence and request pipelining
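
To make the output point concrete, a bare-bones output mechanism for the 
link-checker case might look something like the sketch below. It assumes the 
PostFetchProcessor interface shown further down this thread 
(process(FetcherOutput, Content, Parse) and close()); the package names and the 
Content.getUrl() accessor are assumptions, not guaranteed to match OC's exact API.

import java.io.IOException;

import org.apache.nutch.fetcher.FetcherOutput;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;
import org.supermind.crawl.PostFetchProcessor;   // package assumed

/** Hypothetical sketch: log fetched URLs instead of writing a Nutch segment. */
public class LoggingPostFetchProcessor implements PostFetchProcessor {

  public void process(FetcherOutput fo, Content content, Parse parse)
      throws IOException {
    // Content.getUrl() is assumed here; adapt to whatever accessor OC exposes.
    System.out.println("fetched: " + content.getUrl());
  }

  public void close() throws IOException {
    // nothing to release
  }
}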

It is fully compatible with Nutch as it is, i.e. given a Nutch fetchlist, the 
new crawler can produce a Nutch segment. However, if you don’t need that at 
all, and are just interested in Nutch as a crawler, then that’s ok too!

It is a drop-in replacement for the Nutch crawler, and compiles with the 
recently released 0.7 jar.

Some disclaimers:
It was never designed to be a superset replacement for the Nutch crawler. 
Rather, it is tailored to the fairly specific requirements of what I believe is 
called constrained crawling. It uses the Spring Framework (for easy 
customization of implementation classes) and JDK 5 features (the new loop 
syntax, autoboxing, generics, etc.). These two points sped up development but 
probably make it an untasty Nutch acquisition.. ;-) It shouldn't be tough to do 
something about that, though.

One of the areas where the Nutch crawler could use improvement is that it's 
really difficult to extend and customize. With the addition of interfaces and 
beans, it's possible for developers to write their own mechanism for fetchlist 
prioritization, or to use a B-Tree as the backing implementation of the 
database of crawled URLs. I'm using Spring to make it easy to swap 
implementations and to keep the coupling loose.

There are some places where existing Nutch functionality is duplicated in some 
way to allow for slight modifications as opposed to patching the Nutch classes. 
The rationale behind this approach was to simplify integration - much easier to 
have Our Crawler as a separate jar which depends on the Nutch jar. Furthermore 
if it doesn't get accepted into Nutch, no rewriting or patching of Nutch 
sources needs to be done.

It's my belief that if you're using Nutch for anything but whole-web crawling 
and need to make even small changes to the way crawling is performed, you'll 
find Our Crawler helpful.

I consider the current code beta quality. I've run it on smallish crawls (200k+ 
URLs) and things seem to be working ok, but it's nowhere near production quality.

Some related blog entries:

Improving Nutch for constrained crawls
http://www.supermind.org/index.php?p=274

Reflections on modifying the Nutch crawler
http://www.supermind.org/index.php?p=283

Limitations of OC
http://www.supermind.org/index.php?p=284

Even if we decide not to include it in the Nutch repo, the code will still be 
released under the APL. I'm in the process of adding a bit more documentation 
and a shell script for running it, and will release the files over the next 
couple of days.

Cheers,
Kelvin

http://www.supermind.org



Re: Fetcher for constrained crawls

2005-08-22 Thread Kelvin Tan
Sorry, I realized I needed to qualify that: the plugin framework is nice, but I 
mean customizing non-extension-point fetcher behaviour.

k

On Tue, 23 Aug 2005 00:02:26 -0400, Kelvin Tan wrote:
> One of the areas the Nutch Crawler can use with improvement is in
> the fact that its really difficult to extend and customize.




Re: (NUTCH-84) Fetcher for constrained crawls

2005-08-24 Thread Kelvin Tan
Instructions for running:

1. Change build.properties to your location of nutch

2. ant nutch-deploy
Ant copies relevant jars to nutch_home/lib, and beans.xml to nutch_home/conf

3. Edit nutch_home/conf/beans.xml (the Spring framework conf file)
Important values to change are obviously the ones involving file paths. In 
particular, change the location of the file for seeding the crawl.
Nutch-style one URL per line please.

Look also at SizeConstrainedFLFilter. This limits the size of the crawl to the 
number you put there (great for test runs, but not so hot for whole-web crawls).

4. Fire up cygwin or bash.
Go to nutch home, and run
./nutch org.supermind.crawl.CrawlTool

This should start the crawler (and hopefully it'll run till completion!)

For the space of a _week_ or so, it's ok to mail me privately if you need help 
getting things up and running: kelvin at supermind dot org.

Javadocs included in the zip and also available online at 
http://www.supermind.org/code/oc/api/index.html.

Again, I'd like to emphasize the beta nature of the code, so please be 
forgiving.

Cheers,
k

On Thu, 25 Aug 2005 01:06:09 +0200 (CEST), Kelvin Tan (JIRA) wrote:
> [ http://issues.apache.org/jira/browse/NUTCH-84?page=all ]
>
> Kelvin Tan updated NUTCH-84:
>
> Attachment: oc-0.3.zip
>
> Javadocs included in the zip and also available online at
> http://www.supermind.org/code/oc/api/index.html.
>
> Code is released under APL, but I've also included the Spring jars
> you'll need to run it.
>
>> Fetcher for constrained crawls
>> --
>>
>> Key: NUTCH-84
>> URL: http://issues.apache.org/jira/browse/NUTCH-84 Project: Nutch
>> Type: Improvement Components: fetcher Versions: 0.7 Reporter:
>> Kelvin Tan Priority: Minor Attachments: oc-0.3.zip
>>
>> As posted http://marc.theaimsgroup.com/?l=nutch-
>> developers&m=112476980602585&w=2




Re: Implementation of (NUTCH-84) Fetcher for constrained crawls

2005-08-26 Thread Kelvin Tan
Wang Wen is having some build problems with the code I uploaded to Jira. I'm 
wondering if anyone else is facing the same problems?

kelvin

On Thu, 25 Aug 2005 18:03:30 -0700, Wang Wen wrote:
> Hi Kelvin:
>
> It works a bit, but gives me syntax error;
>
> "
> Buildfile: build.xml
>
> compile:
> [javac] Compiling 50 source files to
> E:\programs\cygwin\home\fji\versionControl\nutch_V07_OC_test\nutch\OC31\classes
> [javac] E:\programs\cygwin\home\fji\versionControl\nutch_V07_OC_test\nutch\OC31\src\java\org\supermind\crawl\CrawlSeedSource.java:21:
>  expected
> [javac]   Iterator getSeedURLs() throws IOException;
>
> ...
>
> "
>
> Any idea that I can do more work on that?
>
> thanks,
>
> Wang,
>
>
>> From: Kelvin Tan <[EMAIL PROTECTED]>
>> To: Wang Wen <[EMAIL PROTECTED]>
>> Subject: Re: Implementation of (NUTCH-84) Fetcher for constrained
>> crawls Date: Thu, 25 Aug 2005 20:49:27 -0400
>>
>> Hey Wang, I'm not really sure why javac is tripping up. How about
>> trying removing the target="1.5" attribute?
>>
>> So it should look like
>>
>> <javac ... deprecation="false" optimize="false" failonerror="true"><src path="${src.dir}"/></javac>
>>
>> k
>>
>> On Thu, 25 Aug 2005 17:35:59 -0700, Wang Wen wrote:
>>
>>> Hi Kelvin:
>>>
>>> I see what you mean, and I commented out the line for unzip jar
>>>  case. See attached file.
>>>
>>> But still get error;
>>>
>>> I run ant nutch-deploy under nutch root/OC31/, is it correct?
>>> Must  some stupid error, sorry that is first time I run patch
>>> in nutch;
>>>
>>> thanks,
>>>
>>> Wang,
>>>
>>> ===
>>> $ ant nutch-deploy
>>> Buildfile: build.xml
>>>
>>> compile:
>>> [javac] Compiling 50 source files to
>>> E:\programs\cygwin\home\fji\versionControl\nutch_V07_OC_test\nutch\OC31\classes
>>> [javac] javac: invalid target release: 1.5
>>> [javac] Usage: javac <options> <source files>
>>> [javac] where possible options include:
>>> [javac]   -g                        Generate all debugging info
>>> [javac]   -g:none                   Generate no debugging info
>>> [javac]   -g:{lines,vars,source}    Generate only some debugging info
>>> [javac]   -nowarn                   Generate no warnings
>>> [javac]   -verbose                  Output messages about what the compiler is doing
>>> [javac]   -deprecation              Output source locations where deprecated APIs are used
>>> [javac]   -classpath <path>         Specify where to find user class files
>>> [javac]   -sourcepath <path>        Specify where to find input source files
>>> [javac]   -bootclasspath <path>     Override location of bootstrap class files
>>> [javac]   -extdirs <dirs>           Override location of installed extensions
>>> [javac]   -d <directory>            Specify where to place generated class files
>>> [javac]   -encoding <encoding>      Specify character encoding used by source files
>>> [javac]   -source <release>         Provide source compatibility with specified release
>>> [javac]   -target <release>         Generate class files for specific VM version
>>> [javac]   -help                     Print a synopsis of standard options
>>>
>>> BUILD FAILED
>>> E:\programs\cygwin\home\fji\versionControl\nutch_V07_OC_test\nutch\OC31\build.xml:24:
>>> Compile failed; see the compiler error output for details.
>>>
>>> Total time: 1 second
>>>
>>> [EMAIL PROTECTED] ~/versionControl/nutch_V07_OC_test/nutch/OC31
>>> $ java -version
>>> java version "1.5.0_04"
>>> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_04-b05)
>>> Java HotSpot(TM) Client VM (build 1.5.0_04-b05, mixed mode, sharing)
>>>
>>> [EMAIL PROTECTED] ~/versionControl/nutch_V07_OC_test/nutch/OC31
>>>




Re: Implementation of (NUTCH-84) Fetcher for constrained crawls

2005-08-26 Thread Kelvin Tan
When creating the 0.3.1 zip, I mistakenly included some changes I was making at 
that time to the code. I'm working on creating a 0.3.2 right now with these 
corrections.

Alternatively, just use 0.3.zip with the 0.3.1 build.xml and build.properties.

Thanks, and sorry about the oversight.

k

On Fri, 26 Aug 2005 10:54:31 +0200, Piotr Kosiorowski wrote:
> I suspected he uses JDK 1.4 and as when he executes 'java -version'
> it print 1.5 I think he has both JDK installed and PATH is updated
> for 1.5 but JAVA_HOME (used by ant) is set to the 1.4 one. I did so
> in my installation and get indentical results.
> But having said that I am still not able to compile it using JDK
> 1.5 - and looking at the source code it should not compile in my
> opinion: Error is:
> [javac] Compiling 50 source files to D:\oc\classes
> [javac] D:\oc\src\java\org\supermind\crawl\HostQueue.java:70: non-static
> variable maxPagesPerConnection cannot be referenced from a static context
> [javac]     int number = Math.min(pages.size(), Fetcher.maxPagesPerConnection);
> [javac]                                                ^
>
> And when you look at Fetcher.java:
> protected int maxPagesPerConnection = 5;
>
> So it should not compile in my opinion.
> Am I missing sth?
> Regards
> Piotr
>
>
> On 8/26/05, Kelvin Tan <[EMAIL PROTECTED]> wrote:
>
>> Wang Wen is having some build problems with the code I uploaded
>> to Jira. I'm wondering if anyone else is facing the same problems?
>>
>> kelvin




Re: Implementation of (NUTCH-84) Fetcher for constrained crawls

2005-08-26 Thread Kelvin Tan
Just uploaded 0.3.2.zip. Please use this instead. Sorry about the oversights..

k

On Fri, 26 Aug 2005 12:08:33 -0400, Kelvin Tan wrote:
> When creating the 0.3.1 zip, I mistakenly included some changes I
> was making at that time to the code. I'm working on creating a
> 0.3.2 right now with these corrections.
>
> Alternatively, just use 0.3.zip with the 0.3.1 build.xml and
> build.properties.
>
> Thanks, and sorry about the oversight.
>
> k
>
> On Fri, 26 Aug 2005 10:54:31 +0200, Piotr Kosiorowski wrote:
>
>> I suspected he uses JDK 1.4 and as when he executes 'java -
>> version'  it print 1.5 I think he has both JDK installed and PATH
>> is updated  for 1.5 but JAVA_HOME (used by ant) is set to the 1.4
>> one. I did so  in my installation and get indentical results.
>> But having said that I am still not able to compile it using JDK  
>> 1.5 - and looking at the source code it should not compile in my  
>> opinion: Error is:
>> [javac] Compiling 50 source files to D:\oc\classes
>> [javac] D:\oc\src\java\org\supermind\crawl\HostQueue.java:70: non-static
>> variable maxPagesPerConnection cannot be referenced from a static context
>> [javac]     int number = Math.min(pages.size(), Fetcher.maxPagesPerConnection);
>> [javac]                                                ^
>>
>> And when you look at Fetcher.java:
>> protected int maxPagesPerConnection = 5;
>>
>> So it should not compile in my opinion.
>> Am I missing sth?
>> Regards
>> Piotr
>>
>>
>> On 8/26/05, Kelvin Tan <[EMAIL PROTECTED]> wrote:
>>
>>> Wang Wen is having some build problems with the code I uploaded
>>>  to Jira. I'm wondering if anyone else is facing the same
>>> problems?
>>>
>>> kelvin




Re: indexing and refetching by using (NUTCH-84) Fetcher for constrained crawls

2005-08-27 Thread Kelvin Tan
Hey Michael, did you use the nutch-84 segment location as the argument for the 
respective nutch commands, e.g..

bin/nutch updatedb db 

If intending to integrate with webdb, you'll need to ensure the directory 
structure of the segment output is what Nutch expects, which means
db/segments/

I haven't tried running Nutch with the index created, but when I open the index 
in Luke, everything looks correct. Let me know if you still have problems.

To customize how domains are crawled, you'll want to write a ScopeFilter. Take 
a look at SameParentHostFLFilter for an example. When I have some time later 
today, I'll see if I can hack something quick to limit crawling by depth..
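
For reference, a filter in the spirit of SameParentHostFLFilter might look 
roughly like the sketch below: it keeps a URL in scope only if it is on the 
same host as its parent. This is only an illustration; the accessors on 
FetchListScope.Input used here (input.url and input.parent.url) are assumptions 
based on the snippets in this thread, not OC's exact API.

/** Hypothetical sketch of a same-host scope filter. */
public class SameHostOnlyFilter implements ScopeFilter {
  public int filter(FetchListScope.Input input) {
    // input.url = candidate URL, input.parent.url = page it was found on (assumed fields)
    String candidateHost = input.url.getHost();
    String parentHost = input.parent.url.getHost();
    return candidateHost.equalsIgnoreCase(parentHost) ? ALLOW : REJECT;
  }
}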

k

On Sat, 27 Aug 2005 09:01:28 -0700 (PDT), Michael Ji wrote:
>
> Hi there,
>
> I installed Nutch-84 patch in Nutch 07 and run patch test script
> successfully with my seeds.txt.
>
> It created /segment/ with sub-directories of "content", "fetcher",
> "parse_data" and "parse_text".
>
> Followings are the issues I met and concerning:
>
> 1) Indexing
>
> Then, I run nutch/index for this segment successfully. But there is
> no result (hits) returned in searching after I launch tomcat box.
>
> 2) Domain control
>
> As I understood, this patch is for control domain crawling. Seems
> we can define the fetching depth for both domain site and
> outlinking site by ourself. If so, where these parameters I can
> input?
>
> 3) Refetching
>
> Based on the fetched data, I tried several things, such as, running
> nutch/updatedb, nutch/gengerate, nutch/fetcher. Seems not working.
>
> Is there a scenario that I can adopt this patch for refetching
> purpose?
>
> thanks,
>
> Michael Ji,
>
>




Limiting crawl depth using (NUTCH-84) Fetcher for constrained crawls

2005-08-27 Thread Kelvin Tan
If we add a depth field to ScheduledURL, then controlling the depth of a crawl 
is simple:

/**
 * Limits a crawl to a fixed depth. Seeds are depth 0.
 */
public class DepthFLFilter implements ScopeFilter {
  private int max;

  public synchronized int filter(FetchListScope.Input input) {
    return input.parent.depth < max ? ALLOW : REJECT;
  }

  public void setMax(int max) {
    this.max = max;
  }
}

On Sat, 27 Aug 2005 13:19:18 -0400, Kelvin Tan wrote:
> Hey Michael, did you use the nutch-84 segment location as the
> argument for the respective nutch commands, e.g..
>
> bin/nutch updatedb db 
>
> If intending to integrate with webdb, you'll need to ensure the
> directory structure of the segment output is what Nutch expects,
> which means
> db/segments/
>
> I haven't tried running Nutch with the index created, but when I
> open the index in Luke, everything looks correct. Let me know if
> you still have problems.
>
> To customize how domains are crawled, you'll want to write a
> ScopeFilter. Take a look at SameParentHostFLFilter for an example.
> When I have some time later today, I'll see if I can hack something
> quick to limit crawling by depth..
>
> k
>
> On Sat, 27 Aug 2005 09:01:28 -0700 (PDT), Michael Ji wrote:
>
>> Hi there,
>>
>> I installed Nutch-84 patch in Nutch 07 and run patch test script  
>> successfully with my seeds.txt.
>>
>> It created /segment/ with sub-directories of "content",
>> "fetcher",  "parse_data" and "parse_text".
>>
>> Followings are the issues I met and concerning:
>>
>> 1) Indexing
>>
>> Then, I run nutch/index for this segment successfully. But there
>> is  no result (hits) returned in searching after I launch tomcat
>> box.
>>
>> 2) Domain control
>>
>> As I understood, this patch is for control domain crawling. Seems
>>  we can define the fetching depth for both domain site and  
>> outlinking site by ourself. If so, where these parameters I can  
>> input?
>>
>> 3) Refetching
>>
>> Based on the fetched data, I tried several things, such as,
>> running  nutch/updatedb, nutch/gengerate, nutch/fetcher. Seems
>> not working.
>>
>> Is there a scenario that I can adopt this patch for refetching  
>> purpose?
>>
>> thanks,
>>
>> Michael Ji,
>>
>>




Re: Limiting crawl depth using (NUTCH-84) Fetcher for constrained crawls

2005-08-27 Thread Kelvin Tan
Hey Michael, please see inline..

On Sat, 27 Aug 2005 14:54:21 -0700 (PDT), Michael Ji wrote:
> hi Kelvin:
>
> Thanks your hint for depth control. I will try it tonight and will
> let you know the result.
>
> I guess the design of patch-84 is to become an independent crawler
> by itself. Is it true?
>

nutch-84 was designed to be a standalone focused/constrained crawler.

> So, it will replace the commands of "nutch/admintool create..,
> nutch/generate, nutch/updateda", etc, by only using OC APIs.
>

No. Nutch is more than just the crawler (WebDB, analyzer, etc.). nutch-84 was 
created for people who want a crawler (to use with Lucene) but not necessarily 
the whole Nutch infrastructure. Take, for instance, the fact that the Nutch 
query language is slightly different from Lucene's. By extending this simple 
PostFetchProcessor below, you can easily just crawl and add documents directly 
to a Lucene index without needing to use the WebDB (or the bin/nutch index command).

public abstract class LucenePostFetchProcessor implements PostFetchProcessor {
  private String index;
  private IndexWriter writer;
  private boolean overwrite;
  private Analyzer analyzer;

  public void process(FetcherOutput fo, Content content, Parse parse)
      throws IOException {
    if (writer == null) initWriter();
    writeDocument(fo, content, parse);
  }

  protected abstract void writeDocument(
      FetcherOutput fo, Content content, Parse parse);

  private void initWriter() throws IOException {
    writer = new IndexWriter(index, analyzer, overwrite);
  }

  public void close() throws IOException {
    if (writer != null) writer.close();
  }

  // setters
}


> I mean, OC can form its own fetch list for the fetching next round,
> for example. Only the fetched result needs to be indexed and merged.
>

If what you mean is that OC will run continuously until there are no more URLs 
to fetch, you are correct. Unfortunately, until we deal with the problem of bot 
traps, I don't think this is a good idea for a production environment.

HTH,
k



Re: Limiting crawl depth using (NUTCH-84) Fetcher for constrained crawls

2005-08-27 Thread Kelvin Tan
Hey Michael,

On Sat, 27 Aug 2005 21:13:27 -0700 (PDT), Michael Ji wrote:
> Hi Kelvin:
>
> I started to dig into the code and data structure of OC;
>
> Just curious questions:
>
> 1) Where OC forms a fetchlist? I didn't see it in segment/ of OC
> created.

Crawls are seeded using a CrawlSeedSource. These URLs are injected into the 
respective FetcherThreads' fetchlists.

After the initial seed, URLs are added to the fetchlists from the parsed pages' 
outlinks. OC builds the fetchlist online, versus Nutch's offline fetchlist 
building.
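
As an illustration, a file-based seed source might look like the sketch below. 
The only part of CrawlSeedSource visible in this thread is getSeedURLs() 
returning an Iterator, so the element type, any other interface methods, and 
the setter name here are assumptions rather than OC's exact API.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical file-based seed source: one URL per line, Nutch style. */
public class FileCrawlSeedSource implements CrawlSeedSource {
  private String seedFile;

  public void setSeedFile(String seedFile) {   // wired from beans.xml
    this.seedFile = seedFile;
  }

  public Iterator getSeedURLs() throws IOException {
    List urls = new ArrayList();
    BufferedReader reader = new BufferedReader(new FileReader(seedFile));
    try {
      for (String line; (line = reader.readLine()) != null; ) {
        line = line.trim();
        if (line.length() > 0) urls.add(new URL(line));
      }
    } finally {
      reader.close();
    }
    return urls.iterator();
  }
}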

>
> 2) In OC, FetchList is organized in such a way of URLs sequence per
> host. Then, what if there are too many hosts, saying ten thousand.
> How about I/O performance concern? Will it exceed the system open-
> file limitation?
>

The concern, if any, would be memory, not I/O, because DefaultFetchList 
currently stores everything in memory. Still, it's an interface, and it would 
be simple for someone to implement a fetchlist that bounds memory use, 
persisting to disk where appropriate.

k



Re: crawling ability of NUTCH-84

2005-08-28 Thread Kelvin Tan
Michael,

On Sun, 28 Aug 2005 08:31:29 -0700 (PDT), Michael Ji wrote:
> hi Kelvin:
>
> Just a curious question.
>
> As I know, the goal of nutch global crawling ability will reach 10
> billions page based on implementation of map reduced.
>
> OC, seeming to fall in the middle, is for control industry domain
> crawling. How many sites is its' goal?dealing with couple of
> thousand sites?
>

The goal of OC is to facilitate focused crawling. I see at least 2 kinds of 
focused crawling:

1. Whole-web focused crawling, like spidering all pages/sites on the WWW related 
to research publications on leukemia
2. Crawling a given list of URLs/sites comprehensively, like Teleport Pro.

Although OC was designed with scenario #2 in mind, I think it would also be 
suitable for scenario #1.

If the size of the crawl is a concern, I don't think it'd be difficult to build 
in a throttling mechanism to ensure that the in-memory data structures don't 
get too large. I've been travelling around a lot lately, so I haven't had a 
chance to test OC on crawls > 200k pages.


> I believe the importance for industry domain crawling is in-time
> updating. So identifying content of fetched page and saving post-
> parsing time is critical.
>

I agree. High on my todo list are:

1. Refetching using If-Modified-Since (a conditional-GET sketch follows below)
2. Using an alternate link extractor if NekoHTML turns out to be a bottleneck
3. Parsing downloaded pages to extract data into databases to facilitate 
aggregation, like defining a site template to map HTML pages to database 
columns (think job sites, for example)
4. Moving post-fetch processing into a separate thread if it turns out to be a 
bottleneck
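
For item 1, the mechanics are a standard HTTP conditional GET. Below is only an 
illustrative sketch using the JDK's HttpURLConnection rather than OC's own HTTP 
code (which pools persistent connections); the class and method names are 
hypothetical.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ConditionalGet {
  /**
   * Fetches a URL only if it changed since lastFetchTime (epoch millis).
   * Returns the body bytes, or null if the server answered 304 Not Modified.
   */
  public static byte[] fetchIfModified(URL url, long lastFetchTime) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setIfModifiedSince(lastFetchTime);        // adds the If-Modified-Since header
    int status = conn.getResponseCode();
    if (status == HttpURLConnection.HTTP_NOT_MODIFIED) {
      return null;                                 // unchanged: skip re-parsing and re-indexing
    }
    InputStream in = conn.getInputStream();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    for (int n; (n = in.read(buf)) != -1; ) {
      out.write(buf, 0, n);
    }
    in.close();
    return out.toByteArray();
  }
}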

k



Re: bot-traps and refetching

2005-08-28 Thread Kelvin Tan
Michael,

On Sun, 28 Aug 2005 07:31:06 -0700 (PDT), Michael Ji wrote:
> Hi Kelvin:
>
> 1) bot-traps problem for OC
>
> If we have a crawling depth for each starting host, it seems that
> the crawling will be finalized in the end ( we can decrement depth
> value in each time the outlink falls in same host domain).
>
> Let me know if my thought is wrong.
>

Correct. Limiting crawls by depth is probably the simplest way of avoiding 
death by bot-traps. There are other methods though, like assigning credits to 
hosts and adapting fetchlist scheduling according to credit usage, or flagging 
recurring path elements as suspect.
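
As an illustration of the "recurring path elements" heuristic, here is a small, 
hypothetical helper (not part of OC) that flags URLs whose path repeats the same 
segment more than a few times, e.g. /a/b/a/b/a/b/... generated by a trap.

import java.net.URL;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical bot-trap heuristic: too many repeats of one path segment. */
public class RepeatedPathDetector {
  private final int maxRepeats;

  public RepeatedPathDetector(int maxRepeats) {
    this.maxRepeats = maxRepeats;
  }

  public boolean looksLikeTrap(URL url) {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (String segment : url.getPath().split("/")) {
      if (segment.length() == 0) continue;
      Integer seen = counts.get(segment);
      int n = (seen == null) ? 1 : seen + 1;   // JDK 5 autoboxing
      if (n > maxRepeats) return true;
      counts.put(segment, n);
    }
    return false;
  }
}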

> 2) refetching
>
> If OC's fetchlist is online (memory residence), the next time
> refetch we have to restart from seeds.txt once again. Is it right?
>

Maybe with the current implementation. But if you implement a CrawlSeedSource 
that reads in the FetcherOutput directory of the Nutch segment, then you can 
seed a crawl using what's already been fetched.


> 3) page content checking
>
> In OC API, I found an API WebDBContentSeenFilter, who uses Nutch
> webdb data structure to see if the fetched page content has been
> seen before. That means, we have to use Nutch to create a webdb
> (maybe nutch/updatedb) in order to support this function. Is it
> right?

Exactly right.

k




Re: controlled depth crawling

2005-08-28 Thread Kelvin Tan
Hey Michael, I don't think that would work, because every link on a single page 
would be decrementing its parent depth.

Instead, I would stick with the DepthFLFilter I provided, and change 
ScheduledURL's ctor to

public ScheduledURL(ScheduledURL parent, URL url) {
  this.id = assignId();
  this.seedIndex = parent.seedIndex;
  this.parentId = parent.id;
  this.depth = parent.depth + 1;
  this.url = url;
}

Then in beans.xml, declare DepthFLFilter as a bean, and set the "max" property 
to 5.

You can have even more fine-grained control by writing an FLFilter that lets 
you specify a maxDepth per host, falling back to a default depth for hosts that 
aren't declared. Something like

<bean class="org.supermind.crawl.scope.ExtendedDepthFLFilter">
  <property name="defaultMax"><value>20</value></property>
  <property name="hosts">
    <map>
      <entry key="www.nutch.org"><value>7</value></entry>
      <entry key="www.apache.org"><value>2</value></entry>
    </map>
  </property>
</bean>


(formatting is probably going to end up warped).

See what I mean?

k

On Sun, 28 Aug 2005 19:37:16 -0700 (PDT), Michael Ji wrote:
>
> Hi Kelvin:
>
> I tried to implement controlled depth crawling based on your Nutch-
> 84 and the discussion we had before.
>
> 1. In DepthFLFilter Class,
>
> I did a bit modification
> "
> public synchronized int filter(FetchListScope.Input input) {
> input.parent.decrementDepth();
> return input.parent.depth >= 0 ? ALLOW : REJECT; } "
>
> 2 In ScheduledURL Class
> add one member variable and one member function " public int depth;
>
> public void decrementDepth() {
> depth --;
> }
> "
>
> 3 Then
>
> we need an initial depth for each domain; for the initial testing;
> I can set a default value 5 for all the site in seeds.txt and for
> each outlink, the value will be 1;
>
> In that way, a pretty vertical crawling is done for on-site domain
> while outlink homepage is still visible;
>
> Further more, should we define a depth value for each url in
> seeds.txt?
>
> Did I in the right track?
>
> Thanks,
>
> Michael Ji
>
>




Re: controlled depth crawling

2005-08-29 Thread Kelvin Tan
Michael, you don't need to modify FetcherThread at all.
 
Declare DepthFLFilter in beans.xml within the fetchlist scope filter list:

<bean class="org.supermind.crawl.scope.DepthFLFilter">
  <property name="max"><value>20</value></property>
</bean>

That's all you need to do.
 
k

On Mon, 29 Aug 2005 17:18:09 -0700 (PDT), Michael Ji wrote:
> hi Kelvin:
>
> I see your idea and agree with you.
>
> Then, I guess the filter will apply in
>
> FetcherThread.java
> with lines of
> "
> if ( fetchListScope.isInScope(flScopeIn) &
> depthFLFilter.filter(flScopeIn) ) "
>
> Am I right?
>
> I am in the business trip this week. Hard to squeeze time to do
> testing and developing. But I will keep you updated.
>
> thanks,
>
> Micheal,
>
>
> --- Kelvin Tan <[EMAIL PROTECTED]> wrote:
>
>> Hey Michael, I don't think that would work, because every link on
>> a single page would be decrementing its parent depth.
>>
>> Instead, I would stick to the DepthFLFilter I provided, and
>> changed ScheduledURL's ctor to
>>
>> public ScheduledURL(ScheduledURL parent, URL url) { this.id =
>> assignId();
>> this.seedIndex = parent.seedIndex; this.parentId = parent.id;
>> this.depth = parent.depth + 1; this.url = url; }
>>
>> Then in beans.xml, declare DepthFLFilter as a bean, and set the
>> "max" property to 5.
>>
>> You can even have a more fine-grained control by making a
>> FLFilter that allows you to specify a host and maxDepth, and if a
>> host is not declared, then the default depth is used. Something
>> like
>>
>> <bean class="org.supermind.crawl.scope.ExtendedDepthFLFilter">
>>   <property name="defaultMax"><value>20</value></property>
>>   <property name="hosts">
>>     <map>
>>       <entry key="www.nutch.org"><value>7</value></entry>
>>       <entry key="www.apache.org"><value>2</value></entry>
>>     </map>
>>   </property>
>> </bean>
>>
>> (formatting is probably going to end up warped).
>>
>> See what I mean?
>>
>> k
>>
>> On Sun, 28 Aug 2005 19:37:16 -0700 (PDT), Michael Ji wrote:
>>
>>>
>>> Hi Kelvin:
>>>
>>> I tried to implement controlled depth crawling
>>>
>> based on your Nutch-
>>> 84 and the discussion we had before.
>>>
>>> 1. In DepthFLFilter Class,
>>>
>>> I did a bit modification
>>> "
>>> public synchronized int
>> filter(FetchListScope.Input input) {
>>
>>> input.parent.decrementDepth();
>>> return input.parent.depth >= 0 ? ALLOW : REJECT; }
>>>
>> "
>>
>>> 2 In ScheduledURL Class
>>> add one member variable and one member function "
>>>
>> public int depth;
>>
>>> public void decrementDepth() {
>>> depth --;
>>> }
>>> "
>>>
>>> 3 Then
>>>
>>> we need an initial depth for each domain; for the
>>>
>> initial testing;
>>> I can set a default value 5 for all the site in
>>>
>> seeds.txt and for
>>> each outlink, the value will be 1;
>>>
>>> In that way, a pretty vertical crawling is done
>>>
>> for on-site domain
>>> while outlink homepage is still visible;
>>>
>>> Further more, should we define a depth value for
>>>
>> each url in
>>> seeds.txt?
>>>
>>> Did I in the right track?
>>>
>>> Thanks,
>>>
>>> Michael Ji
>>>
>>>




Re: merge mapred to trunk

2005-08-31 Thread Kelvin Tan


On Wed, 31 Aug 2005 14:37:54 -0700, Doug Cutting wrote:
>[EMAIL PROTECTED] wrote:
>> I, too, am looking forward to this, but I am wondering what that
>> will do to Kelvin Tan's recent contribution, especially since I
>> saw that both MapReduce and Kelvin's code change how
>> FetchListEntry works.  If merging mapred to trunk means losing
>> Kelvin's changes, then I suggest one of Nutch developers
>> evaluates Kelvin's modifications and, if they are good, commits
>> them to trunk, and then makes the final pre-mapred release (e.g.
>> release-0.8).
>>
>
> It won't lose Kelvin's patch: it will still be a patch to 0.7.
>
> What I worry about is the alternate scenario: that Kelvin & others
> invest a lot of effort making this work with 0.7, while the mapred-
> based code diverges even further.  It would be best if Kelvin's
> patch is ported to the mapred branch sooner rather than later, then
> maintained there.
>
> Doug

Agreed. I have some time in the coming weeks, and will work full-time to evolve 
the patch to be more compatible with Nutch, especially map-red.

k



Event queues vs threads

2005-09-01 Thread Kelvin Tan
I'm toying around with the idea of implementing the fetcher as a series of 
event queues (a la SEDA) instead of with threads. This means breaking the 
fetching operation up into a series of stages connected by queues, instead of 
running one FetcherThread per task.

The stages I see are:

1. CrawlStarter (url injection)
2. URL filtering and normalizing
3. HttpRequest
4. HttpResponse
5. DB of fetched MD5 hashes
6. DB of fetched URLs
7. Parse and link extraction
8. Output
9. Link/Page Scoring

Each of these stages will be handled in its own thread (except for HTML parsing 
and scoring, which may actually benefit from having multiple threads). With the 
introduction of non-blocking IO, I think threads should be used only where 
parallel computation offers performance advantages.
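
A minimal sketch of the stage/queue idea, using java.util.concurrent (available 
since JDK 5, which OC already requires). The Stage type below is hypothetical, 
not an OC class; each stage owns an input queue and a single worker thread, and 
hands its results to the next stage's queue.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** One SEDA-style stage: a single worker thread draining an input queue. */
abstract class Stage<I, O> implements Runnable {
  final BlockingQueue<I> in = new LinkedBlockingQueue<I>();
  private final BlockingQueue<O> out;

  Stage(BlockingQueue<O> out) { this.out = out; }

  /** Transform one item; return null to drop it. */
  protected abstract O handle(I item) throws Exception;

  public void run() {
    try {
      while (true) {
        I item = in.take();               // blocks until work arrives
        try {
          O result = handle(item);        // single-threaded: no shared-state locking needed
          if (result != null) out.put(result);
        } catch (InterruptedException e) {
          throw e;                        // let interruption stop the stage
        } catch (Exception e) {
          e.printStackTrace();            // one bad item shouldn't kill the stage
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // stage shut down
    }
  }
}

Stages would be wired by constructing each stage with the next stage's "in" 
queue and starting one Thread per stage (or a small pool for the parse and 
scoring stages).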

Breaking up HttpRequest and HttpResponse will also pave the way for a 
non-blocking HTTP implementation.

A big advantage also arises from a decrease in programmatic complexity (and 
possibly a gain in performance). With most of the stages guaranteed to be 
single-threaded, threading/synchronization issues are dramatically reduced. 
This may not be so evident in the current/map-red fetch code, but because of 
the completely online nature of nutch-84/OC, it does simplify things 
considerably.

I'll need to dig a bit more to see how this can be conceptually translated into 
map-reduce, but I imagine it's doable. Perhaps each stage gets mapped, then 
reduced?

Any thoughts?



To mapred or not

2005-09-01 Thread Kelvin Tan
Seeing mapred is about to be folded into trunk, 3 questions:

1. Any benchmarks/estimates on when the scalability of map-reduce surpasses its 
overhead/complexity? e.g. with > 10 reduce workers..
2. Will there be an option of a plain vanilla single-box Nutch crawler vs a 
map-reduce version?
3. What are the options for users who don't want to jump onboard map-red? Will 
pre-mapred be actively maintained?

thanks..
k



Re: Event queues vs threads

2005-09-01 Thread Kelvin Tan


On Thu, 01 Sep 2005 09:58:49 -0700, Doug Cutting wrote:
> Kelvin Tan wrote:
>> Each of these stages will be handled in its own thread (except
>> for HTML parsing and scoring, which may actually benefit from
>> having multiple threads). With the introduction of non-blocking
>> IO, I think threads should be used only where parallel
>> computation offers performance advantages.
>>
>> Breaking up HttpRequest and HttpResponse, will also pave the way
>> for a non-blocking HTTP implementation.
>>
> I have never been able to write a async version of things with
> Java's nio that outperforms a threaded version.  In theory it is
> possible, since you can avoid thread switching overheads.  But in
> practice I have found it difficult.
>
> Doug

Interesting. I haven't tried it myself. Do you have any code/benchmarks for 
this? Are you aware of others facing the same problem?

k



Re: To mapred or not

2005-09-01 Thread Kelvin Tan


On Thu, 01 Sep 2005 09:36:19 -0700, Doug Cutting wrote:
> It would be worth considering which features of your constrained
> crawler   could be cast as improvements to Nutch's existing tools
> (e.g., more seed url formats, more output formats, http 1.1, custom
> scopes, etc.) and which require a different control flow (online
> fetchlist building?).   In some cases (e.g., fetch prioritization)
> perhaps a new Plugin should be added to Nutch.

In most cases, it is merely a generalization of what Nutch already has, 
introducing interfaces where appropriate to make it easier to modify behavior. 
I've come to see the importance of making scoring pluggable (essential for 
focused crawling), and of supporting both host-based (current nutch-84) and 
score-based (current Nutch) fetch prioritization.

There are some departures which need to be reconciled, in particular the role 
of fetchlists and the way they are built. However, I do not see any major 
incompatibilities between whole-web and focused crawling requirements.

In some cases, though, focused crawling requirements may require extra data to 
be stored which is not useful for whole-web crawling, for example a URL's 
parent URL, seed URL, and depth (essential for crawl scopes).

k



Re: HTTP 1.1

2005-09-17 Thread Kelvin Tan
Hey Earl, the Nutch-84 enhancement suggestion in JIRA does just this. There is 
also support for request pipelining, which, rather unfortunately, isn't a good 
idea when working with dynamic sites.

Check out a previous post on this: 
http://marc.theaimsgroup.com/?l=nutch-developers&m=112476980602585&w=2

kelvin

On Fri, 16 Sep 2005 16:49:50 -0700 (PDT), Earl Cahill wrote:
> Maybe way ahead of me here, but it was just hitting me that it
> would be pretty cool to group urls to fetch my host and then
> perhaps use http 1.1 to reuse the connection and save initial
> handshaking overheard. Not a huge deal for a couple hits, but it I
> think it would make sense for large crawls.
>
> Or maybe keep a pool of http connections to the last x sites open
> somewhere and check there first.
>
> Sound reasonable?  Already doing it?  I would be willing to help.
>
> Just a thought.
>
> Earl
>
>




No more FetchListEntry in MapReduce branch

2005-10-04 Thread Kelvin Tan
There were some previous discussions on implementing If-Modified-Since during 
the fetching phase by modifying FetchListEntry. Seeing that FetchListEntry is 
no longer used in the MapReduce branch and only the URL string is passed to the 
protocol handlers, I'm wondering if anyone has thoughts on how to work with 
this.

Along the same lines, it appears that implementing a more feature-ful system of 
crawling scopes and filters (a la OC/Nutch-84) requires some form of abstraction 
to be carried around in the map/reduce phases rather than just the URL string. 
For example, for each URL, its seed URL, depth from seed, and parent URL need 
to be known.

kelvin



Re: clustering strategies

2005-10-15 Thread Kelvin Tan
Earl, for a start, since you're crawling your local network and hammering it is 
not a problem, have you also tried disabling stuff like robots checking, and 
the server wait delay?

On Fri, 14 Oct 2005 17:14:23 -0700 (PDT), Earl Cahill wrote:
> Well, I think strangely, not a lot of interest here.
>
> My main concern is that I am trying to crawl content a hop away,
> and I can't really do it very fast.  Once I start my mapred crawl,
> it spends most of the time map reducing and very little time
> actually getting pages. I made the change doug suggested
> (fetcher.threads.per.host=100, http.max.delays=0), and the crawl
> still goes very slow.  I have several other boxes I can use (two
> "good" boxes, several other boxes), just not sure how best to
> spread the jobs, storage and the like.
>
> Again, I would like to crawl about a million local pages in a night.
>
> Feedback would be appreciated.
>
> Thanks,
> Earl
>
> --- Earl Cahill <[EMAIL PROTECTED]> wrote:
>
>> I think it would be nice to have a few cluster strategies on the
>> wiki.
>>
>> It seems there are at least three separate needs: CPU,
>> storage and bandwidth, and I think the more those could be
>> cleanly spread to different boxes, the better.
>>
>> Guess I am imagining a breakdown that lists, by priority, how
>> things should be broken out.  So someone
>> could look at the list and say, ok, I have three good
>> boxes, I should make the best box do x, the second best do y,
>> etc.  There could also be case studies for
>> how different folks did their own implementations and
>> what their crawl/query times were like.
>>
>> I have a small cluster (up to 15 boxes) and would like
>> to start to play around and see how things go under different
>> strategies.  I also have about a million pages of local content,
>> so I can hammer things pretty
>> hard without even leaving my network.  I know that may
>> not match normal conditions, but it could hopefully remove a
>> variable or two (network latency, slow sites), to keep things
>> simple at least to start.
>>
>> I think it also a decent goal to be able to crawl/index my pages
>> in a night (say eight hours), which would be around 35
>> pages/second.  If that isn't
>> a reasonable goal, I would like to hear why not.
>>
>> For each strategy, we could have a set of confs describing how to
>> set things up.  I can picture a gui
>> which could list box roles (crawler, mapper, whatever)
>> and boxes available.  The users could drag and drop their boxes
>> to roles, and confs could then be generated.  Think it could make
>> for rather easy design/implementation of clusters that could get
>> rather complicated.  I can do drag/drop and interpolate into
>> templates in javascript, so I could envision a rather simple page.
>>
>> Maybe we could even store the cluster setup in xml, and have a
>> script that takes the xml and draws the cluster.  Then when
>> people report slowness or the like, they could also post their
>> cluster setup.
>>
>> I think when users come to nutch, they come with a set
>> of boxes.  I think it would be nice for them to see what has
>> worked for such a set of boxes in the past and be able to easily
>> implement such a strategy. Kind
>> of the one hour from download to spidering vision.
>>
>> Just a few thoughts.
>>
>> Earl
>>
>>




[jira] Created: (NUTCH-84) Fetcher for constrained crawls

2005-08-24 Thread Kelvin Tan (JIRA)
Fetcher for constrained crawls
--

 Key: NUTCH-84
 URL: http://issues.apache.org/jira/browse/NUTCH-84
 Project: Nutch
Type: Improvement
  Components: fetcher  
Versions: 0.7
Reporter: Kelvin Tan
Priority: Minor


As posted http://marc.theaimsgroup.com/?l=nutch-developers&m=112476980602585&w=2

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-84) Fetcher for constrained crawls

2005-08-24 Thread Kelvin Tan (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-84?page=all ]

Kelvin Tan updated NUTCH-84:


Attachment: oc-0.3.zip

Javadocs included in the zip and also available online at 
http://www.supermind.org/code/oc/api/index.html.

Code is released under APL, but I've also included the Spring jars you'll need 
to run it.

> Fetcher for constrained crawls
> --
>
>  Key: NUTCH-84
>  URL: http://issues.apache.org/jira/browse/NUTCH-84
>  Project: Nutch
> Type: Improvement
>   Components: fetcher
> Versions: 0.7
> Reporter: Kelvin Tan
> Priority: Minor
>  Attachments: oc-0.3.zip
>
> As posted 
> http://marc.theaimsgroup.com/?l=nutch-developers&m=112476980602585&w=2




[jira] Updated: (NUTCH-84) Fetcher for constrained crawls

2005-08-25 Thread Kelvin Tan (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-84?page=all ]

Kelvin Tan updated NUTCH-84:


Attachment: oc-0.3.1.zip

Updated build.xml and build.properties so it works both with unpacked 
distributions or SVN copies.

> Fetcher for constrained crawls
> --
>
>  Key: NUTCH-84
>  URL: http://issues.apache.org/jira/browse/NUTCH-84
>  Project: Nutch
> Type: Improvement
>   Components: fetcher
> Versions: 0.7
> Reporter: Kelvin Tan
> Priority: Minor
>  Attachments: oc-0.3.1.zip, oc-0.3.zip
>
> As posted 
> http://marc.theaimsgroup.com/?l=nutch-developers&m=112476980602585&w=2




[jira] Updated: (NUTCH-84) Fetcher for constrained crawls

2005-08-26 Thread Kelvin Tan (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-84?page=all ]

Kelvin Tan updated NUTCH-84:


Attachment: oc-0.3.2.zip

Corrected compilation and build snafus.

> Fetcher for constrained crawls
> --
>
>  Key: NUTCH-84
>  URL: http://issues.apache.org/jira/browse/NUTCH-84
>  Project: Nutch
> Type: Improvement
>   Components: fetcher
> Versions: 0.7
> Reporter: Kelvin Tan
> Priority: Minor
>  Attachments: oc-0.3.1.zip, oc-0.3.2.zip, oc-0.3.zip
>
> As posted 
> http://marc.theaimsgroup.com/?l=nutch-developers&m=112476980602585&w=2
