Re: Nutch Crawl a Specific List Of URLs (150K)

2013-12-28 Thread Talat Uyarer
Hi Bin,

You have an interesting error. I don't use 1.7, but you can try running it
with the screen command. I believe you will not get the same error.

Talat


2013/12/27 Bin Wang 

> Hi,
>
> I have a very specific list of about 140K URLs.
>
> I switched off `db.update.additions.allowed` so it will not add new URLs to
> the crawldb, and I assumed I could feed all the URLs to Nutch and that,
> after one round of fetching, it would finish and leave all the raw HTML
> files in the segment folder.
>
> However, after I ran this command:
> nohup bin/nutch crawl urls -dir result -depth 1 -topN 20 &
>
> the crawldb ended up with only a small number of URLs:
> TOTAL urls: 872
> retry 0: 872
> min score: 1.0
> avg score: 1.0
> max score: 1.0
>
> I double-checked the log to make sure that every URL passes filtering and
> normalization. Here is the log:
>
> 2013-12-27 17:55:25,068 INFO  crawl.Injector - Injector: total number of urls rejected by filters: 0
> 2013-12-27 17:55:25,069 INFO  crawl.Injector - Injector: total number of urls injected after normalization and filtering: 139058
> 2013-12-27 17:55:25,069 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
>
> I don't know how 140K URLs ended up being 872 in the end...
>
> /usr/bin
>
> --
> AWS ubuntu instance
> Nutch 1.7
> java version "1.6.0_27"
> OpenJDK Runtime Environment (IcedTea6 1.12.6)
> (6b27-1.12.6-1ubuntu0.12.04.4)
> OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
>
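
For reference, the `db.update.additions.allowed` setting mentioned in the quoted message is normally overridden in `conf/nutch-site.xml`. The fragment below is only a minimal sketch, assuming the standard Nutch 1.x configuration layout; the description text is a paraphrase, not copied from `nutch-default.xml`, and this shows the setting itself rather than a fix for the 872-URL count.

```xml
<!-- conf/nutch-site.xml: local overrides of nutch-default.xml (assumed standard layout) -->
<configuration>
  <property>
    <name>db.update.additions.allowed</name>
    <!-- false: updatedb only updates URLs already in the crawldb
         and does not add newly discovered ones -->
    <value>false</value>
    <description>Paraphrased: do not add newly discovered URLs to the crawldb.</description>
  </property>
</configuration>
```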



-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304


Build failed in Jenkins: Nutch-trunk #2467

2013-12-28 Thread Apache Jenkins Server
See 

--
[...truncated 3432 lines...]

init:
[mkdir] Created dir: 

[mkdir] Created dir: 


init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-host
[javac] Compiling 1 source file to 

[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[javac] 1 warning

jar:
  [jar] Building jar: 


deps-test:

deploy:
 [copy] Copying 1 file to 


copy-generated-lib:
 [copy] Copying 1 file to 


init:
[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 


init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-pass
[javac] Compiling 1 source file to 

[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[javac] 1 warning

jar:
  [jar] Building jar: 


deps-test:

deploy:
 [copy] Copying 1 file to 


copy-generated-lib:
 [copy] Copying 1 file to 


init:
[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 


init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-querystring
[javac] Compiling 1 source file to 

[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[javac] 1 warning

jar:
  [jar] Building jar: 


deps-test:

deploy:
 [copy] Copying 1 file to 


copy-generated-lib:
 [copy] Copying 1 file to 

[mkdir] Created dir: 

 [copy] Copying 4 files to 


init:
[mkdir] Created dir: 

[mkdir] Created dir: 


init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-regex
[javac] Compiling 1 source file to 

[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[javac] 1 warning

jar:
  [jar] Building jar: