Build failed in Jenkins: Nutch-trunk #2468

2013-12-29 Thread Apache Jenkins Server
See 

--
[...truncated 3380 lines...]

init:
[mkdir] Created dir: 

[mkdir] Created dir: 


init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-host
[javac] Compiling 1 source file to 

[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[javac] 1 warning

jar:
  [jar] Building jar: 


deps-test:

deploy:
 [copy] Copying 1 file to 


copy-generated-lib:
 [copy] Copying 1 file to 


init:
[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 


init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-pass
[javac] Compiling 1 source file to 

[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[javac] 1 warning

jar:
  [jar] Building jar: 


deps-test:

deploy:
 [copy] Copying 1 file to 


copy-generated-lib:
 [copy] Copying 1 file to 


init:
[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 


init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-querystring
[javac] Compiling 1 source file to 

[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[javac] 1 warning

jar:
  [jar] Building jar: 


deps-test:

deploy:
 [copy] Copying 1 file to 


copy-generated-lib:
 [copy] Copying 1 file to 

[mkdir] Created dir: 

 [copy] Copying 4 files to 


init:
[mkdir] Created dir: 

[mkdir] Created dir: 


init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-regex
[javac] Compiling 1 source file to 

[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[javac] 1 warning

jar:
  [jar] Building jar: 


Build failed in Jenkins: Nutch-nutchgora #865

2013-12-29 Thread Apache Jenkins Server
See 

--
[...truncated 203 lines...]
at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection._request(HTTPConnection.java:775)
... 33 more
Caused by: svn: E175002: timed out waiting for server
at org.tmatesoft.svn.core.SVNErrorMessage.create(SVNErrorMessage.java:208)
at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection._request(HTTPConnection.java:514)
... 33 more
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:327)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:193)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:180)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:385)
at java.net.Socket.connect(Socket.java:546)
at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:590)
at org.tmatesoft.svn.core.internal.util.SVNSocketConnection.run(SVNSocketConnection.java:57)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
... 5 more
java.io.IOException: remote file operation failed:  at hudson.remoting.Channel@15721a72:ubuntu5
at hudson.FilePath.act(FilePath.java:910)
at hudson.FilePath.act(FilePath.java:887)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:848)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:786)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1414)
at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:652)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:88)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:561)
at hudson.model.Run.execute(Run.java:1678)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:231)
Caused by: java.io.IOException: Failed to check out https://svn.apache.org/repos/asf/nutch/branches/2.x
at hudson.scm.subversion.CheckoutUpdater$1.perform(CheckoutUpdater.java:110)
at hudson.scm.subversion.WorkspaceUpdater$UpdateTask.delegateTo(WorkspaceUpdater.java:161)
at hudson.scm.SubversionSCM$CheckOutTask.perform(SubversionSCM.java:908)
at hudson.scm.SubversionSCM$CheckOutTask.invoke(SubversionSCM.java:889)
at hudson.scm.SubversionSCM$CheckOutTask.invoke(SubversionSCM.java:872)
at hudson.FilePath$FileCallableWrapper.call(FilePath.java:2461)
at hudson.remoting.UserRequest.perform(UserRequest.java:118)
at hudson.remoting.UserRequest.perform(UserRequest.java:48)
at hudson.remoting.Request$2.run(Request.java:328)
at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:679)
Caused by: org.tmatesoft.svn.core.SVNException: svn: E175002: OPTIONS /repos/asf/nutch/branches/2.x failed
at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection.request(HTTPConnection.java:388)
at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection.request(HTTPConnection.java:373)
at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection.request(HTTPConnection.java:361)
at org.tmatesoft.svn.core.internal.io.dav.DAVConnection.performHttpRequest(DAVConnection.java:707)
at org.tmatesoft.svn.core.internal.io.dav.DAVConnection.exchangeCapabilities(DAVConnection.java:627)
at org.tmatesoft.svn.core.internal.io.dav.DAVConnection.open(DAVConnection.java:102)
at org.tmatesoft.svn.core.internal.io.dav.DAVRepository.openConnection(DAVRepository.java:1020)
at org.tmatesoft.svn.core.internal.io.dav.DAVRepository.getLatestRevision(DAVRepository.java:180)
at org.tmatesoft.svn.core.internal.wc16.SVNBasicDelegate.getRevisionNumber(SVNBasicDelegate.java:480)
at org.tmatesoft.svn.core.internal.wc16.SVNBasicDelegate.getLocations(SVNBasicDelegate.java:833)
at org.tmatesoft.svn.core.internal.wc16.SVNBasicDelegate.createRepository(SVNBasicDelegate.java:527)
at org.tmatesoft.svn.core.internal.wc16.SVNUpdateClient16.doCheckout(SVNUpdateClient16.java:875)
at org.tmatesoft.svn.core.in

[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2013-12-29 Thread Tien Nguyen Manh (JIRA)

 [ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tien Nguyen Manh updated NUTCH-1687:


Attachment: NUTCH-1687.patch

Added Apache license header.
Fixed lost tail pointer when deleting.

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1687.patch
>
>
> Currently we choose the queue to pick a URL from by starting at the head of the
> queues list, so queues at the start of the list have a better chance of being
> picked first. That can cause a long-tail problem, where only a few queues are
> left at the end, each still holding many URLs.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
> queues.entrySet().iterator(); ==> always resets, so the search for an
> available queue starts from the head of the list
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin: that reduces the time spent
> finding an available queue, ensures every queue gets picked in turn, and if we
> use topN during the generate step there is no long tail of queues at the end.
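
A minimal, self-contained sketch of the round-robin idea described above. This is
illustrative only, not the attached NUTCH-1687.patch; the class and method names
below (RoundRobinQueues, add, next) are made up for the example, and plain string
queues stand in for Nutch's FetchItemQueue:

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Queue;

    // Toy model of per-host fetch queues, served round robin instead of
    // always scanning from the head of the map.
    public class RoundRobinQueues {
      private final Map<String, Queue<String>> queues =
          new LinkedHashMap<String, Queue<String>>();
      private final List<String> ids = new ArrayList<String>(); // stable order of queue ids
      private int cursor = 0;                                   // where the next pick starts

      public synchronized void add(String queueId, String url) {
        Queue<String> q = queues.get(queueId);
        if (q == null) {
          q = new ArrayDeque<String>();
          queues.put(queueId, q);
          ids.add(queueId);
        }
        q.add(url);
      }

      // Returns the next URL, visiting queues in round robin so the head of the
      // map is not favoured; returns null when every queue is empty.
      public synchronized String next() {
        for (int i = 0; i < ids.size(); i++) {
          int idx = (cursor + i) % ids.size();
          Queue<String> q = queues.get(ids.get(idx));
          String url = q.poll();
          if (url != null) {
            cursor = (idx + 1) % ids.size(); // resume after the queue just served
            return url;
          }
        }
        return null;
      }
    }

Successive calls to next() then rotate across hosts (a, b, c, a, b, c, ...) instead
of repeatedly favouring whichever queue happens to sit first in the map.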



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2013-12-29 Thread Tien Nguyen Manh (JIRA)

 [ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tien Nguyen Manh updated NUTCH-1687:


Attachment: (was: NUTCH-1687.patch)

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1687.patch
>
>
> Currently we choose the queue to pick a URL from by starting at the head of the
> queues list, so queues at the start of the list have a better chance of being
> picked first. That can cause a long-tail problem, where only a few queues are
> left at the end, each still holding many URLs.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
> queues.entrySet().iterator(); ==> always resets, so the search for an
> available queue starts from the head of the list
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin: that reduces the time spent
> finding an available queue, ensures every queue gets picked in turn, and if we
> use topN during the generate step there is no long tail of queues at the end.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: Nutch Crawl a Specific List Of URLs (150K)

2013-12-29 Thread Tejas Patil
Hi Bin Wang,

>> nohup bin/nutch crawl urls -dir result -depth 1 -topN 20 &
Were you creating a new crawldb or reusing an old one?

Were you running this on a cluster or in local mode?
Was there any failure that caused the fetch round to abort? (See the logs
for this.)

I would like to reproduce this issue. Would it be possible for you to share
your config files and a subset of the URLs?

Thanks,
Tejas


On Sat, Dec 28, 2013 at 2:10 AM, Talat Uyarer  wrote:

> Hi Bin,
>
> You have an interesting error. I don't use 1.7, but I can try with the
> screen command. I believe you will not get the same error.
>
> Talat
>
>
> 2013/12/27 Bin Wang 
>
>> Hi,
>>
>> I have a very specific list of URLs, about 140K of them.
>>
>> I switched off `db.update.additions.allowed` so it will not update the
>> crawldb, and I was assuming I could feed all the URLs to Nutch and that,
>> after one round of fetching, it would finish and leave all the raw HTML
>> files in the segment folder.
>>
>> However, after I run this command:
>> nohup bin/nutch crawl urls -dir result -depth 1 -topN 20 &
>>
>> It ended up with a small number of URLs:
>> TOTAL urls: 872
>> retry 0: 872
>> min score: 1.0
>> avg score: 1.0
>> max score: 1.0
>>
>> And I double-checked the log to make sure that every URL passed the
>> filters and normalization. Here is the log:
>>
>> 2013-12-27 17:55:25,068 INFO  crawl.Injector - Injector: total number of
>> urls rejected by filters: 0
>> 2013-12-27 17:55:25,069 INFO  crawl.Injector - Injector: total number of
>> urls injected after normalization and filtering: 139058
>> 2013-12-27 17:55:25,069 INFO  crawl.Injector - Injector: Merging injected
>> urls into crawl db.
>>
>> I don't know how 140K URLs ended up as just 872 in the end...
>>
>> /usr/bin
>>
>> --
>> AWS ubuntu instance
>> Nutch 1.7
>> java version "1.6.0_27"
>> OpenJDK Runtime Environment (IcedTea6 1.12.6)
>> (6b27-1.12.6-1ubuntu0.12.04.4)
>> OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
>>
>
>
>
> --
> Talat UYARER
> Website: http://talat.uyarer.com
> Twitter: http://twitter.com/talatuyarer
> Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
>
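
A note on the configuration discussed in this thread: db.update.additions.allowed
is an ordinary Nutch property, normally overridden in conf/nutch-site.xml. A
minimal sketch of the override Bin describes (assuming the intent is to keep
updatedb from adding newly discovered outlinks, so only the injected URLs remain
in the crawldb):

    <!-- conf/nutch-site.xml: stop updatedb from adding newly discovered
         outlinks to the crawldb; only injected URLs are kept -->
    <property>
      <name>db.update.additions.allowed</name>
      <value>false</value>
    </property>

Separately, with the one-shot crawl command quoted above, -depth 1 runs a single
generate/fetch/update cycle and -topN 20 caps how many top-scoring URLs the
generator selects for that cycle, so only a small slice of the injected list would
be fetched in a single round regardless of the crawldb contents.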