Re: trouble using nutch server
Hi,

After a brief look at the source code, it seems you would have to use the
following:

localhost:8081/job/create
{
  "crawlId":"crawl-01",
  "type":"INJECT",
  "confId":"default",
  "args": {"seedDir":"/.../apache-nutch-2.3/runtime/local/url/"}
}

I do not know about the documentation of 2.x.

Sorry for the late reply. Hope it helps :)

Regards,
Sujen Shah
M.S - Computer Science (Class of 2016)
University of Southern California
+1(213)-820-9169
http://www.linkedin.com/in/sujenshah

On Tue, Apr 7, 2015 at 4:53 PM, Mahmoud Gzawi wrote:

> Hi. Thanks for the reply.
>
> Yes. I referred to the documentation of the Nutch 1.x REST API:
> https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI
>
> I thought it would be similar. I already had a look at the documentation
> of the Nutch 2.x API, but it's incomplete:
> https://wiki.apache.org/nutch/NutchRESTAPI
>
> Can you please tell me what the right settings are for the Inject job in
> Nutch 2? Is there any documentation for the other Nutch jobs?
>
> On 08/04/2015 01:16, Sujen Shah wrote:
>
> > Hi,
> > It seems like you are using Nutch 2.x, and the args you passed look
> > like the ones from the documentation of the Nutch 1.x REST service.
> > Could you please tell me which documentation you referred to?
> >
> > Regards,
> > Sujen Shah
> >
> > On Tue, Apr 7, 2015 at 3:58 PM, Mahmoud Gzawi wrote:
> >
> > > Hi everyone.
> > >
> > > I'm having trouble creating a job in the Nutch server. I'd appreciate
> > > any help!
> > >
> > > I'm trying to create a job in the Nutch server and I'm stuck at the
> > > beginning:
> > >
> > > localhost:8081/job/create
> > > {
> > >   "crawlId":"crawl-01",
> > >   "type":"INJECT",
> > >   "confId":"default",
> > >   "args": {"crawldb":"crawl",
> > >            "url_dir":"/.../apache-nutch-2.3/runtime/local/url/"}
> > > }
> > >
> > > and here's the Hadoop log:
> > >
> > > 2015-04-08 00:37:52,102 INFO api.NutchServer - Starting NutchServer on port: 8081 with logging level: INFO ...
> > > 2015-04-08 00:37:52,137 INFO api.NutchServer - Started NutchServer on port 8081
> > > 2015-04-08 00:38:31,384 ERROR impl.JobWorker - Cannot run job worker!
> > > java.lang.NullPointerException
> > >     at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:207)
> > >     at org.apache.nutch.api.impl.JobWorker.run(JobWorker.java:64)
> > >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> > >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > >     at java.lang.Thread.run(Thread.java:745)
> > >
> > > Can anyone tell me what I'm doing wrong?
> > > Thanks in advance.
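
For readers who want to script the request from this thread, here is a minimal Python sketch that builds the INJECT job body with the "seedDir" argument suggested in the reply. The function names, the "http://" scheme, and the use of POST are assumptions made for illustration; the thread only gives the address localhost:8081/job/create and the JSON body. The seed directory stays elided exactly as in the thread and must be replaced with a real path.

```python
import json
import urllib.request

# Assumption: the server from the thread is reachable over plain HTTP on port 8081.
JOB_CREATE_URL = "http://localhost:8081/job/create"


def build_inject_job(seed_dir, crawl_id="crawl-01", conf_id="default"):
    """Return the INJECT job body from the reply as a dict (hypothetical helper)."""
    return {
        "crawlId": crawl_id,
        "type": "INJECT",
        "confId": conf_id,
        # The reply uses "seedDir" instead of the 1.x-style "crawldb"/"url_dir" keys.
        "args": {"seedDir": seed_dir},
    }


def submit_job(job, url=JOB_CREATE_URL):
    """Send the job as JSON and return the response text.

    Assumption: the endpoint accepts a POST with a JSON body.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(job).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # The path is elided in the thread; substitute the real seed directory.
    job = build_inject_job("/.../apache-nutch-2.3/runtime/local/url/")
    print(json.dumps(job, indent=2))
```

Running the script only prints the body; calling submit_job(job) would send it to a running server.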
[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485299#comment-14485299 ]

lufeng commented on NUTCH-1854:
-------------------------------

If we set "fetcher.store.content=false" and "fetcher.parse=false", the
"bin/nutch parse" command throws an exception when it checks whether the
input content directory exists.

I think the reason we need this parameter is that sometimes we set
"fetcher.parse" to true and do not want to store the content because of a
slow disk or limited disk space. So I think we can remove the
"fetcher.store.content" parameter and simply not store the page content
when "fetcher.parse=true".

> ./bin/crawl fails with a parsing fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1854
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1854
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.9
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.11
>
>         Attachments: NUTCH-1854ver1.patch
>
>
> If you run ./bin/crawl with a parsing fetcher e.g.
> <property>
>   <name>fetcher.parse</name>
>   <value>false</value>
>   <description>If true, fetcher will parse content. Default is false,
>   which means that a separate parsing step is required after fetching is
>   finished.</description>
> </property>
> we get a horrible message as follows
> Exception in thread "main" java.io.IOException: Segment already parsed!
> We could improve this by making logging more complete and by adding a
> trigger to the crawl script which would check for crawl_parse for a given
> segment and then skip parsing if this is present.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
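
The check proposed in the issue description (look for crawl_parse in a segment and skip parsing if it is present) could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the segment is a directory on the local filesystem, and the function names are hypothetical. The real bin/crawl is a shell script and segments may live on HDFS, so an actual fix would look different.

```python
import os
import tempfile


def segment_already_parsed(segment_dir):
    """Return True if the segment directory contains a crawl_parse subdirectory."""
    return os.path.isdir(os.path.join(segment_dir, "crawl_parse"))


def should_run_parse(segment_dir):
    """Decide whether a separate parse step is still needed for this segment."""
    return not segment_already_parsed(segment_dir)


if __name__ == "__main__":
    # Demonstrate the decision on a throwaway local directory.
    with tempfile.TemporaryDirectory() as segment:
        print("parse needed before:", should_run_parse(segment))
        os.mkdir(os.path.join(segment, "crawl_parse"))
        print("parse needed after:", should_run_parse(segment))
```

A crawl script following this idea would call the parse step only when should_run_parse returns True, which avoids the "Segment already parsed!" exception for segments written by a parsing fetcher.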
[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int
[ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485297#comment-14485297 ]

Sebastian Nagel commented on NUTCH-1247:
----------------------------------------

Close this issue? With NUTCH-578 and NUTCH-1245 resolved, this problem
should not appear any more. And we surely do not need more than 127 retries
until we set a page to gone!

> CrawlDatum.retries should be int
> --------------------------------
>
>                 Key: NUTCH-1247
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1247
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>             Fix For: 1.11
>
>         Attachments: NUTCH-1247.patch_A, NUTCH-1247.patch_B
>
>
> CrawlDatum.retries is a byte and goes bad with larger values.
> 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -127: 1
> 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -128: 1
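
The negative retry values in the quoted log are what a signed 8-bit counter produces once it passes 127. The Python sketch below imitates that wrap-around with plain arithmetic. It is an illustration only: the helper names are hypothetical, and it does not reproduce the actual CrawlDatum code.

```python
def to_signed_byte(value):
    """Wrap an integer into the signed 8-bit range [-128, 127], like a Java (byte) cast."""
    return (value + 128) % 256 - 128


def increment_retries(retries):
    """Increment a retry counter that is stored as a signed byte."""
    return to_signed_byte(retries + 1)


if __name__ == "__main__":
    # Counts of 128 and 129 no longer fit into a signed byte and turn negative.
    for count in (126, 127, 128, 129):
        print(count, "->", to_signed_byte(count))
```

Counts of 128 and 129 map to -128 and -127, the two values reported by CrawlDbReader in the issue.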
Issue with Nutch 2.3 and solr 4.9.1 on crawling website: NoSuchElementException
Hi Nutch,

I get the following error when I use Nutch 2.3 to crawl a website, but
everything works fine with Nutch 1.9. I am using Solr 4.9.1.

java.util.NoSuchElementException
    at java.util.TreeMap.key(TreeMap.java:1221)
    at java.util.TreeMap.firstKey(TreeMap.java:285)
    at org.apache.gora.memory.store.MemStore.execute(MemStore.java:125)
    at org.apache.gora.query.impl.QueryBase.execute(QueryBase.java:73)
    at org.apache.gora.mapreduce.GoraRecordReader.executeQuery(GoraRecordReader.java:68)
    at org.apache.gora.mapreduce.GoraRecordReader.nextKeyValue(GoraRecordReader.java:110)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:531)
    at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Regards,
Suman Saurabh
http://in.linkedin.com/in/ssumansaurabh
https://github.com/sumansaurabh
http://suman-saurabh.appspot.com/
[jira] [Commented] (NUTCH-1981) Upgrade icu4j
[ https://issues.apache.org/jira/browse/NUTCH-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485176#comment-14485176 ]

Marko Asplund commented on NUTCH-1981:
--------------------------------------

Not that I know of. It's probably a good idea to upgrade to the latest
version. I used 51.1 just because that's the version we're using at the
moment and it seems to be working reliably.

> Upgrade icu4j
> -------------
>
>                 Key: NUTCH-1981
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1981
>             Project: Nutch
>          Issue Type: Improvement
>          Components: build
>    Affects Versions: 2.3, 1.9
>            Reporter: Marko Asplund
>             Fix For: 2.4, 1.11
>
>         Attachments: NUTCH-1981.patch
>
>
> The icu4j version from 2009 is causing some compatibility issues with
> custom plugins we're developing. Please upgrade to a more recent version.
> I'm attaching a patch to this issue. Nutch builds and all tests pass
> without source code changes.
[jira] [Updated] (NUTCH-1981) Upgrade icu4j
[ https://issues.apache.org/jira/browse/NUTCH-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marko Asplund updated NUTCH-1981:
---------------------------------
    Summary: Upgrade icu4j  (was: Upgrade icu4j to version 51.1)
[jira] [Commented] (NUTCH-1981) Upgrade icu4j to version 51.1
[ https://issues.apache.org/jira/browse/NUTCH-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485157#comment-14485157 ]

Sebastian Nagel commented on NUTCH-1981:
----------------------------------------

Upgrading the dependency should be no problem. Is there a reason why the
two-year-old 51.1 was chosen instead of the recent 55.1?
[jira] [Updated] (NUTCH-1981) Upgrade icu4j to version 51.1
[ https://issues.apache.org/jira/browse/NUTCH-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1981:
-----------------------------------
    Fix Version/s: 1.11
                   2.4