Re: trouble using nutch server

2015-04-08 Thread Sujen Shah
Hi
After a brief look at the source code, it seems you would have to use the
following:

localhost:8081/job/create
{
"crawlId":"crawl-01",
"type":"INJECT",
"confId":"default",
"args": {"seedDir":"/.../apache-nutch-2.3/runtime/local/url/"}
}
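
For illustration, the request above could be built as follows. This is a minimal Python sketch, not taken from the Nutch documentation: it assumes the endpoint accepts an HTTP POST with a JSON body, the helper name build_inject_request is hypothetical, and /path/to/seed/urls is a placeholder for the elided seed path above.

```python
import json
import urllib.request


def build_inject_request(seed_dir, host="localhost", port=8081):
    """Build (but do not send) a job-creation request for an INJECT job."""
    payload = {
        "crawlId": "crawl-01",
        "type": "INJECT",
        "confId": "default",
        # Nutch 2.x expects "seedDir" here, not the 1.x-style "url_dir".
        "args": {"seedDir": seed_dir},
    }
    return urllib.request.Request(
        url=f"http://{host}:{port}/job/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # assumption: the server expects POST for job creation
    )


# Placeholder path; replace it with your own seed directory.
req = build_inject_request("/path/to/seed/urls")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually submit the job: urllib.request.urlopen(req)
```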

I am not aware of any documentation for 2.x.

Sorry for the late reply. Hope it helps :)

Regards,
Sujen Shah
M.S - Computer Science (Class of 2016)
University of Southern California
+1(213)-820-9169
http://www.linkedin.com/in/sujenshah

On Tue, Apr 7, 2015 at 4:53 PM, Mahmoud Gzawi wrote:

> Hi. Thanks for the reply.
>
> Yes. I referred to the documentation of the Nutch 1.x REST API:
> https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI
>
> I thought it would be similar. I already had a look at the documentation
> of the Nutch 2.x API, but it's incomplete:
> https://wiki.apache.org/nutch/NutchRESTAPI
>
> Can you please tell me what the right settings are for the Inject job in
> Nutch 2? Is there any documentation for the other Nutch jobs?
>
>
>
>
> On 08/04/2015 01:16, Sujen Shah wrote:
>
> Hi,
> It seems like you are using Nutch 2.x,
> and the args you passed look like the ones from the documentation of the
> Nutch 1.x REST service.
> Could you please tell me which documentation you referred to?
>
>   Regards,
> Sujen Shah
>
> On Tue, Apr 7, 2015 at 3:58 PM, Mahmoud Gzawi wrote:
>
>> Hi everyone.
>>
>> I'm having trouble creating a job in the Nutch server. Could anyone
>> help?
>>
>> I'm trying to create a job in the Nutch server and I'm stuck at the beginning:
>>
>> localhost:8081/job/create
>> {
>> "crawlId":"crawl-01",
>> "type":"INJECT",
>> "confId":"default",
>> "args": {"crawldb":"crawl",
>> "url_dir":"/.../apache-nutch-2.3/runtime/local/url/"}
>> }
>>
>> and here's the Hadoop log:
>>
>> 2015-04-08 00:37:52,102 INFO  api.NutchServer - Starting NutchServer on port: 8081 with logging level: INFO ...
>> 2015-04-08 00:37:52,137 INFO  api.NutchServer - Started NutchServer on port 8081
>> 2015-04-08 00:38:31,384 ERROR impl.JobWorker - Cannot run job worker!
>> java.lang.NullPointerException
>>     at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:207)
>>     at org.apache.nutch.api.impl.JobWorker.run(JobWorker.java:64)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>     at java.lang.Thread.run(Thread.java:745)
>>
>> Can anyone tell me what I'm doing wrong?
>> Thanks in advance.
>>
>
>
>


[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-08 Thread lufeng (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485299#comment-14485299 ]

lufeng commented on NUTCH-1854:
---

If we set "fetcher.store.content=false" and "fetcher.parse=false", the 
"bin/nutch parse" command throws an exception when it checks that the input 
content directory exists. I think the reason we need this parameter is that 
sometimes we set "fetcher.parse" to true and do not want to store the content 
because of a slow disk or limited disk space. So I think we can remove the 
"fetcher.store.content" parameter and, if "fetcher.parse=true", simply not 
store the page content.
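
The combinations discussed in this comment can be summarized in a small sketch. It is only an illustration of the reasoning above, not Nutch code: the function name classify_fetch_config and its return labels are hypothetical, and it assumes, as the comment describes, that a separate "bin/nutch parse" step needs the stored content of a segment.

```python
def classify_fetch_config(store_content, parse_in_fetcher):
    """Classify a (fetcher.store.content, fetcher.parse) pair.

    Returns one of three hypothetical labels:
      "parsed-in-fetcher": the fetcher parses, so no separate step is needed
      "parse-later":       content is stored, a separate parse step can run
      "unparseable":       nothing was parsed and nothing was stored
    """
    if parse_in_fetcher:
        return "parsed-in-fetcher"
    if store_content:
        return "parse-later"
    return "unparseable"


for store in (True, False):
    for parse in (True, False):
        print(f"store.content={store}, parse={parse}:",
              classify_fetch_config(store, parse))
```

Only the last label corresponds to the failing case from the comment, where both properties are false.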

> ./bin/crawl fails with a parsing fetcher
> ----------------------------------------
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch
>
>
> If you run ./bin/crawl with a parsing fetcher, e.g.
> <property>
>   <name>fetcher.parse</name>
>   <value>false</value>
>   <description>If true, fetcher will parse content. Default is false,
>   which means that a separate parsing step is required after fetching is
>   finished.</description>
> </property>
> we get a horrible message as follows
> Exception in thread "main" java.io.IOException: Segment already parsed!
> We could improve this by making logging more complete and by adding a trigger 
> to the crawl script which would check for crawl_parse for a given segment and 
> then skip parsing if this is present.
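
The trigger proposed in the description could look roughly like the following. This is a minimal Python sketch, not the actual bin/crawl script (which is a shell script): the function name needs_parsing is hypothetical, and it assumes the segment lives on a local filesystem. On HDFS an equivalent existence check would be needed.

```python
import os
import tempfile


def needs_parsing(segment_dir):
    """Return True if the segment has no crawl_parse subdirectory yet."""
    return not os.path.isdir(os.path.join(segment_dir, "crawl_parse"))


# Demonstration on a throwaway directory standing in for a segment.
with tempfile.TemporaryDirectory() as segment:
    print("before parsing:", needs_parsing(segment))
    os.mkdir(os.path.join(segment, "crawl_parse"))
    print("after parsing: ", needs_parsing(segment))
```

A crawl loop would call the parse step only when needs_parsing returns True, which avoids the "Segment already parsed!" exception for segments handled by a parsing fetcher.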



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int

2015-04-08 Thread Sebastian Nagel (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485297#comment-14485297 ]

Sebastian Nagel commented on NUTCH-1247:


Close this issue? With NUTCH-578 and NUTCH-1245 resolved this problem should 
not appear any more. And we surely do not need more than 127 retries until we 
set a page to gone!

> CrawlDatum.retries should be int
> --------------------------------
>
> Key: NUTCH-1247
> URL: https://issues.apache.org/jira/browse/NUTCH-1247
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.4
> Reporter: Markus Jelsma
> Fix For: 1.11
>
> Attachments: NUTCH-1247.patch_A, NUTCH-1247.patch_B
>
>
> CrawlDatum.retries is a byte and wraps around to negative values once the count exceeds 127:
> 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -127: 1
> 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -128: 1
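
The negative retry counts in the log are consistent with a signed 8-bit counter wrapping around. A small sketch, written in Python for brevity: the helper to_signed_byte is hypothetical and only mimics the value range of a Java byte; it is not Nutch code.

```python
def to_signed_byte(n):
    """Map an integer onto the range of a Java byte (-128..127) with wraparound."""
    return (n + 128) % 256 - 128


# Counting past 127 wraps to -128, then -127, as in the log above.
for retries in (126, 127, 128, 129):
    print(retries, "->", to_signed_byte(retries))
```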





Issue with Nutch 2.3 and solr 4.9.1 on crawling website: NoSuchElementException

2015-04-08 Thread Suman Saurabh
Hi Nutch,

I am getting the following error when I use Nutch 2.3 to crawl a website, but
everything works fine with Nutch 1.9 (using Solr 4.9.1):

java.util.NoSuchElementException
    at java.util.TreeMap.key(TreeMap.java:1221)
    at java.util.TreeMap.firstKey(TreeMap.java:285)
    at org.apache.gora.memory.store.MemStore.execute(MemStore.java:125)
    at org.apache.gora.query.impl.QueryBase.execute(QueryBase.java:73)
    at org.apache.gora.mapreduce.GoraRecordReader.executeQuery(GoraRecordReader.java:68)
    at org.apache.gora.mapreduce.GoraRecordReader.nextKeyValue(GoraRecordReader.java:110)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:531)
    at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Regards,
Suman Saurabh
http://in.linkedin.com/in/ssumansaurabh
https://github.com/sumansaurabh
http://suman-saurabh.appspot.com/


[jira] [Commented] (NUTCH-1981) Upgrade icu4j

2015-04-08 Thread Marko Asplund (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485176#comment-14485176 ]

Marko Asplund commented on NUTCH-1981:
--

Not that I know of. It's probably a good idea to upgrade to the latest version. 
I used 51.1 just because that's the version we're using at the moment and it 
seems to be working reliably.

> Upgrade icu4j
> -------------
>
> Key: NUTCH-1981
> URL: https://issues.apache.org/jira/browse/NUTCH-1981
> Project: Nutch
> Issue Type: Improvement
> Components: build
> Affects Versions: 2.3, 1.9
> Reporter: Marko Asplund
> Fix For: 2.4, 1.11
>
> Attachments: NUTCH-1981.patch
>
>
> The icu4j version from 2009 is causing some compatibility issues with custom 
> plugins we're developing. Please upgrade to a more recent version.
> I'm attaching a patch to this issue. Nutch builds and all tests pass without 
> source code changes.





[jira] [Updated] (NUTCH-1981) Upgrade icu4j

2015-04-08 Thread Marko Asplund (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marko Asplund updated NUTCH-1981:
-
Summary: Upgrade icu4j  (was: Upgrade icu4j to version 51.1)



[jira] [Commented] (NUTCH-1981) Upgrade icu4j to version 51.1

2015-04-08 Thread Sebastian Nagel (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485157#comment-14485157 ]

Sebastian Nagel commented on NUTCH-1981:


There should be no problem upgrading the dependency. Is there a reason why the 
two-year-old 51.1 was chosen instead of the recent 55.1?



[jira] [Updated] (NUTCH-1981) Upgrade icu4j to version 51.1

2015-04-08 Thread Sebastian Nagel (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1981:
---
Fix Version/s: 1.11, 2.4
