Hi

Small suggestion, but I don't see a -dir argument passed alongside your
initial invertlinks command. I understand that you have multiple segment
directories fetched over the last few days, and the output does suggest
the process ran properly; however, I have never used the command without
the -dir option (and it has always worked for me that way), so I can only
suggest that this may be the problem.
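For reference, a minimal sketch of the invocation I mean, reusing the paths from your session below; with -dir, LinkDb scans the segments directory itself rather than relying on the shell glob to enumerate each segment:

```shell
# Hypothetical invocation, assuming the same crawl/ layout as in your session.
# -dir points LinkDb at the parent directory of the segments, so it discovers
# crawl/segments/20110817164804 etc. on its own.
./nutch invertlinks crawl/linkdb -dir crawl/segments -noNormalize -noFilter
```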



On Tue, Aug 23, 2011 at 3:29 PM, Marek Bachmann <[email protected]> wrote:

> Hi Markus,
>
> thank you for the quick reply. I already searched for this Configuration
> error and found:
>
> http://www.mail-archive.com/[email protected]/msg15397.html
>
> Where they say that "This exception is innocuous - it helps to debug at
> which points in the code the Configuration instances are being created.
> (...)"
>
> I indeed don't have much disk space on the machine, but it should be enough
> at the moment:
>
> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# df -h .
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/vda1              20G  5.9G   15G  30% /home
>
> As I am root and all directories under
> /home/nutchServer/relaunch_nutch/runtime/local/bin
> are owned by root:root with 755 permissions, permissions shouldn't be the
> problem.
>
> Any further suggestions? :-/
>
> Thank you once again
>
>
>
> On 23.08.2011 16:10, Markus Jelsma wrote:
>
>> There are some peculiarities in your log:
>>
>> 2011-08-23 14:47:34,833 DEBUG conf.Configuration - java.io.IOException: config()
>>        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:211)
>>        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:198)
>>        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:213)
>>        at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:93)
>>        at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:373)
>>        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:800)
>>        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:190)
>>        at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:292)
>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>        at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>
>> 2011-08-23 14:47:34,922 INFO  mapred.JobClient - Running job: job_local_0002
>> 2011-08-23 14:47:34,923 DEBUG conf.Configuration - java.io.IOException: config(config)
>>        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:226)
>>        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:184)
>>        at org.apache.hadoop.mapreduce.JobContext.<init>(JobContext.java:52)
>>        at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:32)
>>        at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:38)
>>        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:111)
>>
>>
>> Can you check permissions, disk space etc?
>>
>>
>>
>> On Tuesday 23 August 2011 16:05:16 Marek Bachmann wrote:
>>
>>> Hey Ho,
>>>
>>> for some reason the invertlinks command produces an empty linkdb.
>>>
>>> I did:
>>>
>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
>>> ./nutch invertlinks crawl/linkdb crawl/segments/* -noNormalize -noFilter
>>> LinkDb: starting at 2011-08-23 14:47:21
>>> LinkDb: linkdb: crawl/linkdb
>>> LinkDb: URL normalize: false
>>> LinkDb: URL filter: false
>>> LinkDb: adding segment: crawl/segments/20110817164804
>>> LinkDb: adding segment: crawl/segments/20110817164912
>>> LinkDb: adding segment: crawl/segments/20110817165053
>>> LinkDb: adding segment: crawl/segments/20110817165524
>>> LinkDb: adding segment: crawl/segments/20110817170729
>>> LinkDb: adding segment: crawl/segments/20110817171757
>>> LinkDb: adding segment: crawl/segments/20110817172919
>>> LinkDb: adding segment: crawl/segments/20110819135218
>>> LinkDb: adding segment: crawl/segments/20110819165658
>>> LinkDb: adding segment: crawl/segments/20110819170807
>>> LinkDb: adding segment: crawl/segments/20110819171841
>>> LinkDb: adding segment: crawl/segments/20110819173350
>>> LinkDb: adding segment: crawl/segments/20110822135934
>>> LinkDb: adding segment: crawl/segments/20110822141229
>>> LinkDb: adding segment: crawl/segments/20110822143419
>>> LinkDb: adding segment: crawl/segments/20110822143824
>>> LinkDb: adding segment: crawl/segments/20110822144031
>>> LinkDb: adding segment: crawl/segments/20110822144232
>>> LinkDb: adding segment: crawl/segments/20110822144435
>>> LinkDb: adding segment: crawl/segments/20110822144617
>>> LinkDb: adding segment: crawl/segments/20110822144750
>>> LinkDb: adding segment: crawl/segments/20110822144927
>>> LinkDb: adding segment: crawl/segments/20110822145249
>>> LinkDb: adding segment: crawl/segments/20110822150757
>>> LinkDb: adding segment: crawl/segments/20110822152354
>>> LinkDb: adding segment: crawl/segments/20110822152503
>>> LinkDb: adding segment: crawl/segments/20110822153900
>>> LinkDb: adding segment: crawl/segments/20110822155321
>>> LinkDb: adding segment: crawl/segments/20110822155732
>>> LinkDb: merging with existing linkdb: crawl/linkdb
>>> LinkDb: finished at 2011-08-23 14:47:35, elapsed: 00:00:14
>>>
>>> After that:
>>>
>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
>>> ./nutch readlinkdb crawl/linkdb/ -dump linkdump
>>> LinkDb dump: starting at 2011-08-23 14:48:26
>>> LinkDb dump: db: crawl/linkdb/
>>> LinkDb dump: finished at 2011-08-23 14:48:27, elapsed: 00:00:01
>>>
>>> And then:
>>>
>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
>>> cd linkdump/
>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump#
>>> ll
>>> total 0
>>> -rwxrwxrwx 1 root root 0 Aug 23 14:48 part-00000
>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump#
>>>
>>> As you can see, the dump size is 0 bytes.
>>>
>>> Unfortunately I have no idea what went wrong.
>>>
>>> I have attached the hadoop.log for the invertlinks process. Perhaps it
>>> helps somebody?
>>>
>>
>>
>


-- 
*Lewis*
