Hi Lewis,

thank you for your suggestion. Unfortunately this isn't the problem. I have
also tried merging all segments together and passing the one big segment to
the invertlinks command. Same result: nothing. :-(
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# ./nutch mergesegs crawl/one-seg -dir crawl/segments/
Merging 29 segments to crawl/one-seg/20110823165144
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817164804
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817164912
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817165053
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817165524
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817170729
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817171757
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817172919
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819135218
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819165658
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819170807
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819171841
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819173350
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822135934
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822141229
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822143419
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822143824
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144031
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144232
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144435
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144617
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144750
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144927
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822145249
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822150757
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822152354
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822152503
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822153900
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822155321
SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822155732
SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# rm -rf crawl/linkdb/
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# ./nutch invertlinks crawl/linkdb crawl/one-seg/20110823165144/ -noNormalize -noFilter
LinkDb: starting at 2011-08-23 17:01:44
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: false
LinkDb: URL filter: false
LinkDb: adding segment: crawl/one-seg/20110823165144
LinkDb: finished at 2011-08-23 17:01:52, elapsed: 00:00:08
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# ./nutch readlinkdb crawl/linkdb/ -dump linkdump
LinkDb dump: starting at 2011-08-23 17:03:12
LinkDb dump: db: crawl/linkdb/
LinkDb dump: finished at 2011-08-23 17:03:13, elapsed: 00:00:01
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# cd linkdump/
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump# ll
total 0
-rwxrwxrwx 1 root root 0 Aug 23 17:03 part-00000
On 23.08.2011 16:44, lewis john mcgibbney wrote:
Hi

Small suggestion, but I do not see a -dir argument passed alongside your
initial invertlinks command. I understand that you have multiple segment
directories, fetched over the last few days, and the output would suggest
the process executed properly. However, I have never used the command
without the -dir option (it has always worked for me that way), so I can
only suggest that this may be the problem.
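[For reference, the -dir form described here would look something like the
following; this is a sketch assuming the crawl layout shown elsewhere in
the thread and a working Nutch 1.x installation on the PATH of bin/:]

```shell
# Build the linkdb from every segment under crawl/segments/ at once,
# using -dir instead of listing the segments individually:
./nutch invertlinks crawl/linkdb -dir crawl/segments -noNormalize -noFilter
```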
On Tue, Aug 23, 2011 at 3:29 PM, Marek Bachmann <[email protected]> wrote:
Hi Markus,
thank you for the quick reply. I already searched for this Configuration
error and found:
http://www.mail-archive.com/[email protected]/msg15397.html
Where they say that "This exception is innocuous - it helps to debug at
which points in the code the Configuration instances are being created.
(...)"
I don't have much disk space on the machine, but it should be enough for
the moment:
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 20G 5.9G 15G 30% /home
As I am root and all directories under
/home/nutchServer/relaunch_nutch/runtime/local/bin
are set to root:root with 755 permissions, permissions shouldn't be the problem.
Any further suggestions? :-/
Thank you once again
On 23.08.2011 16:10, Markus Jelsma wrote:
There are some peculiarities in your log:
2011-08-23 14:47:34,833 DEBUG conf.Configuration - java.io.IOException: config()
    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:211)
    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:198)
    at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:213)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:93)
    at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:373)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:800)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:190)
    at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:292)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
2011-08-23 14:47:34,922 INFO mapred.JobClient - Running job: job_local_0002
2011-08-23 14:47:34,923 DEBUG conf.Configuration - java.io.IOException: config(config)
    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:226)
    at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:184)
    at org.apache.hadoop.mapreduce.JobContext.<init>(JobContext.java:52)
    at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:32)
    at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:38)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:111)
Can you check permissions, disk space etc?
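[The suggested checks can be scripted; the following is a minimal sketch,
not part of the original mail, where the output directory "." is an
assumption -- point it at the location the linkdb is written to:]

```shell
# Report free disk space and write permission for the output directory.
out_dir=.
df -h "$out_dir"
if [ -w "$out_dir" ]; then
  echo "writable: yes"
else
  echo "writable: no"
fi
```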
On Tuesday 23 August 2011 16:05:16 Marek Bachmann wrote:
Hey ho,

for some reason the invertlinks command produces an empty linkdb.
I did:
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# ./nutch invertlinks crawl/linkdb crawl/segments/* -noNormalize -noFilter
LinkDb: starting at 2011-08-23 14:47:21
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: false
LinkDb: URL filter: false
LinkDb: adding segment: crawl/segments/20110817164804
LinkDb: adding segment: crawl/segments/20110817164912
LinkDb: adding segment: crawl/segments/20110817165053
LinkDb: adding segment: crawl/segments/20110817165524
LinkDb: adding segment: crawl/segments/20110817170729
LinkDb: adding segment: crawl/segments/20110817171757
LinkDb: adding segment: crawl/segments/20110817172919
LinkDb: adding segment: crawl/segments/20110819135218
LinkDb: adding segment: crawl/segments/20110819165658
LinkDb: adding segment: crawl/segments/20110819170807
LinkDb: adding segment: crawl/segments/20110819171841
LinkDb: adding segment: crawl/segments/20110819173350
LinkDb: adding segment: crawl/segments/20110822135934
LinkDb: adding segment: crawl/segments/20110822141229
LinkDb: adding segment: crawl/segments/20110822143419
LinkDb: adding segment: crawl/segments/20110822143824
LinkDb: adding segment: crawl/segments/20110822144031
LinkDb: adding segment: crawl/segments/20110822144232
LinkDb: adding segment: crawl/segments/20110822144435
LinkDb: adding segment: crawl/segments/20110822144617
LinkDb: adding segment: crawl/segments/20110822144750
LinkDb: adding segment: crawl/segments/20110822144927
LinkDb: adding segment: crawl/segments/20110822145249
LinkDb: adding segment: crawl/segments/20110822150757
LinkDb: adding segment: crawl/segments/20110822152354
LinkDb: adding segment: crawl/segments/20110822152503
LinkDb: adding segment: crawl/segments/20110822153900
LinkDb: adding segment: crawl/segments/20110822155321
LinkDb: adding segment: crawl/segments/20110822155732
LinkDb: merging with existing linkdb: crawl/linkdb
LinkDb: finished at 2011-08-23 14:47:35, elapsed: 00:00:14
After that:
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# ./nutch readlinkdb crawl/linkdb/ -dump linkdump
LinkDb dump: starting at 2011-08-23 14:48:26
LinkDb dump: db: crawl/linkdb/
LinkDb dump: finished at 2011-08-23 14:48:27, elapsed: 00:00:01
And then:
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# cd linkdump/
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump# ll
total 0
-rwxrwxrwx 1 root root 0 Aug 23 14:48 part-00000
root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump#
As you can see, the dump is 0 bytes.
Unfortunately I have no idea what went wrong.
I have attached the hadoop.log from the invertlinks run. Perhaps it helps
somebody?
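[A quick way to flag this situation automatically -- a sketch, not part of
the original mail, where "linkdump" matches the directory passed to -dump
above:]

```shell
# Count zero-byte part-* files in a readlinkdb dump directory.
dump_dir=${dump_dir:-linkdump}
if [ -d "$dump_dir" ]; then
  empty_parts=$(find "$dump_dir" -name 'part-*' -size 0 | wc -l)
  echo "zero-byte part files in $dump_dir: $empty_parts"
else
  echo "$dump_dir does not exist"
fi
```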