Hi Lewis,

You are right.

I think the problem is a bit more general: there are some tools which aren't very verbose about which configuration they use (and some tools don't tell you much at all ;-) ).

I think many of the problems discussed on this list were related to a wrong configuration file or to a default option the user hadn't noticed.

So it would be great if, when we ran a command, it told us which config file it uses and which values it has detected (or which defaults it falls back on).
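
Just to sketch what I mean (an untested sketch; I am assuming the tools
obtain their config via NutchConfiguration.create()):

    // Rough sketch: print where the config came from and every effective value.
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    public class DumpEffectiveConf {
      public static void main(String[] args) {
        Configuration conf = NutchConfiguration.create();
        // toString() names the resources the Configuration was built from
        // (e.g. nutch-default.xml, nutch-site.xml).
        System.out.println("Loaded from: " + conf);
        // A Hadoop Configuration is iterable over its effective
        // key/value pairs, defaults included.
        for (Map.Entry<String, String> e : conf) {
          System.out.println(e.getKey() + " = " + e.getValue());
        }
      }
    }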

Unfortunately I have no knowledge of the configuration architecture that Nutch uses. It seems to me that the options for most of the tools can be defined in nutch-site.xml, but I am not aware of how and WHERE this file is interpreted by the tools.
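
My guess, before having read the sources, is that there is one central
factory that simply stacks the config files as Hadoop resources, roughly
like this (hypothetical sketch; the names are my assumption, not verified):

    import org.apache.hadoop.conf.Configuration;

    public class ConfGuess {
      public static Configuration create() {
        Configuration conf = new Configuration();
        // Resources added later override earlier ones, so values from
        // nutch-site.xml would win over the shipped defaults.
        conf.addResource("nutch-default.xml");
        conf.addResource("nutch-site.xml");
        return conf;
      }
    }

If it really is one spot that every tool goes through, adding the logging
there should be easy.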

I think I'll take a look at the sources and see if I can manage to teach the individual tools to be more verbose. :)

By the way, that touches on a point that has been on my mind for a while: how can I know which options are available for a tool to be configured in nutch-site.xml?
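
My suspicion is that nutch-default.xml is the canonical list, since every
option seems to be declared there together with a description. From memory
(the exact wording may differ), an entry looks something like:

    <property>
      <name>db.ignore.internal.links</name>
      <value>true</value>
      <description>If true, when adding new links to a page, links from
      the same host are ignored.</description>
    </property>

If someone knows a better reference, I'd be glad to hear about it.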

For now I'll acquaint myself with JIRA, since I have never worked with it.

regards,

Marek

On 23.08.2011 23:56, [email protected] wrote:
Hi Marek,

You make a reasonable point. If you feel that this is something that
should be integrated, then maybe consider filing a JIRA issue with a
comprehensive description of the problem and a proposed solution. If you
do not actually patch this yourself, then maybe someone else who
experiences the same can provide a patch in the future, as it would be a
nice record of the situation identified by Sergey.


On Aug 23, 2011 10:48pm, Marek Bachmann <[email protected]> wrote:
Oh yes, thank you very much Sergey, that was the problem.

Would have been nice if the invertlinks command had told me that it had
ignored them :-)

Cheers,

Marek



On 23.08.2011 19:26, Sergey A Volkov wrote:
> Hi
>
> Is it possible that you fetch documents from just one site/domain?
>
> Looks like by default Nutch ignores internal site links
> (db.ignore.internal.links).
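
An override for that default in conf/nutch-site.xml would presumably look
like this (untested sketch):

    <property>
      <name>db.ignore.internal.links</name>
      <value>false</value>
    </property>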

>
> Sergey Volkov
>
> On 08/23/2011 07:04 PM, Marek Bachmann wrote:

>> Hi Lewis,
>>
>> thank you for your suggestion.
>> Unfortunately this isn't the problem. Actually I have also tried to
>> merge all segments together and pass the one big segment to the
>> invertlinks command. Same (none) effect. :-(
>>

>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
>> ./nutch mergesegs crawl/one-seg -dir crawl/segments/
>> Merging 29 segments to crawl/one-seg/20110823165144
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817164804
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817164912
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817165053
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817165524
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817170729
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817171757
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110817172919
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819135218
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819165658
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819170807
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819171841
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110819173350
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822135934
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822141229
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822143419
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822143824
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144031
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144232
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144435
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144617
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144750
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822144927
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822145249
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822150757
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822152354
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822152503
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822153900
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822155321
>> SegmentMerger: adding file:/home/nutchServer/relaunch_nutch/runtime/local/bin/crawl/segments/20110822155732

>> SegmentMerger: using segment data from: content crawl_generate
>> crawl_fetch crawl_parse parse_data parse_text
>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# rm
>> -rf crawl/linkdb/
>>
>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
>> ./nutch invertlinks crawl/linkdb crawl/one-seg/20110823165144/
>> -noNormalize -noFilter
>> LinkDb: starting at 2011-08-23 17:01:44
>> LinkDb: linkdb: crawl/linkdb
>> LinkDb: URL normalize: false
>> LinkDb: URL filter: false
>> LinkDb: adding segment: crawl/one-seg/20110823165144
>> LinkDb: finished at 2011-08-23 17:01:52, elapsed: 00:00:08
>>
>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
>> ./nutch readlinkdb crawl/linkdb/ -dump linkdump
>> LinkDb dump: starting at 2011-08-23 17:03:12
>> LinkDb dump: db: crawl/linkdb/
>> LinkDb dump: finished at 2011-08-23 17:03:13, elapsed: 00:00:01
>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# cd
>> linkdump/
>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump#
>> ll
>> total 0
>> -rwxrwxrwx 1 root root 0 Aug 23 17:03 part-00000
>>

>> On 23.08.2011 16:44, lewis john mcgibbney wrote:
>>> Hi
>>>
>>> Small suggestion, but I do not see any -dir argument passed alongside
>>> your initial invertlinks command. I understand that you have multiple
>>> segment directories, which have been fetched over a recent number of
>>> days, and that the output would also suggest the process was properly
>>> executed. However, I have never used the command without the -dir
>>> option (as it has always worked for me that way), so I can only
>>> suggest that this may be the problem.
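
For reference, the -dir invocation lewis describes would presumably look
like this (untested; the flags are the ones used elsewhere in this thread):

    ./nutch invertlinks crawl/linkdb -dir crawl/segments -noNormalize -noFilter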

>>>
>>> On Tue, Aug 23, 2011 at 3:29 PM, Marek <[email protected]> wrote:

>>>
>>>> Hi Markus,
>>>>
>>>> thank you for the quick reply. I already searched for this Configuration
>>>> error and found:
>>>>
>>>> http://www.mail-archive.com/[email protected]/msg15397.html
>>>>
>>>> where they say that "This exception is innocuous - it helps to debug at
>>>> which points in the code the Configuration instances are being created.
>>>> (...)"
>>>>
>>>> I have indeed not much disk space on the machine, but it should be
>>>> enough at the moment:
>>>>
>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# df -h .
>>>> Filesystem Size Used Avail Use% Mounted on
>>>> /dev/vda1 20G 5.9G 15G 30% /home
>>>>
>>>> As I am root, and all directories under
>>>> /home/nutchServer/relaunch_nutch/runtime/local/bin
>>>> are set to root:root with 755 permissions, permissions shouldn't be the
>>>> problem.
>>>>
>>>> Any further suggestions? :-/
>>>>
>>>> Thank you once again
>>>>
>>>> On 23.08.2011 16:10, Markus Jelsma wrote:

>>>>

>>>>> There are some peculiarities in your log:
>>>>>
>>>>> 2011-08-23 14:47:34,833 DEBUG conf.Configuration - java.io.IOException:
>>>>> config()
>>>>> at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:211)
>>>>> at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:198)
>>>>> at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:213)
>>>>> at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:93)
>>>>> at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:373)
>>>>> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:800)
>>>>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>>>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:190)
>>>>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:292)
>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>>>
>>>>> 2011-08-23 14:47:34,922 INFO mapred.JobClient - Running job:
>>>>> job_local_0002
>>>>> 2011-08-23 14:47:34,923 DEBUG conf.Configuration - java.io.IOException:
>>>>> config(config)
>>>>> at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:226)
>>>>> at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:184)
>>>>> at org.apache.hadoop.mapreduce.JobContext.<init>(JobContext.java:52)
>>>>> at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:32)
>>>>> at org.apache.hadoop.mapred.JobContext.<init>(JobContext.java:38)
>>>>> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:111)
>>>>>
>>>>> Can you check permissions, disk space etc?
>>>>>

>>>>> On Tuesday 23 August 2011 16:05:16 Marek Bachmann wrote:
>>>>>
>>>>>> Hey Ho,
>>>>>>
>>>>>> for some reason the invertlinks command produces an empty linkdb.
>>>>>>
>>>>>> I did:
>>>>>>
>>>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
>>>>>> ./nutch invertlinks crawl/linkdb crawl/segments/* -noNormalize
>>>>>> -noFilter
>>>>>> LinkDb: starting at 2011-08-23 14:47:21
>>>>>> LinkDb: linkdb: crawl/linkdb
>>>>>> LinkDb: URL normalize: false
>>>>>> LinkDb: URL filter: false
>>>>>> LinkDb: adding segment: crawl/segments/20110817164804
>>>>>> LinkDb: adding segment: crawl/segments/20110817164912
>>>>>> LinkDb: adding segment: crawl/segments/20110817165053
>>>>>> LinkDb: adding segment: crawl/segments/20110817165524
>>>>>> LinkDb: adding segment: crawl/segments/20110817170729
>>>>>> LinkDb: adding segment: crawl/segments/20110817171757
>>>>>> LinkDb: adding segment: crawl/segments/20110817172919
>>>>>> LinkDb: adding segment: crawl/segments/20110819135218
>>>>>> LinkDb: adding segment: crawl/segments/20110819165658
>>>>>> LinkDb: adding segment: crawl/segments/20110819170807
>>>>>> LinkDb: adding segment: crawl/segments/20110819171841
>>>>>> LinkDb: adding segment: crawl/segments/20110819173350
>>>>>> LinkDb: adding segment: crawl/segments/20110822135934
>>>>>> LinkDb: adding segment: crawl/segments/20110822141229
>>>>>> LinkDb: adding segment: crawl/segments/20110822143419
>>>>>> LinkDb: adding segment: crawl/segments/20110822143824
>>>>>> LinkDb: adding segment: crawl/segments/20110822144031
>>>>>> LinkDb: adding segment: crawl/segments/20110822144232
>>>>>> LinkDb: adding segment: crawl/segments/20110822144435
>>>>>> LinkDb: adding segment: crawl/segments/20110822144617
>>>>>> LinkDb: adding segment: crawl/segments/20110822144750
>>>>>> LinkDb: adding segment: crawl/segments/20110822144927
>>>>>> LinkDb: adding segment: crawl/segments/20110822145249
>>>>>> LinkDb: adding segment: crawl/segments/20110822150757
>>>>>> LinkDb: adding segment: crawl/segments/20110822152354
>>>>>> LinkDb: adding segment: crawl/segments/20110822152503
>>>>>> LinkDb: adding segment: crawl/segments/20110822153900
>>>>>> LinkDb: adding segment: crawl/segments/20110822155321
>>>>>> LinkDb: adding segment: crawl/segments/20110822155732
>>>>>> LinkDb: merging with existing linkdb: crawl/linkdb
>>>>>> LinkDb: finished at 2011-08-23 14:47:35, elapsed: 00:00:14
>>>>>>
>>>>>> After that:
>>>>>>
>>>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin#
>>>>>> ./nutch readlinkdb crawl/linkdb/ -dump linkdump
>>>>>> LinkDb dump: starting at 2011-08-23 14:48:26
>>>>>> LinkDb dump: db: crawl/linkdb/
>>>>>> LinkDb dump: finished at 2011-08-23 14:48:27, elapsed: 00:00:01
>>>>>>
>>>>>> And then:
>>>>>>
>>>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin# cd
>>>>>> linkdump/
>>>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump#
>>>>>> ll
>>>>>> total 0
>>>>>> -rwxrwxrwx 1 root root 0 Aug 23 14:48 part-00000
>>>>>> root@hrz-vm180:/home/nutchServer/relaunch_nutch/runtime/local/bin/linkdump#
>>>>>>
>>>>>> As you see, the dump size is 0 bytes.
>>>>>>
>>>>>> Unfortunately I have no idea what went wrong.
>>>>>>
>>>>>> I have attached the hadoop.log for the invertlinks process. Perhaps
>>>>>> that helps anybody?
