Re: a question about job failed

2011-08-29 Thread lewis john mcgibbney
Hi Zhao, Do you have any more verbose log info from hadoop.log? I have never worked with Nutch 0.9, but if you could at least indicate whether you get something like LOG: info Dedup: starting ... blah blah blah. Taking this to a larger context, I am not particularly happy with the verboseness of

SSHD for Nutch 1.3 in Pseudo Distributed mode

2011-08-29 Thread webdev1977
Do I NEED SSHD for Nutch 1.3 in Pseudo Distributed mode? I am running on a Windows server using Cygwin (obviously :-) I cannot get hadoop/nutch to run in deploy mode and I am not sure if it has something to do with ssh or not. When I run start-all.sh it gives me some ssh usage errors and

Re: SSHD for Nutch 1.3 in Pseudo Distributed mode

2011-08-29 Thread lewis john mcgibbney
If it complains about SSH errors then I would ensure that you are logged into your SSH client e.g. ssh -v localhost, prior to executing any hadoop scripts. This would make sense. Further to this, unless you are actually experiencing Nutch related problems on a pseudo or cluster setup then

Re: Parameter tuning or how to accelerate fetching

2011-08-29 Thread lewis john mcgibbney
Hi Thomas, This seems a perfect situation for running Nutch jobs in a cluster Hadoop setup, if you have the resources. From the length of your crawl (2 weeks) and the recursive number of cycles, it is inherently hard for anyone, let alone yourself, to begin to provide accurate answers to this query.

Re: Injector hanging on Hadoop 0.20.6

2011-08-29 Thread Markus Jelsma
Can you try with debug logging and using only the inject command? What do you see in the tasktracker GUI? Hi list, Is/has anyone else experienced the injector hanging when running Nutch 1.4 in Hadoop 0.20.6 cluster setup? I am not getting any log info and need to kill the command as it

Re: Trying to understand and use URLmeta

2011-08-29 Thread John R. Brinkema
Lewis, After shaking off the annoyance of your RTFM Luke answer (I had read the tutorial several times), I listened to your suggestion (I do respect my elders ... especially my 'application-elders') and I spent the weekend reading code, scanning the javadocs files and adding logging

Re: Trying to complete index structure wiki page

2011-08-29 Thread Markus Jelsma
Hi, As the title suggests, I'm in the process of getting some comprehensive documentation sorted out for Nutch, this obviously starts at wiki level. I'm currently working on the IndexStructure page [1]. I would appreciate if some guys could have a quick look and correct where they see fit.

Re: SSHD for Nutch 1.3 in Pseudo Distributed mode

2011-08-29 Thread Markus Jelsma
It's a Hadoop question indeed. I'm also not sure if ssh is a requirement for a pseudo environment. But why not install it anyway? Having sshd doesn't hurt and it's always a convenience, I can't think of any machine without sshd ;) If it complains about SSH errors then I would ensure that you

Re: a question about job failed

2011-08-29 Thread Markus Jelsma
Thanks for the reminder as I believe this is an actual issue! I've got some indices that cannot be deduplicated from Nutch and die without giving a proper clue. I'll reproduce and report back on it. I know it's not a problem of not having the correct fields marked as STORED since that once

Re: Recursively searching through web dirs

2011-08-29 Thread Markus Jelsma
If the URLs are not linked from any point (e.g. do not have inlinks) you cannot discover them. The only workaround is to inject them manually. If they are linked from somewhere and that somewhere is linked to from any discoverable page from your injected URLs then Nutch should find it unless

Re: How to save html source to local drive

2011-08-29 Thread Markus Jelsma
Is writing to a local mount even possible in MapReduce? AFAIK it all ends up in HDFS. Hi, Can you explain how you tried to save raw HTML obtained during a crawl to a local drive? I am not entirely sure what you mean here and why you would want to do so given that we already have an array