Hi Zhao,
Do you have any more verbose log info from hadoop.log? I have never worked
with Nutch 0.9 but if you could at least indicate whether you get something
like
LOG: info Dedup: starting ... blah blah blah
Taking this to a larger context I am not particularly happy with the
verboseness of
Do I NEED SSHD for Nutch 1.3 in Pseudo Distributed mode?
I am running on a Windows server using Cygwin (obviously :-)
I cannot get Hadoop/Nutch to run in deploy mode and I am not sure if it has
something to do with ssh or not. When I run start-all.sh it gives me some
ssh usage errors and
If it complains about SSH errors then I would ensure that you are logged
into your SSH client e.g. ssh -v localhost, prior to executing any hadoop
scripts. This would make sense.
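A minimal sketch of that check, assuming a POSIX shell (Cygwin included) and an OpenSSH client. The function name can_ssh and the SSH_BIN override are hypothetical, introduced only for illustration; BatchMode and ConnectTimeout are standard OpenSSH client options.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop or Nutch): returns 0 when a
# non-interactive SSH login to the given host succeeds.
# BatchMode=yes disables password prompts, so the check fails instead of
# hanging when key-based login is not set up.
# SSH_BIN is an assumed override so another client binary can be substituted.
can_ssh() {
    host="${1:-localhost}"
    "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$host" true
}
```

Running `can_ssh localhost && echo "ssh ok"` before start-all.sh gives a quick yes/no; if it fails, `ssh -v localhost` shows where the login stops.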
Further to this, unless you are actually experiencing Nutch related problems
on a pseudo or cluster setup then
Hi Thomas,
This seems a perfect situation for running Nutch jobs in a cluster Hadoop
setup, if you have the resources. From the length of your crawl (2 weeks)
and the recursive number of cycles, it is inherently hard for anyone, let
alone yourself, to begin to provide accurate answers to this query.
Can you try with debug logging and using only the inject command? What do
you see in the TaskTracker GUI?
Hi list,
Is/has anyone else experienced the injector hanging when running Nutch 1.4
in a Hadoop 0.20.6 cluster setup? I am not getting any log info and need to
kill the command as it
Lewis,
After shaking off the annoyance of your RTFM Luke answer (I had read
the tutorial several times), I listened to your suggestion (I do respect
my elders ... especially my 'application-elders') and I spent the
weekend reading code, scanning the javadocs files and adding logging
Hi,
As the title suggests, I'm in the process of getting some comprehensive
documentation sorted out for Nutch, this obviously starts at wiki level.
I'm currently working on the IndexStructure page [1]. I would appreciate
if some guys could have a quick look and correct where they see fit.
It's a Hadoop question indeed. I'm also not sure if ssh is a requirement for a
pseudo environment. But why not install it anyway? Having sshd doesn't hurt and
it's always a convenience, I can't think of any machine without sshd ;)
If it complains about SSH errors then I would ensure that you
Thanks for the reminder as I believe this is an actual issue! I've got some
indices that cannot be deduplicated from Nutch and die without giving a proper
clue.
I'll reproduce and report back on it. I know it's not a problem of not having
the correct fields marked as STORED since that once
If the URLs are not linked from any point (i.e. do not have inlinks) you
cannot discover them. The only workaround is to inject them manually.
If they are linked from somewhere and that somewhere is linked to from any
discoverable page from your injected URLs then Nutch should find them unless
Is writing to a local mount even possible in MapReduce? AFAIK it all ends up
in HDFS.
Hi
Can you explain how you tried to save raw html obtained during a crawl to a
local drive? I am not entirely sure what you mean here and why you would
want to do so given that we already have an array