Hello,
We are trying to install Nutch on a single machine using this guide:
http://wiki.apache.org/nutch/NutchHadoopTutorial?highlight=%28nutch%29
We are stuck at this step:
* First, we execute this command
I want to include embedded flash in my crawls.
Despite (apparently successfully) including the parse-swf plugin, embedded
flash does not seem to be retrieved. I'm assuming that the object tags are
not being parsed to find the .swf files.
Can anyone comment?
Thanks
Iain
Hello,
When I execute the DFS command, I get this:
[EMAIL PROTECTED] search]$ bin/start-all.sh
starting namenode, logging to
/nutch/search/logs/hadoop-nutch-namenode-localhost.out
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is
Hi,
I am interested in more comprehensive configuration of the crawl targets. The
current version only supports lists (files) containing URLs. One thing that
could be desirable is the injection of URLs with metadata attached. This
metadata (inserted into the CrawlData object) could be read by
Hi everybody. Although I changed the number of mappers in hadoop-site.xml
and used the job.setNumMapTasks method, the system uses a different
number of mappers. The problem only occurs for the number of mappers; the
number of reducers works correctly. What do I have to do to set the number
of
Hey list,
I would like to ask you if it is possible to start a search query with a
simple word (e.g. Home). Then Nutch will look up the word “Home” in a
list with synonyms. Nutch will then recognize that “House” is a synonym
for “Home”. Now, Nutch can start a search query with “House” and
There is also a mapred.tasktracker.tasks.maximum variable which may be
causing the task number to be different.
Dennis
Murat Ali Bayir wrote:
Hi everybody, Although I change the number of mappers in
hadoop-site.xml and use job.setNumMapTasks method the system gives
another number as a number
The name node is running. Run the bin/stop-all.sh script first and then
do a ps -ef | grep NameNode to see if the process is still running. If
it is, it may need to be killed by hand with kill -9 processid.
The second problem is the setup of ssh keys, as described in the previous
email. Also I would
That cannot be the problem; it only restricts the number of tasks running
simultaneously, and there can be pending tasks as well. I checked that this
is not the problem. I am not sure, but I noticed that the number of mapper
tasks is equal to k * the number of different parts in the input path. To
illustrate, I have
15 parts
My configs are given below:
in hadoop-site, number of mappers = 130
in my code, I use job.setNumMapTasks(130)
in hadoop-default, number of mappers = 2
With this configuration, my job gets 135 mappers. However, there
is no problem with the number of reducers.
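One possible explanation for the 130 versus 135 figures above, offered only as a rough sketch: if the requested map count is treated as a hint, and each input file is split separately into chunks of roughly total_size / requested_maps bytes, then every file rounds up to a whole number of splits. The file sizes below are made up, the function name count_map_tasks is hypothetical, and block-size limits and other details of the real split logic are ignored.

```python
import math

def count_map_tasks(file_sizes, requested_maps):
    """Estimate the number of map tasks when each file is split on its own.

    Assumption: the target split size is total input size divided by the
    requested number of maps, and a file is never merged with another file.
    """
    total = sum(file_sizes)
    split_size = max(1, total // requested_maps)
    # Each file contributes ceil(size / split_size) splits.
    return sum(math.ceil(size / split_size) for size in file_sizes)

# Hypothetical input: 15 part files of equal size, 130 maps requested.
parts = [1300] * 15
print(count_map_tasks(parts, 130))
```

Under these assumptions each of the 15 parts yields 9 splits, which would match the "k * number of parts" pattern described earlier in the thread.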
Andrzej Bialecki wrote:
Murat Ali Bayir
I am experiencing the same issue as a similar post from 8/6. Whenever I
try to fetch pages, I see a lot of "fetch of xxx failed with:
java.lang.NullPointerException" messages. I have put the appropriate agent info
in both the nutch-default and nutch-site config files. I tried using
DEBUG logging to get
I'm interested in crawling multiple shared folders (among other
things) on a corporate LAN.
It is a LAN of MS clients with Active Directory managed accounts.
The users routinely access the files based on NTFS-level (and
sharing?) permissions.
Ideally, I'd like to set up a central server
Hi,
Could anyone explain to me what exactly the common-terms.utf8 file does? I
don't understand the real purpose of this file...
Regards,
--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]
Hello,
Is it possible to crawl e.g. http://www.domain.com
but skip all URLs matching http://www.domain.com/subpage/ ?
I tried to achieve this with crawl-urlfilter.txt/regex-urlfilter.txt,
but it doesn't work:
-ftp.tu-clausthal.de
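
For readers with the same question, here is a small Python sketch of how ordered +/- regex rules of this kind are commonly evaluated. It assumes that rules are tried top to bottom, that the first matching pattern decides, and that a URL matching no rule is rejected. The rule list and the accepts function are hypothetical illustrations, not Nutch code and not the poster's actual filter file.

```python
import re

# Hypothetical rules in the style of regex-urlfilter.txt:
# the exclusion for the subpage must come before the broader inclusion.
RULES = [
    ("-", r"^http://www\.domain\.com/subpage/"),
    ("+", r"^http://www\.domain\.com/"),
]

def accepts(url, rules=RULES):
    """Return True if the first matching rule is an include ('+') rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: URLs that match no rule are dropped.
    return False

for url in ("http://www.domain.com/index.html",
            "http://www.domain.com/subpage/a.html"):
    print(url, accepts(url))
```

If the two rules were listed in the opposite order, the broader include pattern would match first and the subpage URLs would be accepted, which is one common reason such an exclusion appears not to work.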
Further details:
If I run strace on the process, it looks like this, over and over and over:
gettimeofday({1155249187, 52}, NULL) = 0
gettimeofday({1155249188, 389}, NULL) = 0
gettimeofday({1155249188, 679}, NULL) = 0
gettimeofday({1155249188, 955}, NULL) = 0