Re: Stalling during fetch (0.7)
Further details: if I run strace on the process, it shows this, over and over and over:

gettimeofday({1155249187, 52}, NULL) = 0
gettimeofday({1155249188, 389}, NULL) = 0
gettimeofday({1155249188, 679}, NULL) = 0
gettimeofday({1155249188, 955}, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1155249188, 1235000}) = 0
futex(0xb1f0185c, FUTEX_WAIT, 7163, {0, 99972}) = -1 ETIMEDOUT (Connection timed out)
futex(0x805d250, FUTEX_WAKE, 1) = 0
futex(0x805c378, FUTEX_WAIT, 2, NULL) = 0
futex(0x805c378, FUTEX_WAKE, 1) = 0

I'm afraid I don't know how to go about finding what part of the code might be causing this... Any ideas?

Ben

On 8/10/06, Benjamin Higgins <[EMAIL PROTECTED]> wrote: Hello, Nutch is stalling in the fetch process. I've run it twice now, and it stops on the *same* URL both times. I don't get what's going on! The last status report was:

060810 145315 status: segment 20060810142649, 7900 pages, 14 errors, 98421231 bytes, 1571224 ms
060810 145315 status: 5.0279274 pages/s, 489.3738 kb/s, 12458.384 bytes/page

Then, exactly 94 documents later, with no errors in between, it just stops, on what appears to be a perfectly normal URL and a perfectly normal page. I don't get it. How can I debug this further to see what's going on? I'm really frustrated since I don't know where to start looking. Nutch is still running and taking up a lot of CPU. I don't want to kill it unless it's really stuck. How can I tell? Ben
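One way to see what the JVM is actually doing behind that futex loop is to ask it for a thread dump. A sketch, assuming you substitute the fetcher process's real PID (e.g. from ps -ef | grep java); the placeholder value below is intentionally invalid:

```shell
# Placeholder PID; replace with the stalled fetcher JVM's actual pid.
PID=${PID:-999999999}
# SIGQUIT makes a JVM print a full thread dump to its stdout/log without
# killing it; the dump shows each thread's stack, i.e. where it is stuck.
kill -QUIT "$PID" 2>/dev/null || echo "no process with pid $PID"
```

Taking two or three dumps a minute apart and comparing them usually shows whether a fetcher thread is genuinely stuck (identical stack every time) or just slow; on JDK 5+, jstack <pid> prints the same information.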
Re: More Fetcher NullPointerException
I had the same problem before. Just read http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg04303.html Make that tiny change on line 385 of HttpBase.java and it will work fine.

Raphael

Sellek, Greg wrote: I am experiencing the same issue as a similar post from 8/6. Whenever I try to fetch pages, I see a lot of "fetch of xxx failed with: java.lang.NullPointerException". I have put the appropriate agent info in both the nutch-default and nutch-site config files. I tried using DEBUG logging to get more info, but this error is the extent of what I see. It seems to happen on about 95% of the URLs I am trying to crawl. BTW, this happens with both the 0.8 build and the latest nightly build. TIA for any advice as to what I am doing wrong. Greg
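For context, a sketch of the kind of null guard involved. This is only an illustration; the actual one-line HttpBase.java change is in the linked message, and the method name and fallback value here are hypothetical:

```java
// Illustrative only: a null guard of the kind the linked patch applies
// around line 385 of HttpBase.java. Without one, a missing or unresolved
// http.agent.name setting surfaces on every fetch as
// "fetch of xxx failed with: java.lang.NullPointerException".
public class AgentGuard {
    static String resolveAgent(String configuredAgent) {
        if (configuredAgent == null || configuredAgent.trim().length() == 0) {
            return "NutchCVS"; // hypothetical fallback value
        }
        return configuredAgent.trim();
    }

    public static void main(String[] args) {
        System.out.println(resolveAgent(null));
        System.out.println(resolveAgent(" MyBot "));
    }
}
```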
Stalling during fetch (0.7)
Hello, Nutch is stalling in the fetch process. I've run it twice now, and it stops on the *same* URL both times. I don't get what's going on! The last status report was:

060810 145315 status: segment 20060810142649, 7900 pages, 14 errors, 98421231 bytes, 1571224 ms
060810 145315 status: 5.0279274 pages/s, 489.3738 kb/s, 12458.384 bytes/page

Then, exactly 94 documents later, with no errors in between, it just stops, on what appears to be a perfectly normal URL and a perfectly normal page. I don't get it. How can I debug this further to see what's going on? I'm really frustrated since I don't know where to start looking. Nutch is still running and taking up a lot of CPU. I don't want to kill it unless it's really stuck. How can I tell?

Ben
crawl-urlfilter subpages of domains
Hello, is it possible to crawl e.g. http://www.domain.com but skip all URLs matching http://www.domain.com/subpage/? I tried to achieve this with crawl-urlfilter.txt/regex-urlfilter.txt, but it doesn't work:

-ftp.tu-clausthal.de
-^http://([a-z0-9]*\.)asta.tu-clausthal.de/de/mobil/
+^http://([a-z0-9]*\.)asta.tu-clausthal.de
+^http://([a-z0-9]*\.)*tu-clausthal.de/
# skip everything else
-.

Skipping ftp.tu-clausthal.de works perfectly, but http://www.asta.tu-clausthal.de/de/mobil/ is still indexed, which takes a long time to crawl.

regards, Jens Martin Schubert
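For reference, the regex url-filter rules are applied top-down and the first matching pattern wins, so the exclusion must precede the broader includes. A sketch against the hostnames in question, assuming the subtree should be skipped for hosts with and without a www. prefix (note the `*` after the group and the escaped dots):

```
# First match wins: exclude the mobil subtree before allowing the domain.
-^http://([a-z0-9]*\.)*asta\.tu-clausthal\.de/de/mobil/
# Then allow the rest of the domain:
+^http://([a-z0-9]*\.)*tu-clausthal\.de/
# skip everything else
-.
```

Also worth checking which file the running command actually reads: the one-shot crawl command uses crawl-urlfilter.txt, while the step-by-step generate/fetch tools use regex-urlfilter.txt, so editing the wrong one silently has no effect.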
Nutch vs. Google Appliance
Hello all - I have been taking a look at Nutch for purposes of indexing a large pile of internal LAN files at our company, and so far it looks quite impressive. I believe it could substitute for the Google Mini appliance. However, the bigger Google boxes add more features that I am not sure can be duplicated in Nutch. Specifically I am interested in the indexing and searching for secured files. Apparently Google will index all files, including those that are secure (given appropriate authority) - but will only show search results based on the security and credentials of the searcher. In other words, if you don't have access to a document, Google won't show you that it even exists. Can something like that be done in Nutch? Are there other differences between Nutch and Google?
common-terms.utf8
Hi, Could anyone explain to me what exactly the common-terms.utf8 file does? I don't understand the real functionality of this file... Regards,

-- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]
file access rights/permissions considerations - the least painful way
I'm interested in crawling multiple shared folders (among other things) on a corporate LAN. It is a LAN of MS clients with Active Directory managed accounts. The users routinely access the files based on NTFS-level (and sharing?) permissions.

Ideally, I'd like to set up a central server (probably Linux, but any *n*x would do) where I'd mount all the shared folders. I'd then set up Apache so that the files are accessible via HTTP and, more importantly, WebDAV. I imagine Apache could use mod_dav, mod_auth and possibly one or two other modules to regulate access privileges - I could very well be completely wrong here. Finally, I'd like to set up Nutch to crawl the shared documents through the web server, so that the stored links are valid in the whole LAN. Nutch would therefore require absolute access to all documents, but the documents would be served via a web server that checks user identities and access rights.

Nutch users who've tackled the access rights problem themselves would save me a world of time, effort and trouble with a couple of pointers on how to go about the whole security issue. If the setup I described is the worst possible way to go about it, I'd appreciate a notice saying so and elaborating why. :) TIA, t.n.a.
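A minimal sketch of the Apache side described above, assuming mod_dav and file-based Basic auth (the paths, realm name, and htpasswd file are illustrative; a real Active Directory setup would swap in an LDAP auth module):

```apache
# Hypothetical Apache 2.x fragment: expose the mounted shares over
# HTTP/WebDAV, with every request authenticated.
Alias /shares /mnt/lan-shares
<Location /shares>
    Dav On
    AuthType Basic
    AuthName "LAN documents"
    AuthUserFile /etc/apache2/htpasswd.shares
    Require valid-user
</Location>
```

One design caveat: this handles serving, and the crawler would need its own account with blanket read access (check whether your Nutch version's HTTP plugin can send Basic credentials), but per-user filtering of the *search results* is a separate problem that Nutch does not solve out of the box.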
More Fetcher NullPointerException
I am experiencing the same issue as a similar post from 8/6. Whenever I try to fetch pages, I see a lot of "fetch of xxx failed with: java.lang.NullPointerException". I have put the appropriate agent info in both the nutch-default and nutch-site config files. I tried using DEBUG logging to get more info, but this error is the extent of what I see. It seems to happen on about 95% of the URLs I am trying to crawl. BTW, this happens with both the 0.8 build and the latest nightly build. TIA for any advice as to what I am doing wrong. Greg
Re: number of mapper
Take a look at this: http://wiki.apache.org/lucene-hadoop/HowManyMapsAndReduces It will explain why you get a few more map tasks than are set in the configuration.

Dennis

Murat Ali Bayir wrote: My configs are given below: in hadoop-site, number of mappers = 130; in my code I use job.setNumMapTasks = 130; in hadoop-default, number of mappers = 2. With this configuration I get 135 mappers in my job. However, there is no problem with the number of reducers.

Andrzej Bialecki wrote: Murat Ali Bayir wrote: Hi everybody, although I change the number of mappers in hadoop-site.xml and use the job.setNumMapTasks method, the system gives another number as the number of mappers. The problem only occurs for the number of mappers; the number of reducers works correctly. What do I have to do to set the number of mappers in the system?

Any value that you put in hadoop-site.xml will always override any other config settings, even those set programmatically in job.setNumMapTasks. You should remove these settings from hadoop-site, and put them into mapred-default.xml.
Re: number of mapper
My configs are given below: in hadoop-site, number of mappers = 130; in my code I use job.setNumMapTasks = 130; in hadoop-default, number of mappers = 2. With this configuration I get 135 mappers in my job. However, there is no problem with the number of reducers.

Andrzej Bialecki wrote: Murat Ali Bayir wrote: Hi everybody, although I change the number of mappers in hadoop-site.xml and use the job.setNumMapTasks method, the system gives another number as the number of mappers. The problem only occurs for the number of mappers; the number of reducers works correctly. What do I have to do to set the number of mappers in the system?

Any value that you put in hadoop-site.xml will always override any other config settings, even those set programmatically in job.setNumMapTasks. You should remove these settings from hadoop-site, and put them into mapred-default.xml.
Re: number of mapper
Murat Ali Bayir wrote: Hi everybody, although I change the number of mappers in hadoop-site.xml and use the job.setNumMapTasks method, the system gives another number as the number of mappers. The problem only occurs for the number of mappers; the number of reducers works correctly. What do I have to do to set the number of mappers in the system?

Any value that you put in hadoop-site.xml will always override any other config settings, even those set programmatically in job.setNumMapTasks. You should remove these settings from hadoop-site, and put them into mapred-default.xml.

-- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
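Andrzej's advice in config form: a sketch of the mapred-default.xml entry (the property name is the map-count key of Hadoop of that era), where it serves as a default that job.setNumMapTasks can still override, instead of sitting in hadoop-site.xml where it clobbers everything:

```xml
<!-- Sketch: put the desired default map count in mapred-default.xml,
     not hadoop-site.xml, so programmatic settings are not overridden. -->
<property>
  <name>mapred.map.tasks</name>
  <value>130</value>
</property>
```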
Re: number of mapper
It cannot be the problem; it only restricts the number of tasks running simultaneously, and there can be pending tasks as well. I checked that this is not the problem. I am not sure, but I notice that the number of map tasks equals k * the number of different parts in the input path. To illustrate: I have 15 parts in my input path and set the number of mappers to 130 in hadoop-site.xml, yet when I run the job I get 135 mappers, which is 9 times the number of input parts.

Dennis Kubes wrote: There is also a mapred.tasktracker.tasks.maximum variable which may be causing the task number to be different. Dennis

Murat Ali Bayir wrote: Hi everybody, although I change the number of mappers in hadoop-site.xml and use the job.setNumMapTasks method, the system gives another number as the number of mappers. The problem only occurs for the number of mappers; the number of reducers works correctly. What do I have to do to set the number of mappers in the system?
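The k * parts observation matches how splits are computed: each input file is divided into roughly the same number of splits, so the requested map count is effectively rounded up to a multiple of the file count. A toy model of that rounding (it assumes every file is large enough to split; the real logic also considers block sizes):

```python
import math

def actual_map_tasks(requested_maps, num_input_files):
    # Each file gets ceil(requested / files) splits, so the total is
    # rounded up to a multiple of the number of input files.
    splits_per_file = math.ceil(requested_maps / num_input_files)
    return num_input_files * splits_per_file

# 15 input parts, 130 requested mappers -> 9 splits per part, 135 maps
print(actual_map_tasks(130, 15))  # → 135
```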
Re: problems with start-all command
The name node is running. Run the bin/stop-all.sh script first and then do a ps -ef | grep NameNode to see if the process is still running. If it is, it may need to be killed by hand with kill -9 <processid>. The second problem is the setup of ssh keys, as described in a previous email. Also, I would recommend NOT running the namenode as root, but having a specific user set up to run the various servers, as described in the tutorial.

Dennis

kawther khazri wrote: Hello, we are trying to install Nutch on a single machine using this guide: http://wiki.apache.org/nutch/NutchHadoopTutorial?highlight=%28nutch%29 We are stuck at this step.

First we execute this command as root:

[EMAIL PROTECTED] search]# bin/start-all.sh
namenode running as process 16323. Stop it first.
[EMAIL PROTECTED]'s password:
localhost: starting datanode, logging to /nutch/search/logs/hadoop-root-datanode-localhost.localdomain.out
starting jobtracker, logging to /nutch/search/logs/hadoop-root-jobtracker-localhost.localdomain.out
[EMAIL PROTECTED]'s password:
localhost: tasktracker running as process 16448. Stop it first.

Second, we execute it in a normal user's session (nutch):

[EMAIL PROTECTED] search]$ bin/start-all.sh
starting namenode, logging to /nutch/search/logs/hadoop-nutch-namenode-localhost.localdomain.out
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is 9e:56:da:f3:72:dc:1a:91:5d:78:89:ce:89:04:3d:d3.
Are you sure you want to continue connecting (yes/no)? yes
localhost: Failed to add the host to the list of known hosts (/nutch/home/.ssh/known_hosts).
Enter passphrase for key '/nutch/home/.ssh/id_rsa':
[EMAIL PROTECTED]'s password:
localhost: starting datanode, logging to /nutch/search/logs/hadoop-nutch-datanode-localhost.localdomain.out
starting jobtracker, logging to /nutch/search/logs/hadoop-nutch-jobtracker-localhost.localdomain.out
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is 9e:56:da:f3:72:dc:1a:91:5d:78:89:ce:89:04:3d:d3.
Are you sure you want to continue connecting (yes/no)? yes
localhost: Failed to add the host to the list of known hosts (/nutch/home/.ssh/known_hosts).
Enter passphrase for key '/nutch/home/.ssh/id_rsa':
[EMAIL PROTECTED]'s password:
localhost: starting tasktracker, logging to /nutch/search/logs/hadoop-nutch-tasktracker-localhost.localdomain.out

What is the difference between the two? What is the meaning of this message: "namenode running as process 16323. Stop it first."? Is it normal to get this? I don't know the cause of this error. Please, if you have any idea, help me. Best regards,
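On the ssh key point: the passphrase and password prompts appear because start-all.sh opens an ssh connection to localhost for every daemon it starts, and the nutch user has no authorized passphrase-less key. A sketch of the usual fix, using the default key paths (run as the nutch user):

```shell
mkdir -p "$HOME/.ssh"
# Create a key only if none exists yet; -N "" means no passphrase,
# so the start scripts can log in without prompting.
[ -f "$HOME/.ssh/id_rsa" ] || ssh-keygen -q -t rsa -N "" -f "$HOME/.ssh/id_rsa"
# Authorize the key for logins to this same account (localhost ssh):
cat "$HOME/.ssh/id_rsa.pub" >> "$HOME/.ssh/authorized_keys"
chmod 700 "$HOME/.ssh"
chmod 600 "$HOME/.ssh/authorized_keys"
```

The "Failed to add the host to the list of known hosts (/nutch/home/.ssh/known_hosts)" lines also suggest the nutch user cannot write to its own .ssh directory at all (likely because the first run happened as root), so check the ownership of /nutch/home/.ssh first.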
Re: number of mapper
There is also a mapred.tasktracker.tasks.maximum variable which may be causing the task number to be different. Dennis

Murat Ali Bayir wrote: Hi everybody, although I change the number of mappers in hadoop-site.xml and use the job.setNumMapTasks method, the system gives another number as the number of mappers. The problem only occurs for the number of mappers; the number of reducers works correctly. What do I have to do to set the number of mappers in the system?
Index with synonyms
Hey list, I would like to ask if it is possible to start a search query with a simple word (e.g. "Home"). Nutch would then look up the word "Home" in a list of synonyms and recognize that "House" is a synonym for "Home". Nutch could then run the query with both "House" and "Home" and show both sets of results. Is that possible? Regards
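Nutch has no built-in synonym support, so this means either a custom analyzer or expanding the query before it reaches Nutch. A minimal sketch of the query-side variant, with a hand-made synonym table (the table and function name are illustrative):

```python
# Tiny hand-made synonym table; a real deployment would load a curated
# list or a thesaurus like WordNet instead.
SYNONYMS = {
    "home": ["house"],
    "house": ["home"],
}

def expand_query(term):
    # OR the term with its synonyms so pages matching either are returned.
    terms = [term] + SYNONYMS.get(term.lower(), [])
    return " OR ".join(terms)

print(expand_query("Home"))  # → Home OR house
```

The alternative, index-time expansion, injects synonyms into the index itself via a custom analyzer plugin; it makes queries simpler but inflates the index and requires re-crawling when the synonym list changes.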
number of mapper
Hi everybody, although I change the number of mappers in hadoop-site.xml and use the job.setNumMapTasks method, the system gives another number as the number of mappers. The problem only occurs for the number of mappers; the number of reducers works correctly. What do I have to do to set the number of mappers in the system?
Extended crawling configuration with "mapred.input.value.class"?
Hi, I am interested in a more comprehensive configuration of the crawl targets. The current version only supports lists (files) containing URLs. One thing that would be desirable is the injection of URLs with metadata attached. This metadata (inserted into the CrawlDatum object) could be read by plugins in later steps of the indexing process and used as hints for processing decisions. This would be similar to the use of metadata for carrying the score from one stage to the next, or even to the outlinks in the next cycle.

Now my question: can I use an XMLWritable (this is my new configuration class) instead of UTF8 by setting the Hadoop config entry mapred.input.value.class to XMLWritable? Is this Hadoop setting only used for URL injection, or would my change of the setting harm other components that also use the class configured at this point? Cheers, Timo.
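Whether other jobs read that key is exactly the open question, so one way to limit the blast radius while experimenting is to scope the override to the inject job's own JobConf rather than a site-wide file. A sketch of the property (the XMLWritable class name is Timo's own, shown with a placeholder package):

```xml
<!-- Hypothetical job-local override: set this on the inject job's
     JobConf rather than in hadoop-site.xml, so other jobs that rely
     on the default UTF8 value class are left untouched. -->
<property>
  <name>mapred.input.value.class</name>
  <value>org.example.XMLWritable</value>
</property>
```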
problem with the DFS commande
Hello, when I execute the DFS command, I get this:

[EMAIL PROTECTED] search]$ bin/start-all.sh
starting namenode, logging to /nutch/search/logs/hadoop-nutch-namenode-localhost.out
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is 81:0e:49:ce:61:8c:7b:09:1f:dc:5d:2c:64:f1:68:d6.
Are you sure you want to continue connecting (yes/no)? yes
localhost: Failed to add the host to the list of known hosts (/nutch/home/.ssh/known_hosts).
Enter passphrase for key '/nutch/home/.ssh/id_rsa':
[EMAIL PROTECTED]'s password:
localhost: starting datanode, logging to /nutch/search/logs/hadoop-nutch-datanode-localhost.out
starting jobtracker, logging to /nutch/search/logs/hadoop-nutch-jobtracker-localhost.out
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is 81:0e:49:ce:61:8c:7b:09:1f:dc:5d:2c:64:f1:68:d6.
Are you sure you want to continue connecting (yes/no)? yes
localhost: Failed to add the host to the list of known hosts (/nutch/home/.ssh/known_hosts).
Enter passphrase for key '/nutch/home/.ssh/id_rsa':
[EMAIL PROTECTED]'s password:
localhost: starting tasktracker, logging to /nutch/search/logs/hadoop-nutch-tasktracker-localhost.out
[EMAIL PROTECTED] search]$ bin/hadoop dfs -ls
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: /nutch/search/logs/hadoop.log (Permission denied)
    at java.io.FileOutputStream.openAppend(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:177)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:102)
    at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
    at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
    at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
    at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:132)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:96)
    at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:654)
    at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:612)
    at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:509)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:415)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:441)
    at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:468)
    at org.apache.log4j.LogManager.<clinit>(LogManager.java:122)
    at org.apache.log4j.Logger.getLogger(Logger.java:104)
    at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:229)
    at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:65)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:494)
    at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:529)
    at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:235)
    at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:370)
    at org.apache.hadoop.util.ToolBase.<clinit>(ToolBase.java:71)
log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].
java.lang.NullPointerException
    at java.net.Socket.<init>(Socket.java:358)
    at java.net.Socket.<init>(Socket.java:208)
    at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:113)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:359)
    at org.apache.hadoop.ipc.Client.call(Client.java:297)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:150)
    at org.apache.hadoop.dfs.$Proxy0.getListing(Unknown Source)
    at org.apache.hadoop.dfs.DFSClient.listPaths(DFSClient.java:332)
    at org.apache.hadoop.dfs.DistributedFileSystem.listPathsRaw(DistributedFileSystem.java:157)
    at org.apache.hadoop.fs.FileSystem.listPaths(FileSystem.java:509)
    at org.apache.hadoop.fs.FileSystem.listPaths(FileSystem.java:479)
    at org.apache.hadoop.dfs.DFSShell.ls(DFSShell.java:165)
    at org.apache.hadoop.dfs.DFSShell.run(DFSShell.java:329)
    at org.apache.hadoop.util.ToolBase.executeCommand(ToolBase.java:173)
    at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:182)
    at org.apache.hadoop.dfs.DFSShell.main(DFSShell.java:360)
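The FileNotFoundException (Permission denied) on hadoop.log suggests the logs directory was first created by another user (e.g. root), so the current user cannot append to it. A sketch of the check-and-fix, with the path and username taken from the tutorial setup (the helper function is illustrative, and the chown needs root):

```shell
# Hypothetical helper: re-own a log directory to the hadoop user if it
# exists, otherwise report what is missing.
fix_log_perms() {
    dir="$1"; user="$2"
    if [ -d "$dir" ]; then
        chown -R "$user:$user" "$dir" 2>/dev/null \
            || echo "need root for: chown -R $user:$user $dir"
    else
        echo "missing: $dir"
    fi
}
fix_log_perms /nutch/search/logs nutch
```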
Crawling flash
I want to include embedded Flash in my crawls. Despite (apparently successfully) including the parse-swf plugin, embedded Flash does not seem to be retrieved. I'm assuming that the object tags are not being parsed to find the .swf files. Can anyone comment? Thanks Iain
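If the problem is indeed that <object>/<embed> tags are not treated as outlinks, a custom parse filter would have to extract the .swf URLs itself so they enter the fetch list. A rough sketch of just the extraction step (the regex and names are illustrative, and no substitute for a real HTML parser):

```python
import re

# Match src=/value=/data= attributes pointing at .swf files, the usual
# attributes used by <embed>, <param>, and <object> respectively.
SWF_RE = re.compile(r'(?:src|value|data)\s*=\s*["\']([^"\']+\.swf)["\']', re.I)

def swf_links(html):
    """Return all .swf URLs referenced by object/embed-style attributes."""
    return SWF_RE.findall(html)

page = '<object data="/media/intro.swf"><embed src="/media/intro.swf"></object>'
print(swf_links(page))  # → ['/media/intro.swf', '/media/intro.swf']
```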
problems with start-all command
Hello, we are trying to install Nutch on a single machine using this guide: http://wiki.apache.org/nutch/NutchHadoopTutorial?highlight=%28nutch%29 We are stuck at this step.

First we execute this command as root:

[EMAIL PROTECTED] search]# bin/start-all.sh
namenode running as process 16323. Stop it first.
[EMAIL PROTECTED]'s password:
localhost: starting datanode, logging to /nutch/search/logs/hadoop-root-datanode-localhost.localdomain.out
starting jobtracker, logging to /nutch/search/logs/hadoop-root-jobtracker-localhost.localdomain.out
[EMAIL PROTECTED]'s password:
localhost: tasktracker running as process 16448. Stop it first.

Second, we execute it in a normal user's session (nutch):

[EMAIL PROTECTED] search]$ bin/start-all.sh
starting namenode, logging to /nutch/search/logs/hadoop-nutch-namenode-localhost.localdomain.out
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is 9e:56:da:f3:72:dc:1a:91:5d:78:89:ce:89:04:3d:d3.
Are you sure you want to continue connecting (yes/no)? yes
localhost: Failed to add the host to the list of known hosts (/nutch/home/.ssh/known_hosts).
Enter passphrase for key '/nutch/home/.ssh/id_rsa':
[EMAIL PROTECTED]'s password:
localhost: starting datanode, logging to /nutch/search/logs/hadoop-nutch-datanode-localhost.localdomain.out
starting jobtracker, logging to /nutch/search/logs/hadoop-nutch-jobtracker-localhost.localdomain.out
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is 9e:56:da:f3:72:dc:1a:91:5d:78:89:ce:89:04:3d:d3.
Are you sure you want to continue connecting (yes/no)? yes
localhost: Failed to add the host to the list of known hosts (/nutch/home/.ssh/known_hosts).
Enter passphrase for key '/nutch/home/.ssh/id_rsa':
[EMAIL PROTECTED]'s password:
localhost: starting tasktracker, logging to /nutch/search/logs/hadoop-nutch-tasktracker-localhost.localdomain.out

What is the difference between the two? What is the meaning of this message: "namenode running as process 16323. Stop it first."? Is it normal to get this? I don't know the cause of this error. Please, if you have any idea, help me. Best regards,