[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12453820 ]

Andrzej Bialecki commented on NUTCH-339:
----------------------------------------

This looks weird; if anything, it rather seems caused by a bug in Hadoop. Are you able to run readseg -dump on this fetchlist? Another idea: do you have any "lease expired" messages in your log around that time? It looks like maybe the underlying input stream has been closed.

Refactor nutch to allow fetcher improvements
--------------------------------------------

Key: NUTCH-339
URL: http://issues.apache.org/jira/browse/NUTCH-339
Project: Nutch
Issue Type: Task
Components: fetcher
Affects Versions: 0.8
Environment: n/a
Reporter: Sami Siren
Assigned To: Andrzej Bialecki
Fix For: 0.9.0
Attachments: patch.txt, patch2.txt, patch3.txt, patch4-fixed.txt, patch4-trunk.txt

As I (and Stefan?) see it, there are two major areas where the current fetcher could be improved (as in speed):

1. The politeness code and how it is implemented is the biggest problem of the current fetcher (together with robots.txt handling). Simple code changes, like replacing it with a PriorityQueue-based solution, showed very promising results in increased IO.

2. Changing the fetcher to use non-blocking IO (this requires a great amount of work, as we need to implement the protocols from scratch again).

I would like to start working towards #1 by first refactoring the current code (the plugins, actually) in the following way:

1. Move robots.txt handling away from the (lib-http) plugin. Even though this is related only to http, leaving it in lib-http does not allow other kinds of scheduling strategies to be implemented (it is hardcoded to fetch robots.txt from the same thread when requesting a page from a site from which it hasn't yet tried to load robots.txt).

2. Move the politeness code away from the (lib-http) plugin. It is really usable outside http, and the current design also limits changing the implementation (to a queue-based one).

Where to move these? Well, my suggestion is the nutch core. Does anybody see problems with this?
These code refactoring activities are to be done in a way that none of the current functionality is (at least deliberately) changed, leaving current functionality as is and thus leaving room to build the next generation of fetcher(s) without destroying the old one at the same time.

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira
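The PriorityQueue-based politeness idea from point 1 can be sketched as follows. This is a minimal illustration, not Nutch's actual implementation; all names here (PolitenessQueueSketch, HostSlot, CRAWL_DELAY_MS) are hypothetical, and the fixed per-host delay is an assumption.

```java
import java.util.PriorityQueue;

// Minimal sketch of a PriorityQueue-based politeness scheduler: one entry
// per host, ordered by the earliest time that host may be fetched again.
public class PolitenessQueueSketch {
    static final long CRAWL_DELAY_MS = 5000; // per-host politeness delay (assumed)

    static class HostSlot implements Comparable<HostSlot> {
        final String host;
        long nextFetchTime;
        HostSlot(String host, long nextFetchTime) {
            this.host = host;
            this.nextFetchTime = nextFetchTime;
        }
        public int compareTo(HostSlot other) {
            return Long.compare(nextFetchTime, other.nextFetchTime);
        }
    }

    private final PriorityQueue<HostSlot> queue = new PriorityQueue<>();

    public void addHost(String host, long now) {
        queue.add(new HostSlot(host, now));
    }

    // Returns the next host that may be fetched politely, or null if every
    // host is still inside its delay window. A fetcher thread that gets
    // null can sleep instead of holding a per-host lock across the delay.
    public String nextHost(long now) {
        HostSlot slot = queue.peek();
        if (slot == null || slot.nextFetchTime > now) {
            return null;
        }
        queue.poll();
        slot.nextFetchTime = now + CRAWL_DELAY_MS;
        queue.add(slot); // re-queue with the updated deadline
        return slot.host;
    }
}
```

Because the earliest-eligible host is always at the head of the queue, a single peek decides whether anything at all can be fetched, which is what makes this attractive compared to scanning per-host state under a lock.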
[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12453919 ]

Sean Dean commented on NUTCH-233:
---------------------------------

Could I suggest that this change, from

.*(/.+?)/.*?\1/.*?\1/

to

.*(/[^/]+)/[^/]+\1/[^/]+\1/

be committed to at least trunk for the time being. I recently created a segment with exactly 1M urls; I ran the fetch and it did indeed stall on the reduce part of the operation due to the regex filter. This was verified with a thread dump (kill -3 pid) on FreeBSD. I then made the suggested change in the config file and re-fetched the exact same segment. It completed without issue.

I'm aware we might be losing some filtering functionality with this new expression, but is that not better than knowing there is always the chance your whole-web crawl fetch will fail because of this?

wrong regular expression hang reduce process for ever
-----------------------------------------------------

Key: NUTCH-233
URL: http://issues.apache.org/jira/browse/NUTCH-233
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Blocker
Fix For: 0.9.0

It looks like the expression .*(/.+?)/.*?\1/.*?\1/ in regex-urlfilter.txt isn't compatible with java.util.regex, which is what the regex url filter actually uses. Maybe it was missed when the regular expression package was changed. The problem was that while reducing a fetch map output, the reducer hung forever, because the output format was applying the url filter to a url that causes the hang:

060315 230823 task_r_3n4zga at java.lang.Character.codePointAt(Character.java:2335)
060315 230823 task_r_3n4zga at java.util.regex.Pattern$Dot.match(Pattern.java:4092)
060315 230823 task_r_3n4zga at java.util.regex.Pattern$Curly.match1(Pattern.java:

I changed the regular expression to .*(/[^/]+)/[^/]+\1/[^/]+\1/ and now the fetch job works. (Thanks to Grant and Chris B. for helping to find the new regex.) However, many people should review it and suggest improvements. The old regex would match abcd/foo/bar/foo/bar/foo/ and the new one will match it as well.
But the old regex would also match abcd/foo/bar/xyz/foo/bar/foo/, which the new regex will not match.
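The matching behaviour described above can be checked directly with java.util.regex. A small sketch (class name is mine; the two patterns are the ones from the issue):

```java
import java.util.regex.Pattern;

// Compares the old and new url filter expressions from NUTCH-233 on the
// two example urls discussed above.
public class RegexFilterCheck {
    static final Pattern OLD = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
    static final Pattern NEW = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    public static void main(String[] args) {
        String direct = "abcd/foo/bar/foo/bar/foo/";
        String spread = "abcd/foo/bar/xyz/foo/bar/foo/";

        // Both expressions catch a directly repeating segment pattern...
        System.out.println(OLD.matcher(direct).matches()); // true
        System.out.println(NEW.matcher(direct).matches()); // true

        // ...but only the old one catches repeats separated by other segments,
        // because its group (/.+?) may span several path segments.
        System.out.println(OLD.matcher(spread).matches()); // true
        System.out.println(NEW.matcher(spread).matches()); // false
    }
}
```

The lost case is the price of the fix: the nested reluctant quantifiers in the old pattern are presumably what the Pattern$Curly backtracking in the stack trace above comes from, while the new pattern's [^/]+ segments cannot backtrack across slashes.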
updating index without refetching
Hi All,

Is it possible to update the index without refetching everything? I have changed the logic of one of my plugins (which also sets a custom field in the index) and I would like this field to get updated without refetching everything - is it doable?

Thanks,
RE: updating index without refetching
Hi,

You do not mention whether the new field's data is stored as metadata. Is the value created during the parse phase, or is it added only during the index phase?

If your new field is created during the parse process, then you could delete only the parse folders (i.e. delete segment/crawl_parse, segment/parse_data, segment/parse_text) and run bin/nutch parse segment.

Or, if your field data is added during the index process, then just re-create your index. In any case it doesn't seem to me you would need to re-fetch.

HTH
Gal

-----Original Message-----
From: DS jha [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 28, 2006 4:11 PM
To: nutch-dev@lucene.apache.org
Subject: updating index without refetching

Hi All,

Is it possible to update the index without refetching everything? I have changed the logic of one of my plugins (which also sets a custom field in the index) and I would like this field to get updated without refetching everything - is it doable?

Thanks,
[jira] Commented: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable
[ http://issues.apache.org/jira/browse/NUTCH-407?page=comments#action_12453932 ]

Alan Tanaman commented on NUTCH-407:
------------------------------------

In our team we feel that this patch would have been beneficial in practical terms. In the context of the enterprise intelligence solution which we are gradually porting over to Nutch, the emphasis is on ease of configuration. We try to avoid exposing features such as the regex filter, which, although very powerful for a more experienced user, is perhaps confusing to the novice. This is because we are primarily focused on the enterprise and less on the WWW. This is why we preconfigure the db.ignore.external.links property to true, so that only the urls file is used to seed the crawl.

Our ideal is to have a collection of predefined configuration settings for specific scenarios - e.g. Enterprise-XML, Enterprise-Documents, Enterprise-Database, Internet-News, etc. We have a script that generates multiple crawlers, each one with different sources to be crawled, and although possible, it isn't very practical to change the filters for each one manually based on individual user requirements.

I realise this patch is closed, but how about another approach, in which FileResponse.java looks at db.ignore.external.links and decides based on this whether to go up the tree? Obviously, this would also prevent you from crawling outlinks to the WWW embedded in documents, but when crawling an enterprise file system you usually don't want to go all over the place anyway. As I see it, file systems differ from the web in that they are inherently hierarchical, whereas the web is, as its name implies, non-hierarchical. Therefore, when crawling a file system, going up the tree is just as much an external URI (so to speak) as a link to a web site.
*Ducks for cover*

Alan

Make Nutch crawling parent directories for file protocol configurable
---------------------------------------------------------------------

Key: NUTCH-407
URL: http://issues.apache.org/jira/browse/NUTCH-407
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.8
Reporter: Thorsten Scherler
Assigned To: Andrzej Bialecki
Attachments: 407.fix.diff

http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06698.html

I am looking into fixing some very weird behavior of the file protocol. I am using 0.8. Researching this topic I found http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06536.html and http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch

I am on Ubuntu, but I have the same problem: nutch is going down the tree (including parents) and not up (including children from the root url). Further, I would vote to make fetching parents optional, with a property defining whether I want this not-very-intuitive feature.
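For reference, the db.ignore.external.links preconfiguration Alan describes is an ordinary override in nutch-site.xml. A minimal sketch (the description text is mine):

```xml
<!-- nutch-site.xml: restrict the crawl to the hosts seeded in the urls file -->
<configuration>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <description>If true, outlinks that lead to external hosts are ignored,
    so only the seeded hosts are crawled.</description>
  </property>
</configuration>
```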
[jira] Commented: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable
[ http://issues.apache.org/jira/browse/NUTCH-407?page=comments#action_12453934 ]

Chris A. Mattmann commented on NUTCH-407:
-----------------------------------------

I'm not entirely sure what the right answer to this is. One thing that I do know is that a colleague at my own work ran into this exact same issue while first attempting to use Nutch in his enterprise search application. It confused the heck out of him, and he ended up including in the urlfilter-regex what Andrzej mentions above, i.e., only crawl from the top level down. He mentioned to me that he thought this was a kludge, and I can't say that I disagreed with him. My +1 for figuring out a better way to solve this problem...
Re: updating index without refetching
The new field's data is also stored as metadata - the value is assigned during the parse process, and then during indexing it reads the metadata field value and adds it to the index. Looks like I will have to run parse and index again.

Thanks much.

On 11/28/06, Gal Nitzan [EMAIL PROTECTED] wrote:

Hi,

You do not mention whether the new field's data is stored as metadata. Is the value created during the parse phase, or is it added only during the index phase?

If your new field is created during the parse process, then you could delete only the parse folders (i.e. delete segment/crawl_parse, segment/parse_data, segment/parse_text) and run bin/nutch parse segment.

Or, if your field data is added during the index process, then just re-create your index. In any case it doesn't seem to me you would need to re-fetch.

HTH
Gal
[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12453975 ]

Sami Siren commented on NUTCH-339:
----------------------------------

Perhaps that exception is just a consequence of something else, like this:

2006-11-27 07:35:09,434 INFO fetcher.Fetcher2 - -activeThreads=296, spinWaiting=204, fetchQueues.totalSize=0
2006-11-27 07:35:09,434 WARN fetcher.Fetcher2 - Aborting with 296 hung threads.
2006-11-27 07:35:09,434 INFO mapred.LocalJobRunner - 3821 pages, 207 errors, 5.5 pages/s, 780 kb/s,

and the next log entry is:

2006-11-27 07:35:15,443 INFO mapred.JobClient - map 100% reduce 0%
[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12454045 ]

Sami Siren commented on NUTCH-339:
----------------------------------

I am running with 300 threads, and in parsing mode a thread dump shows:

191 threads waiting on condition:
at java.lang.Thread.sleep(Native Method)
at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:422)

71 waiting for monitor entry:
at org.apache.nutch.fetcher.Fetcher2$FetchItemQueues.getFetchItem(Fetcher2.java:306)
- waiting to lock <0x52fa7328> (a org.apache.nutch.fetcher.Fetcher2$FetchItemQueues)
at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:415)

The rest are runnable. CPU usage starts low but ramps up very quickly, and the machine becomes almost unresponsive. Fetching speed is low because all the CPU goes to something else.
Indexing and Re-crawling site
Hi guys,

I have a few questions regarding the way nutch indexes and the best way a re-crawl can be implemented.

1. Why does nutch have to create a new index every time it indexes, when it could just merge with the old existing index? I tried changing the value in the IndexMerger class to 'false' while creating an index, so that Lucene doesn't recreate a new index each time it indexes. The problem is that I keep getting an exception when it tries to merge the indexes: a lock timeout exception is thrown by the IndexMerger, and consequently the index doesn't get created properly. Is it possible to let nutch index by merging with an existing index? I have to crawl about 100GB of data, and if only a few documents have changed, I don't want nutch to recreate a new index because of that, but rather update the existing index by merging it with the new one. I need some light on this.

2. What is the best way to make nutch re-crawl? I have implemented a class that loops the crawl process; it has a crawl interval, which is set in a property file, and a running status. The running status is a boolean variable which is set to true if the re-crawl process is ongoing, or false if it should stop. But with this approach, it seems that the index is not being fully generated: the values in the index cannot be queried. The re-crawl is in java, which calls an underlying ant script to run nutch. I know most re-crawls are written as batch scripts, but can you tell me which one you recommend? A batch script or a loop-based java program?

3. What is the best way of implementing nutch as a Windows service or Unix daemon?

Thanks,
Armel
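For question 2, the loop-based java approach could be sketched like this. The class name is hypothetical, the step list follows the usual generate/fetch/updatedb/invertlinks/index cycle from the whole-web crawling tutorial (verify the exact arguments against your Nutch version), and `<segment>` stands for the segment directory that generate creates:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of a loop-based re-crawl driver that shells out to bin/nutch for
// each step of a crawl cycle, instead of going through an ant script.
public class RecrawlLoopSketch {

    // One full crawl cycle as bin/nutch invocations (sketch, not exhaustive).
    static List<String> cycleCommands(String crawlDir) {
        return Arrays.asList(
            "bin/nutch generate " + crawlDir + "/crawldb " + crawlDir + "/segments",
            "bin/nutch fetch " + crawlDir + "/segments/<segment>",
            "bin/nutch updatedb " + crawlDir + "/crawldb " + crawlDir + "/segments/<segment>",
            "bin/nutch invertlinks " + crawlDir + "/linkdb " + crawlDir + "/segments/<segment>",
            "bin/nutch index " + crawlDir + "/indexes " + crawlDir + "/crawldb "
                + crawlDir + "/linkdb " + crawlDir + "/segments/<segment>");
    }

    public static void main(String[] args) throws Exception {
        long intervalMs = 24L * 60 * 60 * 1000; // crawl interval, from a property file
        boolean running = true;                  // the "running status" flag
        while (running) {
            for (String cmd : cycleCommands("crawl")) {
                // In a real driver, run the step and check its exit code:
                // new ProcessBuilder(cmd.split(" ")).inheritIO().start().waitFor();
                System.out.println(cmd);
            }
            running = false; // single pass here; a real loop would Thread.sleep(intervalMs)
        }
    }
}
```

One advantage of driving each step explicitly is that the exit code of every stage can be checked before indexing starts, which may help diagnose the partially generated, unqueryable index described above.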
Re: implement thai language indexing and search
I used an existing ThaiAnalyzer which was in the lucene package. OK - I renamed lucene.analysis.th.* to nutch.analysis.th.*, compiled, and placed all the class files in a jar - analysis-th.jar. (Do I need to bundle the ngp file in the jar as well?)

1. You don't have to refactor the lucene analyzer. Just wrap it, like I do with the french and german analyzers (they both use analyzers from lucene).

2. The analyzer doesn't need ngp files... I think you misunderstood something:
2.1 On one side, there is the language identifier, which uses NGP files to identify the language of a document.
2.2 On the other side, if a suitable analyzer is found for the identified language, it is used to analyze the document.

Regards
Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
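The wrapping Jérôme describes is plain delegation. The sketch below uses stand-in types because the real classes aren't available here; in an actual plugin you would extend org.apache.nutch.analysis.NutchAnalyzer and delegate to Lucene's ThaiAnalyzer rather than renaming its sources:

```java
import java.io.Reader;

// Stand-in for Lucene's TokenStream.
interface TokenStream { }

// Stand-in for Lucene's Analyzer base class.
abstract class Analyzer {
    public abstract TokenStream tokenStream(String fieldName, Reader reader);
}

// Stand-in for the existing Lucene ThaiAnalyzer being reused.
class ThaiAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new TokenStream() { }; // real Thai tokenization lives in Lucene
    }
}

// The plugin-side class: no refactoring of the Lucene code, it only holds
// the existing analyzer and delegates to it.
public class ThaiAnalyzerWrapper extends Analyzer {
    private final Analyzer delegate = new ThaiAnalyzer();

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return delegate.tokenStream(fieldName, reader);
    }
}
```

This is the same pattern the french and german analysis plugins follow: the plugin jar only needs the thin wrapper, while the actual analysis code stays in the lucene jar.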