[jira] Created: (NUTCH-305) Update crawl and url filter lists to exclude jpeg|JPEG|bmp|BMP
Update crawl and url filter lists to exclude jpeg|JPEG|bmp|BMP
Key: NUTCH-305
URL: http://issues.apache.org/jira/browse/NUTCH-305
Project: Nutch
Type: Bug
Versions: 0.8-dev
Reporter: chris finne
Re: anchor text modifications
Brian Higgins wrote:
Hi, I'm pretty new to Nutch and I'm trying to modify the code so it stores the words before and after a hyperlink as well as the anchor text. I've been looking through the Nutch code for a couple of days and I'm still a little unclear as to the layout... Nutch parses incoming web pages in HTMLParser.java, right? I can't seem to find the code in there for URL processing, though - where exactly does it parse the anchor text and write it to the database?

It collects outlinks in DOMContentUtils.getOutlinks. You will need to get the preceding sibling nodes, or a parent node, to collect more of the surrounding text.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
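As a rough illustration of Andrzej's suggestion (a standalone sketch using the standard org.w3c.dom API, not Nutch's actual DOMContentUtils code; the class and method names below are made up for the example):

  import org.w3c.dom.Node;

  // Hypothetical helper: gather the text immediately before and after an
  // anchor element, so it can be stored alongside the anchor text.
  public class SurroundingText {
    static String around(Node anchor) {
      StringBuffer buf = new StringBuffer();
      Node prev = anchor.getPreviousSibling();          // text just before the link
      if (prev != null && prev.getNodeType() == Node.TEXT_NODE) {
        buf.append(prev.getNodeValue().trim()).append(' ');
      }
      Node next = anchor.getNextSibling();              // text just after the link
      if (next != null && next.getNodeType() == Node.TEXT_NODE) {
        buf.append(next.getNodeValue().trim());
      }
      return buf.toString().trim();
    }
  }

In DOMContentUtils.getOutlinks you would call something like this for each anchor node before creating the Outlink, and walk further siblings (or up to the parent) if a single text node does not give enough context.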
[jira] Updated: (NUTCH-305) Update crawl and url filter lists to exclude jpeg|JPEG|bmp|BMP
[ http://issues.apache.org/jira/browse/NUTCH-305?page=all ]

Stefan Neufeind updated NUTCH-305:
Attachment: suffix-urlfilter.txt

Find attached a suffix-urlfilter.txt that might be interesting to some people. More contributions are welcome at any time. Maybe we should ship such a list and use the suffix filter instead of regex to filter by document extension?

Update crawl and url filter lists to exclude jpeg|JPEG|bmp|BMP
Key: NUTCH-305
URL: http://issues.apache.org/jira/browse/NUTCH-305
Project: Nutch
Type: Bug
Versions: 0.8-dev
Reporter: chris finne
Attachments: suffix-urlfilter.txt
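For context, the usual way to exclude these types with the regex URL filter is to extend the suffix line in conf/crawl-urlfilter.txt (or regex-urlfilter.txt). The exact default line differs between versions, so treat the following as an illustrative sketch rather than the shipped configuration:

  # skip URLs ending in image and other binary suffixes (illustrative)
  -\.(gif|GIF|jpg|JPG|jpeg|JPEG|bmp|BMP|ico|ICO|zip|gz|exe)$

A suffix-based filter like the attached list would express the same exclusions as a plain list of extensions rather than one long regular expression, which is easier to maintain and cheaper to evaluate.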
Adding new urls in WebDB
Hi all! I have some problems updating my WebDB. I have a page, test.htm, that has 4 links to 4 PDF documents. I run the crawler, and when I run this command:

  bin/nutch readdb Mydir/db -stats

I get this output:

  Number of pages: 5
  Number of links: 4

That's OK. The problem is when I add 4 more links to test.htm. I want a script that re-crawls or updates my WebDB without my having to delete the Mydir folder. I hope I am being clear. I found some shell scripts to do this, but they don't do what I want. I always get the same number of pages and links. Can anyone help me?

--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]
Re: Adding new urls in WebDB
Lourival Júnior wrote:
Hi all! I have some problems updating my WebDB. I have a page, test.htm, that has 4 links to 4 PDF documents. I run the crawler, and when I run this command: bin/nutch readdb Mydir/db -stats I get this output: Number of pages: 5 Number of links: 4 That's OK. The problem is when I add 4 more links to test.htm. I want a script that re-crawls or updates my WebDB without my having to delete the Mydir folder. I hope I am being clear. I found some shell scripts to do this, but they don't do what I want. I always get the same number of pages and links. Can anyone help me?

Hi,

please re-read the mailing list archives as of ... hmm ... yesterday, I think. You'll have to make a small modification to be able to re-inject your URL so that it is re-crawled on the next run. Otherwise a page will only be re-crawled after a configurable number of days, and the same value is also used for the PDFs.

Regards,
Stefan
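The "configurable amount of days" Stefan refers to is the page's re-fetch interval. In 0.7/0.8-era configurations this is typically the db.default.fetch.interval property (value in days); the exact property name should be verified against nutch-default.xml for your version. An illustrative override in conf/nutch-site.xml:

  <property>
    <!-- illustrative sketch: shorten the default re-fetch interval to 7 days;
         check the property name against nutch-default.xml before relying on it -->
    <name>db.default.fetch.interval</name>
    <value>7</value>
  </property>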
Re: Adding new urls in WebDB
Hi Stefan,

Sorry, I couldn't find the mail you referred to :(. Look at this shell script (I'm using Cygwin on Windows 2000):

  #!/bin/bash
  # Set JAVA_HOME to reflect your system's Java configuration
  export JAVA_HOME=/cygdrive/c/Arquivos\ de\ programas/Java/jre1.5.0
  # Start index update
  bin/nutch generate crawl-LEGISLA/db crawl-LEGISLA/segments -topN 1000
  s=`ls -d crawl-LEGISLA/segments/2* | tail -1`
  echo Segment is $s
  bin/nutch fetch $s
  bin/nutch updatedb crawl-LEGISLA/db $s
  bin/nutch analyze crawl-LEGISLA/db 5
  bin/nutch index $s
  bin/nutch dedup crawl-LEGISLA/segments crawl-LEGISLA/tmpfile
  # Merge segments to prevent "too many open files" exception in Lucene
  bin/nutch mergesegs -dir crawl-LEGISLA/segments -i -ds
  s=`ls -d crawl-LEGISLA/segments/2* | tail -1`
  echo Merged Segment is $s
  rm -rf crawl-LEGISLA/index

I found it on the Nutch project wiki. It has some errors at execution time. I don't know whether it is correct... Do you have another example of how to do this job?

On 6/9/06, Stefan Neufeind wrote:
You'll have to make a small modification to be able to re-inject your URL so that it is re-crawled on the next run. Otherwise a page will only be re-crawled after a configurable number of days, and the same value is also used for the PDFs.

--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]
Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
Thanks, Chris! (And thank you, Andrzej, for interpreting my rantings!) That plan sounds fantastic and I would be happy to help out.

Scott

On Jun 5, 2006, at 1:01 PM, Chris Mattmann wrote:

Hi Andrzej,

The main problem, as Scott observed, is that the static flag affects all instances of the task executing inside the same JVM. If there are several Fetcher tasks (or any other tasks that check for the SEVERE flag!), belonging to different jobs, all of them will quit. This is certainly not the intended behavior.

Got it. In fact, I believe that this would make a fantastic anti-pattern.

If this kind of behavior is *really* wanted (and I argue below that it should not be), it should be done through an explicit mechanism, not as a side-effect. I have a proposal for a simple solution: set a flag in the current Configuration instance, and check for this flag. The Configuration instance provides a task-specific context persisting throughout the lifetime of a task - but limited only to that task. Voila - problem solved. We get rid of the dubious use of LogFormatter (I hope, Chris, that even you would agree that this pattern is slightly .. unusual ;) )

What, unusual? Huh? :-)

and we gain a flexible mechanism limited in scope to the current task, which ensures isolation from other tasks in the same JVM. How about that?

+1 I like your proposed solution. I haven't really used multiple fetchers inside the same process too much; however, I do have an application that calls fetches in a more sequential way in the same JVM. So, I guess I just never ran across the behavior. The thing I like about the proposed solution is its separation and isolation of a task context, which I think Nutch (now relying on Hadoop as the underlying architectural computing platform) needed to address.

So, to summarize, the proposed resolution is:
* add a flag field in the Configuration instance to signify whether or not a SEVERE error has been logged within a task's context
* check this field within the fetcher to determine whether or not to stop the fetcher, just for that fetching task identified by its Configuration (and no others)

Is this representative of what you're proposing, Andrzej? If so, I'd like to take the lead on contributing a small patch that handles this, and then it would be great if people like Scott could test it out in their existing environments where this error was manifesting itself. Thanks!

Cheers,
Chris

(BTW: would you like me to re-open the JIRA issue, or do you want to do it?)

--
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
Jet Propulsion Laboratory, Pasadena, CA
Office: 171-266B  Mailstop: 171-246
Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Adding new urls in WebDB
This one here:
http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04829.html

Regards,
Stefan
Re: Adding new urls in WebDB
Thanks a lot!

On 6/9/06, Stefan Neufeind wrote:
This one here:
http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04829.html

--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]
Nutch logging questions
Hi,

I'm currently working on NUTCH-303 so that Nutch uses the Commons Logging facade API and log4j as the default implementation. All the code is now switched over to the Commons Logging API, and I have replaced some System.out and printStackTrace calls with Commons Logging. To finalize this patch, my remaining questions are about the configuration:

1. Should the back-end and the front-end have the same logging configuration?
2. What kind of configuration do you think is the best default? For now, I have used the same log4j properties as Hadoop (see http://svn.apache.org/viewvc/lucene/hadoop/trunk/conf/log4j.properties?view=markup&pathrev=411254) for the back-end, and I was thinking of using stdout for the front-end. What do you think about this?
3. When using the default DRFA appender (Daily Rolling File Appender) in Nutch, should I log to the Hadoop log file or to a Nutch-specific file?

Thanks for your feedback.

Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Re: Nutch logging questions
Jérôme Charron wrote:
For now, I have used the same log4j properties as Hadoop (see http://svn.apache.org/viewvc/lucene/hadoop/trunk/conf/log4j.properties?view=markup&pathrev=411254) for the back-end, and I was thinking of using stdout for the front-end. What do you think about this?

We should use console rather than stdout, so that it can be distinguished from application output.

http://issues.apache.org/jira/browse/HADOOP-292

Doug
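For readers unfamiliar with the distinction: a named "console" appender writing to System.err keeps log output separate from whatever the application itself prints on System.out. A minimal sketch of such a log4j.properties fragment (the appender name and root level here are assumptions, not the configuration that was eventually committed):

  log4j.rootLogger=INFO,console
  # "console" appender: write to System.err so log lines do not mix with application stdout
  log4j.appender.console=org.apache.log4j.ConsoleAppender
  log4j.appender.console.target=System.err
  log4j.appender.console.layout=org.apache.log4j.PatternLayout
  log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n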
[jira] Updated: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
[ http://issues.apache.org/jira/browse/NUTCH-258?page=all ]

Chris A. Mattmann updated NUTCH-258:
Attachment: NUTCH-258.Mattmann.060906.patch.txt

Hi Folks,

Attached is a patch that implements the two suggested fixes to this issue. I had to go through the Nutch code and look for LOG.severe calls, and then add an additional:

  conf.set(NutchConfiguration.LOG_SEVERE_FIELD, NutchConfiguration.LOG_SEVERE);

at the bottom of each one. I also had to go through several places in the code where SEVERE errors were being logged and make sure that those pieces of code had access to the Configuration object. I ran unit-level tests and compilation, but no system-level tests. Could Scott or someone else who was experiencing this problem test out this patch and let me know if it fixes the issue? Thanks!

Cheers,
Chris

Once Nutch logs a SEVERE log item, Nutch fails forevermore
Key: NUTCH-258
URL: http://issues.apache.org/jira/browse/NUTCH-258
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Environment: All
Reporter: Scott Ganyo
Assignee: Chris A. Mattmann
Priority: Critical
Attachments: NUTCH-258.Mattmann.060906.patch.txt, dumbfix.patch

Once a SEVERE log item is written, Nutch shuts down any fetching forevermore. This is from the run() method in Fetcher.java:

  public void run() {
    synchronized (Fetcher.this) {activeThreads++;} // count threads
    try {
      UTF8 key = new UTF8();
      CrawlDatum datum = new CrawlDatum();
      while (true) {
        if (LogFormatter.hasLoggedSevere())   // something bad happened
          break;                              // exit

Notice the last 2 lines. This will prevent Nutch from ever fetching again once this is hit, as LogFormatter stores this data in a static. (Also note that LogFormatter.hasLoggedSevere() is also checked in org.apache.nutch.net.URLFilterChecker and will disable this class as well.) This must be fixed or Nutch cannot be run as any kind of long-running service. Furthermore, I believe it is a poor decision to rely on a logging event to determine the state of the application - this could have any number of side-effects that would be extremely difficult to track down. (As it already has for me.)
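As a minimal, self-contained sketch of the pattern the patch describes (the Configuration key and value strings below are placeholders; the actual patch defines them on NutchConfiguration as LOG_SEVERE_FIELD and LOG_SEVERE):

  import org.apache.hadoop.conf.Configuration;

  // Sketch of the per-task SEVERE flag: set it in the task's own Configuration
  // wherever a SEVERE error is logged, and check it instead of the static
  // LogFormatter.hasLoggedSevere() in Fetcher.run().
  public class SevereFlagSketch {
    static final String LOG_SEVERE_FIELD = "logger.severe"; // placeholder key name
    static final String LOG_SEVERE = "true";                // placeholder value

    // Call next to LOG.severe(...) so only this task's context is marked.
    static void markSevere(Configuration conf) {
      conf.set(LOG_SEVERE_FIELD, LOG_SEVERE);
    }

    // Checked inside the fetcher loop; other tasks in the same JVM are unaffected.
    static boolean hasLoggedSevere(Configuration conf) {
      return LOG_SEVERE.equals(conf.get(LOG_SEVERE_FIELD));
    }
  }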
[jira] Created: (NUTCH-306) DistributedSearch.Client liveAddresses concurrency problem
DistributedSearch.Client liveAddresses concurrency problem
Key: NUTCH-306
URL: http://issues.apache.org/jira/browse/NUTCH-306
Project: Nutch
Type: Bug
Components: searcher
Versions: 0.7, 0.8-dev
Reporter: Grant Glouser
Priority: Critical

Under heavy load, hits returned by DistributedSearch.Client can become out of sync with the Client's live server list. DistributedSearch.Client maintains an array of live search servers (liveAddresses). This array is updated at intervals by a watchdog thread. When the Client returns hits from a search, it tracks which hits came from which server by saving an index into the liveAddresses array (as Hit.indexNo).

The problem occurs when the search servers cannot service some remote procedure calls before the client times out (due to heavy load, for example). If the Client returns some Hits from a search, and then the array of liveAddresses changes while the Hits are still being used, the indexNos for those Hits can become invalid, referring to different servers than the Hit originated from (or no server at all!).

Symptoms of this problem include:
- ArrayIndexOutOfBoundsException (when the array of liveAddresses shrinks, a Hit from the last server in liveAddresses in the previous update cycle now has an indexNo past the end of the array)
- IOException: read past EOF (suppose a hit comes back from server A with a doc number of 1000. Then the watchdog thread updates liveAddresses and now the Hit looks like it came from server B, but server B only has 900 documents. Trying to get details for the hit will read past EOF in server B's index.)
- Of course, you could also get a silent failure in which you find a hit on server A, but the details/summary are fetched from server B. To the user, it would simply look like an incorrect or nonsense hit.

We have solved this locally by removing the liveAddresses array. Instead, the watchdog thread updates an array of booleans (same size as the array of defaultAddresses) that indicate whether that address responded to the latest call from the watchdog thread. Hit.indexNo is then always an index into the complete array of defaultAddresses, so it is stable and always valid. Callers of getDetails()/getSummary()/etc. must still be aware that these methods may return null when the corresponding server is unable to respond.
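A minimal sketch of the approach described in the last paragraph (class, field, and method names here are illustrative, not the attached patch): indexNo always refers to the fixed defaultAddresses array, and a parallel boolean array records which servers answered the latest watchdog check.

  import java.net.InetSocketAddress;

  public class LiveServersSketch {
    private final InetSocketAddress[] defaultAddresses; // fixed order, never reshuffled
    private final boolean[] alive;                       // updated by the watchdog thread

    public LiveServersSketch(InetSocketAddress[] defaultAddresses) {
      this.defaultAddresses = defaultAddresses;
      this.alive = new boolean[defaultAddresses.length];
    }

    // Watchdog thread records whether server indexNo responded to the last ping.
    public synchronized void setAlive(int indexNo, boolean responded) {
      alive[indexNo] = responded;
    }

    // Hit.indexNo stays valid because it indexes defaultAddresses directly;
    // callers of getDetails()/getSummary() must handle a currently-down server (null here).
    public synchronized InetSocketAddress addressFor(int indexNo) {
      return alive[indexNo] ? defaultAddresses[indexNo] : null;
    }
  }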
[jira] Updated: (NUTCH-306) DistributedSearch.Client liveAddresses concurrency problem
[ http://issues.apache.org/jira/browse/NUTCH-306?page=all ]

Grant Glouser updated NUTCH-306:
Attachment: DistributedSearch.java-patch

DistributedSearch.Client liveAddresses concurrency problem
Key: NUTCH-306
URL: http://issues.apache.org/jira/browse/NUTCH-306
Project: Nutch
Type: Bug
Components: searcher
Versions: 0.7, 0.8-dev
Reporter: Grant Glouser
Priority: Critical
Attachments: DistributedSearch.java-patch
0.8 release
How would folks feel about releasing 0.8 now? There have been quite a lot of improvements and new features since the 0.7 series, and I strongly feel that we should push the first 0.8 series release (alpha/beta) out the door now. It would, IMO, lower the barrier for first-timers to try the 0.8 series, and that would give us more feedback about the overall quality. If there is a consensus about this, I can volunteer to be the RM.

--
Sami Siren