Re: bug with generate performance
Doğacan Güney wrote:

Others have also reported a problem with generate performance. It seems we have a problem here but I can not reproduce this behaviour so I am not sure what causes it. Can you open a JIRA issue and enter your comments there? Also, how you are running generate will be very helpful (what is generate.max.per.host? what is -topN argument, etc.)

Also the value of generate.max.per.host.by.ip - this could be a DNS-related issue.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Meta Tags and Indexing
Hello everyone,

I'm working on a project that is essentially a searchable database of citations (academic ones). Nutch, naturally, was the searching tool I decided to use because of its full-featuredness. And cost.

Anyway, we had the requirement to be able to sort results by year (for instance) and restrict results based on type (journal article, book, etc.). I couldn't find a way to label the results (to use a Google term), so I ended up writing a plugin to do so. Based heavily on the example plugin for Nutch, this plugin adds data found in HTML page meta fields to the index, and allows one to optionally use them in querying (or sorting). Not as full-featured as Google's labeling solution, perhaps, but it's a start.

Just thought I'd post it in case it saves anybody else some time...

Link: http://upclose.lrdc.pitt.edu/people/maki_assets/metatag.tar.gz

Thanks!
-Jeff
Re: Limiting outlink tags.
Hi Marcin,

On 9/7/07, Marcin Okraszewski [EMAIL PROTECTED] wrote:

Hi, I have noticed that Nutch considers img/@src as an outlink. I suppose in many cases people do not want to treat an image as an outlink. At least I don't. The same goes for script/@src. But it seems there is no way to limit outlink tags. DOMContentUtils.getOutlinks() takes links from all of a, area, form, frame, iframe, script, link, img. Only the form element can be turned off, via the parser.html.form.use_action parameter. I would suggest introducing a new configuration parameter that could be used to turn certain elements on or off. It could be done with a single parameter containing a comma-separated list of tags to be turned off. What is your opinion? If you think it is a valid issue I can make a patch for this.

There is already NUTCH-488 open for this (with a patch). Feel free to add comments/patches/etc. there. Btw, I agree that using a CSV is better than using a new configuration parameter for every tag.

Regards, Marcin

--
Doğacan Güney
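
The comma-separated approach discussed in this thread could look roughly like the sketch below. This is only an illustration: the class OutlinkTagFilter and the idea of feeding it a value such as "img, script" are hypothetical and are not taken from Nutch or from the NUTCH-488 patch. Only the list of link-bearing elements comes from the message above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: decides which elements may yield outlinks.
class OutlinkTagFilter {

  // Elements that, per the thread, DOMContentUtils.getOutlinks() takes links from.
  static final Set<String> LINK_TAGS = new LinkedHashSet<>(Arrays.asList(
      "a", "area", "form", "frame", "iframe", "script", "link", "img"));

  private final Set<String> ignored;

  // csv is the value of an assumed configuration property, e.g. "img, script".
  OutlinkTagFilter(String csv) {
    this.ignored = parseIgnoredTags(csv);
  }

  // Splits on commas, trims whitespace, lower-cases, and drops empty entries.
  static Set<String> parseIgnoredTags(String csv) {
    Set<String> tags = new HashSet<>();
    if (csv == null) {
      return tags;
    }
    for (String part : csv.split(",")) {
      String tag = part.trim().toLowerCase(Locale.ROOT);
      if (!tag.isEmpty()) {
        tags.add(tag);
      }
    }
    return tags;
  }

  // True if the element is a known link-bearing tag that has not been turned off.
  boolean accepts(String elementName) {
    String tag = elementName.toLowerCase(Locale.ROOT);
    return LINK_TAGS.contains(tag) && !ignored.contains(tag);
  }

  public static void main(String[] args) {
    OutlinkTagFilter filter = new OutlinkTagFilter("img, script");
    for (String tag : LINK_TAGS) {
      System.out.println(tag + " -> " + filter.accepts(tag));
    }
  }
}
```

A single list-valued setting keeps the configuration small: turning off another element means adding one token to the list instead of introducing another boolean parameter.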
Re: bug with generate performance
Hi,

On 8/31/07, misc [EMAIL PROTECTED] wrote:

Hello-

I am almost certain I have found a nasty bug with nutch generate.

Problem: Nutch generate can take many hours, even a day, to complete (on a crawldb that has less than 2 million URLs).

I added debug code to Generator-Selector.map to see when map is called and returns, and observed interesting behavior, described here:

1. Most of the time, when generate is run, URLs are processed in chunky batches, usually about 40 at a time, followed by a 1 second delay. I timed the delay, and it really is a 1 second delay (i.e., 30 batches took 30 seconds). When this happens it takes hours to complete.

2. Sometimes (randomly as far as I can tell) when I run nutch, the URLs are processed without delays. It is an all or nothing event: either I run and all URLs process quickly without delay (in minutes), or more likely I get the chunky processing with many 1 second delays and the program takes hours to end. The one exception is point 3.

3. When the processing runs quickly I've seen the main thread end (I have some profiling going, so I know when a thread ends), and then more likely than not a second thread begins where the first starts, chunky like usual. Although I sometimes can get fast processing in one thread, it is almost impossible for me to get it in all threads, and therefore general processing is very slow (hours).

4. I tried to put in more debug code to find the line where the delays occurred, but the last line printed to the log at a delay seemed random, leading me to believe that the log is not being flushed uniformly.

5. The profiler I used seemed to imply that about 100% of the time was spent in java.lang.Thread.sleep. I am not completely familiar with the profiler I used, so I am not completely sure I interpreted this correctly.

I will keep debugging here, but perhaps someone here has some insight into what might be happening?

Others have also reported a problem with generate performance. It seems we have a problem here but I can not reproduce this behaviour so I am not sure what causes it. Can you open a JIRA issue and enter your comments there? Also, how you are running generate will be very helpful (what is generate.max.per.host? what is -topN argument, etc.)

thanks
-J

--
Doğacan Güney
[jira] Created: (NUTCH-550) Parse fails if db.max.outlinks.per.page is -1
Parse fails if db.max.outlinks.per.page is -1

Key: NUTCH-550
URL: https://issues.apache.org/jira/browse/NUTCH-550
Project: Nutch
Issue Type: Bug
Reporter: Doğacan Güney
Assignee: Doğacan Güney
Fix For: 1.0.0
Attachments: NUTCH-550.patch

See here http://www.nabble.com/nutch-nightly%3A-IllegalArgumentException%3A-Illegal-Capacity%3A--1-tf4245360.html#a12511661

One of (my|the) earlier commits broke ParseOutputFormat such that if db.max.outlinks.per.page is -1, ParseOutputFormat tries to create an ArrayList of size -1. The solution is to simply make the maxOutlinks variable Integer.MAX_VALUE if db.max.outlinks.per.page is -1.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
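
The fix described in NUTCH-550 can be illustrated with the sketch below. It is not the attached patch: the class OutlinkLimit and its methods are hypothetical, and treating every negative value (not just -1) as "unlimited" is an assumption. What is certain is that the ArrayList constructor rejects a negative initial capacity with an IllegalArgumentException, which matches the "Illegal Capacity: -1" error in the linked thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the NUTCH-550 fix, not the actual patch.
class OutlinkLimit {

  // Maps the configured value to an effective limit.
  // Assumption: any negative value means "no limit"; the issue only mentions -1.
  static int effectiveMaxOutlinks(int configured) {
    return configured < 0 ? Integer.MAX_VALUE : configured;
  }

  // Keeps at most the effective number of outlinks.
  static List<String> limit(List<String> outlinks, int configured) {
    int max = effectiveMaxOutlinks(configured);
    // Size the list by what is actually present, so the capacity is
    // never negative and never Integer.MAX_VALUE.
    int size = Math.min(max, outlinks.size());
    List<String> kept = new ArrayList<>(size);
    for (int i = 0; i < size; i++) {
      kept.add(outlinks.get(i));
    }
    return kept;
  }
}
```

Sizing the list with Math.min matters once -1 is mapped to Integer.MAX_VALUE: passing that value straight to the ArrayList constructor would ask for an enormous backing array.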
Re: Labeling URLs a-la Google
Hi Jeff,

Nice. Could you submit this to JIRA as a patch?

Otis

Simpy -- http://www.simpy.com/ - Tag - Search - Share

----- Original Message -----
From: Jeff Maki [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Thursday, September 6, 2007 4:04:18 PM
Subject: Labeling URLs a-la Google

Hello everybody,

I'm working on a project that is essentially a searchable database for academic citations at the University of Pittsburgh. One of our searching requirements was to be able to break the search results into sections--in order to do this, I implemented something similar to Google's labels. It's based heavily on the example plugin, and maybe not so pretty code-wise, but it's a start.

Downloadable here: http://upclose.lrdc.pitt.edu/people/maki_assets/nutch-regex-label.tar.gz

You configure it by adding something like the below to your nutch-site.xml file:

    <property>
      <name>extension.regexlabeler.labels</name>
      <value>
        http://dev3\.informalscience\.org/research.*\.php.* = firsttag secondtag thirdtag,
        http://dev3\.informalscience\.org/project.*\.php.* = project,
        http://www.?\.informalscience\.org.* = oldsite,
        http://dev3\.informalscience\.org.* = devsite
      </value>
    </property>

Notes:
* Format of each line is regular expression=labels, space delimited
* URLs must be unique.
* Multiple tags for the same pattern are delimited by a space.

Hope this saves somebody some time,
-Jeff

(BTW, Nutch has worked very well for us--excellent project!)
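
To make the configuration format above concrete, here is a rough sketch of how such a value could be parsed and applied to a URL. It is not the code from the plugin: the class RegexLabelerSketch is hypothetical, and several details are assumptions, namely that entries are one per line with an optional trailing comma, that the last "=" separates the pattern from its labels, that a pattern must match the whole URL, and that labels from every matching pattern are collected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of a regex-to-labels mapping; not the plugin's code.
class RegexLabelerSketch {

  private static final class Rule {
    final Pattern pattern;
    final List<String> labels;

    Rule(Pattern pattern, List<String> labels) {
      this.pattern = pattern;
      this.labels = labels;
    }
  }

  private final List<Rule> rules = new ArrayList<>();

  // value is the text of the property, one "regex = label label ..." entry per line.
  RegexLabelerSketch(String value) {
    for (String line : value.split("\\r?\\n")) {
      String entry = line.trim();
      if (entry.endsWith(",")) {
        entry = entry.substring(0, entry.length() - 1).trim();
      }
      int eq = entry.lastIndexOf('=');
      if (entry.isEmpty() || eq < 0) {
        continue; // skip blank or malformed entries
      }
      String regex = entry.substring(0, eq).trim();
      String labels = entry.substring(eq + 1).trim();
      if (regex.isEmpty() || labels.isEmpty()) {
        continue;
      }
      rules.add(new Rule(Pattern.compile(regex), Arrays.asList(labels.split("\\s+"))));
    }
  }

  // Returns the labels of every rule whose pattern matches the whole URL, in rule order.
  List<String> labelsFor(String url) {
    List<String> result = new ArrayList<>();
    for (Rule rule : rules) {
      if (rule.pattern.matcher(url).matches()) {
        result.addAll(rule.labels);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    String value =
        "http://dev3\\.informalscience\\.org/research.*\\.php.* = firsttag secondtag thirdtag,\n"
        + "http://dev3\\.informalscience\\.org.* = devsite";
    RegexLabelerSketch labeler = new RegexLabelerSketch(value);
    System.out.println(labeler.labelsFor("http://dev3.informalscience.org/research/list.php"));
  }
}
```

Splitting on the last "=" is a deliberate choice in this sketch: URL patterns may themselves contain "=" (for example in query strings), while labels are plain tokens.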
[jira] Created: (NUTCH-551) performance for generate is often really bad
performance for generate is often really bad

Key: NUTCH-551
URL: https://issues.apache.org/jira/browse/NUTCH-551
Project: Nutch
Issue Type: Bug
Components: generator
Affects Versions: 0.9.0
Environment: Ubuntu, Core duo 2.4 GHz, 1 gig ram, 750GB hard drive. The ethernet connection has a dedicated 1gb connection to the web, so certainly that isn't a problem. I have tested on nutch 0.9 and the newest daily build from 2007-08-28.
Reporter: Jim

Generate often takes many hours to finish (6+), where I would expect it to be done in minutes. This behavior has been observed for topN of small (~100) and large (~100) values. Other configuration values are

generate.max.per.host: -1
generate.max.per.host.by.ip: false

I added debug code to Generator-Selector.map to see when map is called and returns, and observed interesting behavior, described here:

1. Most of the time, when generate is run, URLs are processed in chunky batches, usually about 40 at a time, followed by a 1 second delay. I timed the delay, and it really is a 1 second delay (i.e., 30 batches took 30 seconds). When this happens it takes hours to complete.

2. Sometimes (randomly as far as I can tell) when I run nutch, the URLs are processed without delays. It is an all or nothing event: either I run and all URLs process quickly without delay (in minutes), or more likely I get the chunky processing with many 1 second delays and the program takes hours to end. The one exception is point 3.

3. When the processing runs quickly I've seen the main thread end (I have some profiling going, so I know when a thread ends), and then more likely than not a second thread begins where the first starts, chunky like usual. Although I sometimes can get fast processing in one thread, it is almost impossible for me to get it in all threads, and therefore general processing is very slow (hours).

4. I tried to put in more debug code to find the line where the delays occurred, but the last line printed to the log at a delay seemed random, leading me to believe that the log is not being flushed uniformly. The timestamps in the log always indicate that the delay is either right before or right after the first log item in the map function.

5. The profiler I used seemed to imply that about 100% of the time was spent in java.lang.Thread.sleep. I am not completely familiar with the profiler I used, so I am not completely sure I interpreted this correctly.
Re: bug with generate performance
Hello-

I've opened a bug report and included the extra required information (generate.max.per.host = -1, error seen with small topN around 100 and large topN around 100).

I've since tried to run with a debugger, but the slowness went away (ugh). I also know that DNS lookups are not the problem, as I ran with wireshark running and there were no DNS lookups.

thanks
-Jim

Others have also reported a problem with generate performance. It seems we have a problem here but I can not reproduce this behaviour so I am not sure what causes it. Can you open a JIRA issue and enter your comments there? Also, how you are running generate will be very helpful (what is generate.max.per.host? what is -topN argument, etc.)
[jira] Commented: (NUTCH-551) performance for generate is often really bad
[ https://issues.apache.org/jira/browse/NUTCH-551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525882 ]

Jim commented on NUTCH-551:

It is really maddening, but I can not reproduce the bug with the jdb debugger attached. Whenever I run with jdb, generate just finishes quickly (in minutes).

On the bright side I am now certain that there *is* a bug, because I can see from start to finish how long generate should take (minutes as opposed to hours).

Also, I've been watching the map/reduce progress log and the output log at the same time, and have verified that the chunkiness has something to do with the slowdown. The logs show the map progressing in a steady fashion until the logs start pausing every second for a second. After that the percentage advances only very slowly.
Pl...Give me example
Hi all,

I am new to Nutch. I need a sample Java program for Nutch crawl and search. Could anyone guide me regarding this, or tell me where I can get the code? I got the nutch-0.9 war and deployed it, but I am not able to understand it. Could anyone please help me?

--
View this message in context: http://www.nabble.com/Pl...Give-me-example-tf4404885.html#a12566694
Sent from the Nutch - Dev mailing list archive at Nabble.com.