[jira] Commented: (NUTCH-265) Getting Clustered results in better form.
[ http://issues.apache.org/jira/browse/NUTCH-265?page=comments#action_12413220 ] Dawid Weiss commented on NUTCH-265: --- If you just mean the user interface, then you can simply take the XSLT stylesheet from Carrot2 and reuse it in Nutch with the opensearch XML -- I believe there is even an example in Carrot2 of using opensearch, so you shouldn't have much trouble. Now, the phrases you wish to see on your screen won't always be so beautiful because search results clustering works on snippets extracted from search results. If you want clean and accurate labels then you'd need to use a predefined ontology or something -- I can't help you with that. Try playing around with the Carrot2 demo and see if the results satisfy your needs. If so, then rewriting Nutch's user interface to suit your needs shouldn't be a problem. If your expectations are more demanding then you'll need to think of some other solution. > Getting Clustered results in better form. > - > > Key: NUTCH-265 > URL: http://issues.apache.org/jira/browse/NUTCH-265 > Project: Nutch > Type: Improvement > Components: searcher > Versions: 0.7.2 > Reporter: Kris K > > The cluster results are coming with title and link to URL. For improvement it > should be clustered keyword phrases (Like Vivisimo type). Any person can > share their views on it. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
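As a toy illustration of Dawid's point (this is not Carrot2's actual algorithm, and the class below is invented purely for illustration): snippet-based clustering can only pick labels from terms that recur in the snippets themselves, which is why label quality tracks snippet quality rather than any ontology:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch, NOT Carrot2: derive a cluster label from the snippets
// alone by picking the most frequent token. Real algorithms (e.g.
// suffix-based phrase extraction) are far more sophisticated, but the
// principle is the same: labels come only from the snippet text.
public class SnippetLabeler {
    /** Returns the most frequent token (length >= 4) across snippets. */
    public static String label(List<String> snippets) {
        Map<String, Integer> freq = new HashMap<String, Integer>();
        for (String s : snippets) {
            for (String tok : s.toLowerCase().split("\\W+")) {
                if (tok.length() < 4) continue; // crude short-word filter
                Integer c = freq.get(tok);
                freq.put(tok, c == null ? 1 : c + 1);
            }
        }
        String best = "";
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

If the snippets are noisy, so is the label -- which is exactly why a predefined ontology would be needed for clean phrases like "Java Compiler".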
[jira] Commented: (NUTCH-265) Getting Clustered results in better form.
[ http://issues.apache.org/jira/browse/NUTCH-265?page=comments#action_12413214 ] Kris K commented on NUTCH-265: -- Dear Dawid, Yeah I want the same interface as you showed me before but I am not able to do that. My area of concern is like this: if I am searching for the keyword "Java", the clustered results should come in the following format: Java SourceCode Java Book Java Compiler Java Programming Java Server Java Technology Java Servlets Java Applets JavaScript. I really appreciate your help in pointing me in the right direction. > Getting Clustered results in better form. > - > > Key: NUTCH-265 > URL: http://issues.apache.org/jira/browse/NUTCH-265 > Project: Nutch > Type: Improvement > Components: searcher > Versions: 0.7.2 > Reporter: Kris K > > The cluster results are coming with title and link to URL. For improvement it > should be clustered keyword phrases (Like Vivisimo type). Any person can > share their views on it.
[jira] Closed: (NUTCH-285) LinkDb Fails rename doesn't create parent directories
[ http://issues.apache.org/jira/browse/NUTCH-285?page=all ] Andrzej Bialecki closed NUTCH-285: --- Fix Version: 0.8-dev Resolution: Fixed Fixed, I also applied the same fix in CrawlDb, which suffered from the same problem. > LinkDb Fails rename doesn't create parent directories > - > > Key: NUTCH-285 > URL: http://issues.apache.org/jira/browse/NUTCH-285 > Project: Nutch > Type: Bug > Versions: 0.8-dev > Environment: Windows XP Media Center 2005, Fedora Core 5, Java 5, DFS > Reporter: Dennis Kubes > Fix For: 0.8-dev > Attachments: create_linkdb_dirs.patch > > The LinkDb install method fails to correctly rename (move) the LinkDb working > directory to the final directory if the parent directories do not exist. > For example if I am creating a linkdb by the name of crawl/linkdb the install > method tries to rename the working linkdb directory (something like > linkdb-20060523 in root of DFS) to crawl/linkdb/current. But if the > crawl/linkdb directory does not already exist then the rename fails and the > linkdb-20060523 working directory stays in the root directory of the DFS for > the user. > The attached patch adds a mkdirs command to the install method to ensure that > the parent directories exist before trying to rename.
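The fix described in this issue can be sketched roughly as follows. This uses plain java.io.File for illustration only; the real patch works against the Nutch/Hadoop FileSystem API, and the class and method names here are made up:

```java
import java.io.File;

// Sketch of the NUTCH-285 fix: create the target's parent directories
// before renaming the working directory into place, so the rename does
// not silently fail and strand the working dir at the DFS root.
public class InstallSketch {
    /** Moves workingDir to target, creating target's parents first. */
    public static boolean install(File workingDir, File target) {
        File parent = target.getParentFile();
        if (parent != null && !parent.exists()) {
            parent.mkdirs(); // the essence of the one-line patch
        }
        return workingDir.renameTo(target);
    }
}
```

Without the mkdirs call, renameTo returns false when crawl/linkdb does not yet exist, which matches the failure described above.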
[jira] Created: (NUTCH-286) Handling common error-pages as 404
Handling common error-pages as 404 -- Key: NUTCH-286 URL: http://issues.apache.org/jira/browse/NUTCH-286 Project: Nutch Type: Improvement Reporter: Stefan Neufeind Idea: Some pages from some software-packages/scripts report an "http 200 ok" even though a specific page could not be found. An example I just found is: http://www.deteimmobilien.de/unternehmen/nbjmup;Uipnbt/IfsctuAefufjnnpcjmjfo/ef That's a TYPO3 page explaining in its standard layout and wording: "The requested page did not exist or was inaccessible." So I had the idea that somebody might create a plugin that could find commonly used formulations for "page does not exist" etc. and turn the page into a 404 before feeding it into the nutch-index - although the server responded with status 200 ok.
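A minimal sketch of the proposed plugin's core check. The class name and phrase list below are invented for illustration; a real plugin would make the phrase list configurable per language and per site:

```java
import java.util.regex.Pattern;

// Hypothetical soft-404 check for NUTCH-286: flag a 200 response as a
// "soft 404" when its body matches well-known "not found" wordings,
// so the indexer can treat it like a real 404.
public class Soft404Detector {
    // Illustrative phrase list; a production version would load these
    // from configuration, per language.
    private static final Pattern NOT_FOUND = Pattern.compile(
        "(page (did not exist|not found|could not be found)"
        + "|requested page .* (did not exist|was inaccessible))",
        Pattern.CASE_INSENSITIVE);

    /** Returns true if the body text looks like an error page. */
    public static boolean isSoft404(String body) {
        return NOT_FOUND.matcher(body).find();
    }
}
```

The TYPO3 wording quoted in the issue would match this pattern, while ordinary content would pass through untouched.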
Re: Mailing List nutch-agent Reports of Bots Submitting Forms
Ken Krugler wrote: Jeremy Bensley wrote: There are posts every three or four days to the nutch-agent regarding bots submitting empty forms to websites. I don't think I've seen any regular devs reply in-list to these issues, and am just wondering if these cases are being analyzed. 1. Is there a known (resolved or current) bug regarding Nutch submitting forms? I could find no bug listings in JIRA for this. If it is known and resolved, what versions of the bot exhibit this behavior? Yes, there was a discussion on the list about this - I'm afraid this behavior is present in both 0.7.x and 0.8. I'm going to remove the offending code (or make it an option, turned off by default). I think the biggest issue is following links for a form POST. This definitely seems wrong to me, and thus should never be done. I don't think this is happening anymore, there is an explicit check for POST method in DOMContentUtils that should prevent this. However, some horribly broken HTML may be fooling Neko or TagSoup, so that they lose the 'method' attribute (in which case it defaults to GET). There's a separate issue re whether it's OK to follow form links that do a GET, since that's what the guy complained to us about recently. He agreed that his form should be doing a POST, since it triggers a massive build process, but he also said that no other crawl besides Nutch was following these links. I could see making that a configurable option, where it was false by default. But we'd probably need to modify this setting to be domain-specific, i.e. some sites we crawl require us to follow these types of links to get at content, but in general we'd want to not follow them. For now I modified the code to skip form action URLs, depending on a boolean option. I'll commit this in a moment. This brings up an issue I've been thinking about. It might make sense to require everybody set the user-agent string, versus it having default values that point to Nutch. 
The first time you run Nutch, it would display an error re the user-agent string not being set, but if the instructions for how to do this were explicit, this wouldn't be much of a hardship for anybody trying it out. I could write up some quick text for the Wiki re what a good user agent string should contain, and what should be on the web page that it refers to, since we also went through that same process not too long ago. I like this idea. I know that I've been guilty of this in the past, out of pure laziness ... -- Best regards, Andrzej Bialecki <>< Information Retrieval, Semantic Web, Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
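The startup check proposed in this thread could look roughly like this. The property name follows Nutch's http.agent.name convention, but the sentinel value and class name below are illustrative assumptions, not the actual implementation:

```java
// Hypothetical sketch of a "refuse to run without a real user-agent"
// check: reject an unset, blank, or still-default agent name so every
// operator identifies their own crawler to webmasters.
public class AgentCheck {
    /** Throws if the agent name is unset or still a shipped default. */
    public static void checkUserAgent(String agentName) {
        if (agentName == null || agentName.trim().length() == 0
            || agentName.startsWith("NutchCVS")) { // assumed default sentinel
            throw new IllegalStateException(
                "No agent name set in 'http.agent.name'. "
                + "Please configure it to identify your crawler "
                + "and point it at a page describing your crawl.");
        }
    }
}
```

Called once at fetcher startup, this turns the "set your user-agent" convention into a hard requirement, as Ken suggests.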
Re: Extract infos from documents and query external sites
HellSpawn wrote: > Hi all, I'm new :) > > I have to extract some information from an address book in my site > (example: names and surnames) and then use it to build queries on sites like > scholar.google.com, indexing the result page with my crawler. Can I do it? > How? Not "out of the box". You'd have to figure out building query-strings (I assume they use GET-parameters) from your address book, and you could then "index" those URLs. For me the question though remains why you'd want to do that - but you could :-) Stefan
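Stefan's suggestion can be sketched as follows. The query parameter name "q" and the base URL are assumptions for illustration; check the target site's actual GET parameters before building seed URLs this way:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch: turn an address-book entry into a GET query URL that can be
// injected as a crawl seed. The "q" parameter is an assumption about
// the target site's search form.
public class QueryUrlBuilder {
    /** Builds a search URL for one name from the address book. */
    public static String buildQueryUrl(String base, String name) {
        try {
            return base + "?q=" + URLEncoder.encode(name, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always available
        }
    }
}
```

The resulting URLs would then be fed to the injector like any other seed list, and the crawler fetches and indexes the result pages.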
Re: Mailing List nutch-agent Reports of Bots Submitting Forms
Jeremy Bensley wrote: There are posts every three or four days to the nutch-agent regarding bots submitting empty forms to websites. I don't think I've seen any regular devs reply in-list to these issues, and am just wondering if these cases are being analyzed. 1. Is there a known (resolved or current) bug regarding Nutch submitting forms? I could find no bug listings in JIRA for this. If it is known and resolved, what versions of the bot exhibit this behavior? Yes, there was a discussion on the list about this - I'm afraid this behavior is present in both 0.7.x and 0.8. I'm going to remove the offending code (or make it an option, turned off by default). I think the biggest issue is following links for a form POST. This definitely seems wrong to me, and thus should never be done. There's a separate issue re whether it's OK to follow form links that do a GET, since that's what the guy complained to us about recently. He agreed that his form should be doing a POST, since it triggers a massive build process, but he also said that no other crawl besides Nutch was following these links. I could see making that a configurable option, where it was false by default. But we'd probably need to modify this setting to be domain-specific, i.e. some sites we crawl require us to follow these types of links to get at content, but in general we'd want to not follow them. 2. Are the Nutch Devs replying to the emails sent to this list? I could understand if they are replying off-list, but to an outside observer such as myself it appears as though webmasters are not getting many replies to their inquiries. I can speak for myself only .. I'm not tracking that list. What about others? I did respond to John Masone at MacFixer.net, to get the URL to the form where Nutch was triggering a submit. 
So just FYI for testing the fix, it's: http://www.macfixer.net/contact I don't mean to be alarmist, but I think it is in the community's best interests to make sure that these kinds of complaints get resolved such that nutch is a good 'citizen' and isn't blacklisted from searching sites. Of course you are right, there is no ill will here on our part, just a long queue of issues to address ... but it seems we have to prioritize this one. This brings up an issue I've been thinking about. It might make sense to require everybody set the user-agent string, versus it having default values that point to Nutch. The first time you run Nutch, it would display an error re the user-agent string not being set, but if the instructions for how to do this were explicit, this wouldn't be much of a hardship for anybody trying it out. I could write up some quick text for the Wiki re what a good user agent string should contain, and what should be on the web page that it refers to, since we also went through that same process not too long ago. -- Ken -- Ken Krugler Krugle, Inc. +1 530-210-6378 "Find Code, Find Answers"
[jira] Updated: (NUTCH-285) LinkDb Fails rename doesn't create parent directories
[ http://issues.apache.org/jira/browse/NUTCH-285?page=all ] Dennis Kubes updated NUTCH-285: --- Attachment: create_linkdb_dirs.patch Patch to add mkdirs call to install method of LinkDb to ensure parent directories exist before attempting rename. > LinkDb Fails rename doesn't create parent directories > - > > Key: NUTCH-285 > URL: http://issues.apache.org/jira/browse/NUTCH-285 > Project: Nutch > Type: Bug > Versions: 0.8-dev > Environment: Windows XP Media Center 2005, Fedora Core 5, Java 5, DFS > Reporter: Dennis Kubes > Attachments: create_linkdb_dirs.patch > > The LinkDb install method fails to correctly rename (move) the LinkDb working > directory to the final directory if the parent directories do not exist. > For example if I am creating a linkdb by the name of crawl/linkdb the install > method tries to rename the working linkdb directory (something like > linkdb-20060523 in root of DFS) to crawl/linkdb/current. But if the > crawl/linkdb directory does not already exist then the rename fails and the > linkdb-20060523 working directory stays in the root directory of the DFS for > the user. > The attached patch adds a mkdirs command to the install method to ensure that > the parent directories exist before trying to rename.
[jira] Created: (NUTCH-285) LinkDb Fails rename doesn't create parent directories
LinkDb Fails rename doesn't create parent directories - Key: NUTCH-285 URL: http://issues.apache.org/jira/browse/NUTCH-285 Project: Nutch Type: Bug Versions: 0.8-dev Environment: Windows XP Media Center 2005, Fedora Core 5, Java 5, DFS Reporter: Dennis Kubes Attachments: create_linkdb_dirs.patch The LinkDb install method fails to correctly rename (move) the LinkDb working directory to the final directory if the parent directories do not exist. For example if I am creating a linkdb by the name of crawl/linkdb the install method tries to rename the working linkdb directory (something like linkdb-20060523 in root of DFS) to crawl/linkdb/current. But if the crawl/linkdb directory does not already exist then the rename fails and the linkdb-20060523 working directory stays in the root directory of the DFS for the user. The attached patch adds a mkdirs command to the install method to ensure that the parent directories exist before trying to rename.
Re: Mailing List nutch-agent Reports of Bots Submitting Forms
Jeremy Bensley wrote: There are posts every three or four days to the nutch-agent regarding bots submitting empty forms to websites. I don't think I've seen any regular devs reply in-list to these issues, and am just wondering if these cases are being analyzed. 1. Is there a known (resolved or current) bug regarding Nutch submitting forms? I could find no bug listings in JIRA for this. If it is known and resolved, what versions of the bot exhibit this behavior? Yes, there was a discussion on the list about this - I'm afraid this behavior is present in both 0.7.x and 0.8. I'm going to remove the offending code (or make it an option, turned off by default). 2. Are the Nutch Devs replying to the emails sent to this list? I could understand if they are replying off-list, but to an outside observer such as myself it appears as though webmasters are not getting many replies to their inquiries. I can speak for myself only .. I'm not tracking that list. What about others? I don't mean to be alarmist, but I think it is in the community's best interests to make sure that these kinds of complaints get resolved such that nutch is a good 'citizen' and isn't blacklisted from searching sites. Of course you are right, there is no ill will here on our part, just a long queue of issues to address ... but it seems we have to prioritize this one. -- Best regards, Andrzej Bialecki <>< Information Retrieval, Semantic Web, Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Mailing List nutch-agent Reports of Bots Submitting Forms
There are posts every three or four days to the nutch-agent regarding bots submitting empty forms to websites. I don't think I've seen any regular devs reply in-list to these issues, and am just wondering if these cases are being analyzed. 1. Is there a known (resolved or current) bug regarding Nutch submitting forms? I could find no bug listings in JIRA for this. If it is known and resolved, what versions of the bot exhibit this behavior? 2. Are the Nutch Devs replying to the emails sent to this list? I could understand if they are replying off-list, but to an outside observer such as myself it appears as though webmasters are not getting many replies to their inquiries. I don't mean to be alarmist, but I think it is in the community's best interests to make sure that these kinds of complaints get resolved such that nutch is a good 'citizen' and isn't blacklisted from searching sites. Thanks, Jeremy
[jira] Created: (NUTCH-284) NullPointerException during index
NullPointerException during index - Key: NUTCH-284 URL: http://issues.apache.org/jira/browse/NUTCH-284 Project: Nutch Type: Bug Components: indexer Versions: 0.8-dev Reporter: Stefan Neufeind For quite a while this "reduce > sort" has been going on. Then it fails. What could be wrong with this? 060524 212613 reduce > sort 060524 212614 reduce > sort 060524 212615 reduce > sort 060524 212615 found resource common-terms.utf8 at file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8 060524 212615 found resource common-terms.utf8 at file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8 060524 212619 Optimizing index. 060524 212619 job_jlbhhm java.lang.NullPointerException at org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:111) at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:269) at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:253) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:282) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:114) Exception in thread "main" java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341) at org.apache.nutch.indexer.Indexer.index(Indexer.java:287) at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)
[jira] Commented: (NUTCH-70) duplicate pages - virtual hosts in db.
[ http://issues.apache.org/jira/browse/NUTCH-70?page=comments#action_12413169 ] Stefan Neufeind commented on NUTCH-70: -- Is the content exactly the same? Maybe the page could be checked against an already existing one by an MD5 on the content? But I'm not sure if there is a clean way to work around the problem - what if all pages are the same except one, on the other vhost? Would have to crawl all anyway, wouldn't you? > duplicate pages - virtual hosts in db. > -- > > Key: NUTCH-70 > URL: http://issues.apache.org/jira/browse/NUTCH-70 > Project: Nutch > Type: Bug > Environment: 0,7 dev > Reporter: YourSoft > > Dear Developers, > I have a problem with nutch: > - There are many sites duplicates in the webdb and in the segments. > The source of this problem is: > - If the site make 'virtual hosts' (like Apache), e.g. www.origo.hu, > origo.hu, origo.matav.hu, origo.matavnet.hu etc.: the result pages are the > same, only the inlinks are differents. > - The ip address is the same. > - When search, all virtualhosts are in the results. > Google only show one of these virtual hosts, the nutch show all. The result > nutch db is larger, and this case slower, than google. > Have any idea, how to remove these duplicates? > Regards, > Ferenc
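The MD5 idea from Stefan's comment can be sketched like this (illustrative class, not Nutch's dedup code, though Nutch's deduplication relies on a similar content digest): identical content served under different virtual hosts yields the same digest, so only one copy needs to be kept:

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: pages fetched from www.origo.hu and origo.hu with identical
// bytes produce identical MD5 digests, so the digest can serve as a
// duplicate-detection key independent of the host name.
public class ContentDigest {
    /** Hex MD5 of the raw page content. */
    public static String md5(byte[] content) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(content);
            return String.format("%032x", new BigInteger(1, d));
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is always available
        }
    }
}
```

As the comment notes, this only helps after fetching: all vhosts still have to be crawled before their contents can be compared, and one differing page per vhost defeats a whole-site shortcut.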
Querying a site by extracting doc information
Hi all, I'm new here :) I have to extract some information from a crawled document in my site (for example: name and surname), using it to build a query for another site (like scholar.google.com) and having the result indexed by Nutch. Can I do this? How? Thank you Rosario Salatiello
[jira] Commented: (NUTCH-44) too many search results
[ http://issues.apache.org/jira/browse/NUTCH-44?page=comments#action_12413155 ] Stefan Neufeind commented on NUTCH-44: -- hi, any progress on this? > too many search results > --- > > Key: NUTCH-44 > URL: http://issues.apache.org/jira/browse/NUTCH-44 > Project: Nutch > Type: Bug > Components: web gui > Environment: web environment > Reporter: Emilijan Mirceski > > There should be a limitation (user defined) on the number of results the > search engine can return. > For example, if one modifies the search url as: > http:///search.jsp?query=&hitsPerPage=2&hitsPerSite=0 > The search will try to return 20,000 pages which isn't good for the server > side performance. > Is it possible to have a setting in the config xml files to control this? > Thanks, > Emilijan -- 
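The requested safeguard amounts to clamping the user-supplied value before querying. The class, method, and limit below are illustrative only; in a real fix the maximum would come from a config property rather than a hard-coded number:

```java
// Sketch of the NUTCH-44 request: never trust hitsPerPage from the
// URL; clamp it to a server-side configured maximum before searching.
public class HitsLimit {
    /** Returns hitsPerPage clamped to the range [1, maxHitsPerPage]. */
    public static int clamp(int hitsPerPage, int maxHitsPerPage) {
        if (hitsPerPage < 1) return 1;
        return Math.min(hitsPerPage, maxHitsPerPage);
    }
}
```

With a configured maximum of, say, 100, a crafted URL asking for 20,000 hits per page would be served at most 100, protecting server-side performance.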
[jira] Created: (NUTCH-283) If the Fetcher times out and abandons Fetcher Threads, severe errors will occur on those Threads
If the Fetcher times out and abandons Fetcher Threads, severe errors will occur on those Threads Key: NUTCH-283 URL: http://issues.apache.org/jira/browse/NUTCH-283 Project: Nutch Type: Bug Components: fetcher Versions: 0.8-dev Reporter: Scott Ganyo Attachments: patch.txt If a Fetcher has chosen to time out and has abandoned outstanding Fetcher Threads, resources that those Fetcher Threads may be using are closed. This naturally causes any abandoned Fetcher Threads to fail when they later attempt to finish up their work in progress. I have a patch that addresses this that I am attaching.
[jira] Updated: (NUTCH-283) If the Fetcher times out and abandons Fetcher Threads, severe errors will occur on those Threads
[ http://issues.apache.org/jira/browse/NUTCH-283?page=all ] Scott Ganyo updated NUTCH-283: -- Attachment: patch.txt > If the Fetcher times out and abandons Fetcher Threads, severe errors will > occur on those Threads > > > Key: NUTCH-283 > URL: http://issues.apache.org/jira/browse/NUTCH-283 > Project: Nutch > Type: Bug > Components: fetcher > Versions: 0.8-dev > Reporter: Scott Ganyo > Attachments: patch.txt > > If a Fetcher has chosen to time out and has abandoned outstanding Fetcher > Threads, resources that those Fetcher Threads may be using are closed. This > naturally causes any abandoned Fetcher Threads to fail when they later > attempt to finish up their work in progress. > I have a patch that addresses this that I am attaching.
Extract infos from documents and query external sites
Hi all, I'm new :) I have to extract some information from an address book in my site (example: names and surnames) and then use it to build queries on sites like scholar.google.com, indexing the result page with my crawler. Can I do it? How? Thank you Rosario Salatiello
Fetcher and MapReduce
Hi, I'm trying to crawl approx. 500.000 urls. After inject and generate I started fetchers using 6 map tasks and 3 reduce tasks. All the map tasks had successfully completed while all the reduce tasks got an OutOfMemory exception. This exception was caught after the append phase (during the sort phase). As far as I observed, during a fetch operation, all the map tasks output to a temp. sequence file. During the reduce operation, each reducer copies all map outputs to its local disk and appends them to a single seq. file. After this operation the reducer tries to sort this file and outputs the sorted file to its local disk. And then, a record writer is opened to write this sorted file to the segment, which is in DFS. If this scenario is correct, then all the reduce tasks are supposed to do the same job. All try to sort the whole map outputs and the winner of this operation will be able to write to dfs. So only one reducer is expected to write to dfs. If this is the case then an OutOfMemory exception will not be surprising for 500.000+ urls, since reducers will try to sort a file bigger than 1 GB. Any comments on this scenario are welcome. And how can I avoid these exceptions? Thanx, -- Hamza KAYA
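One note on the scenario described above: with Hadoop's default hash partitioning, each reduce task copies, appends, and sorts only its own partition of the map output, not the whole data set, so the three reducers should split the work rather than all sorting everything with one "winner". The default partitioner's key-to-reducer assignment is essentially:

```java
// Minimal version of Hadoop's default hash-partitioning logic: each
// key is deterministically assigned to exactly one of the reduce
// tasks, so a reducer only ever sees (and sorts) its own partition.
public class HashPartitionSketch {
    /** Maps a key to one of numReduceTasks partitions. */
    public static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Even so, a third of the fetch output can still exceed a reducer's heap during the sort, so raising the child JVM heap or the number of reduce tasks are the usual remedies.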