[jira] Created: (NUTCH-462) Noarchive urls are available via the cache link
Noarchive urls are available via the cache link
-----------------------------------------------

Key: NUTCH-462
URL: https://issues.apache.org/jira/browse/NUTCH-462
Project: Nutch
Issue Type: Bug
Components: web gui
Reporter: Steve Severance
Fix For: 0.8.1

If a robots.txt file specifies a Noarchive statement, then URLs that are contained in that path should not be available via the cached link. For example, Noarchive:/ means that no pages should be available via the cached link.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
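The behaviour requested in NUTCH-462 can be sketched as a simple path-prefix test. This is only an illustration, not Nutch code: the function name is_cache_allowed and the noarchive_prefixes argument are hypothetical, and it assumes that Noarchive paths are matched as prefixes in the same way as Disallow paths.

```python
from urllib.parse import urlsplit


def is_cache_allowed(url, noarchive_prefixes):
    """Return False when the URL's path falls under any Noarchive path prefix.

    Assumption: Noarchive values are path prefixes, matched like Disallow values.
    """
    path = urlsplit(url).path or "/"  # a bare host is treated as the root path
    # Empty prefixes are skipped so that an empty rule does not block everything.
    return not any(prefix and path.startswith(prefix) for prefix in noarchive_prefixes)


print(is_cache_allowed("http://example.com/docs/a.html", ["/"]))
print(is_cache_allowed("http://example.com/docs/a.html", ["/private/"]))
```

With such a check the web gui would only render the cached link when the function returns True for the hit's URL.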
Re: Issues pending before 0.9 release
Hi all,

I just committed Hadoop 0.12.1. Let's double-check that it works ok.

Here's the list of Critical/Blocker issues I mentioned before, and their current status:

NUTCH-400 Fixed.
NUTCH-353 Moved to Major, fix after release.
NUTCH-233 Fixed.
NUTCH-436 Fixed.
NUTCH-427 Moved to Major, fix after release.
NUTCH-381 Won't fix - this is a configuration issue.
NUTCH-277 Cannot reproduce.
NUTCH-167 Fixed.

Any other stuff we need to fix before the release?

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
[jira] Closed: (NUTCH-450) How to set up nutch
[ https://issues.apache.org/jira/browse/NUTCH-450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki closed NUTCH-450.
---------------------------------

Resolution: Invalid
Assignee: Andrzej Bialecki

This belongs on the nutch-user mailing list; please seek help there.

> How to set up nutch
> -------------------
>
> Key: NUTCH-450
> URL: https://issues.apache.org/jira/browse/NUTCH-450
> Project: Nutch
> Issue Type: Task
> Components: administration gui
> Environment: Windows XP
> Reporter: Sandya S Murthy
> Assigned To: Andrzej Bialecki
>
> I have followed the instructions given in the nutch tutorial to set up nutch. I installed J2sdk, tomcat 4.x, and cygwin, and after that I downloaded nutch version 0.7.2. But I didn't understand how to connect this nutch to the root directory, or how to set up the downloaded nutch folder.
> Please help.
[jira] Updated: (NUTCH-451) Tool to recover partial fetcher output
[ https://issues.apache.org/jira/browse/NUTCH-451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki updated NUTCH-451:
-----------------------------------

Priority: Minor (was: Major)

> Tool to recover partial fetcher output
> --------------------------------------
>
> Key: NUTCH-451
> URL: https://issues.apache.org/jira/browse/NUTCH-451
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 0.9.0
> Reporter: Andrzej Bialecki
> Assigned To: Andrzej Bialecki
> Priority: Minor
> Fix For: 0.9.0
>
> Attachments: LocalFetchRecover-0.8.1.java, LocalFetchRecover.java
>
> This class may help you to recover partial data from a failed Fetcher run.
>
> NOTE 1: this works ONLY if you ran Fetcher using the "local" file system, i.e. you didn't use DFS - partial output to DFS is permanently lost if a process fails to properly close the output streams.
>
> NOTE 2: if Fetcher was stopped abruptly (killed or crashed), then partial SequenceFile-s will be corrupted at the end. This means that it won't be possible to recover all data from them - most likely only the data up to the last sync marker can be recovered.
>
> The recovery process requires some preparation:
> * determine the map directories corresponding to the map task outputs of the failed job. These map directories contain SequenceFile-s consisting of key/value pairs, named e.g. part-0.out, or file.out, or spill0.out.
> * create the new input directory, let's say input/. Copy all SequenceFile-s into this directory, renaming them sequentially like this:
>   input/part-0
>   input/part-1
>   input/part-2
>   input/part-3
>   ...
> * specify the "input" directory as the input to this tool.
>
> If all goes well, a new segment will be created as a subdirectory of the output dir.
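The preparation steps described in NUTCH-451 (collect the map output files and copy them into input/ under sequential names) could be scripted roughly as follows. This is a minimal sketch and not part of the attached tool: the helper name stage_map_outputs is hypothetical, it assumes that every file of interest ends in .out, and the demo uses dummy files rather than real SequenceFile-s.

```python
import shutil
import tempfile
from pathlib import Path


def stage_map_outputs(map_dirs, input_dir):
    """Copy every *.out file found in map_dirs into input_dir as part-0, part-1, ..."""
    input_dir = Path(input_dir)
    input_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    index = 0
    for map_dir in map_dirs:
        # Sort so that the numbering is reproducible between runs.
        for source in sorted(Path(map_dir).glob("*.out")):
            target = input_dir / f"part-{index}"
            shutil.copyfile(source, target)  # copy, never move: the originals stay intact
            staged.append(target.name)
            index += 1
    return staged


# Demo on throwaway directories with dummy files (not real SequenceFile-s).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("map_0/part-0.out", "map_1/file.out", "map_1/spill0.out"):
        path = root / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(b"dummy")
    staged = stage_map_outputs([root / "map_0", root / "map_1"], root / "input")
    print(staged)
```

Copying instead of moving keeps the original map outputs available in case the recovery has to be repeated.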
[jira] Updated: (NUTCH-353) pages that serverside forwards will be refetched every time
[ https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki updated NUTCH-353:
-----------------------------------

Priority: Major (was: Blocker)

This is partially fixed so that page status is consistent. LinkDb related changes will be implemented later.

> pages that serverside forwards will be refetched every time
> -----------------------------------------------------------
>
> Key: NUTCH-353
> URL: https://issues.apache.org/jira/browse/NUTCH-353
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 0.8.1, 0.9.0
> Reporter: Stefan Groschupf
> Assigned To: Andrzej Bialecki
> Fix For: 0.9.0
>
> Attachments: doNotRefecthForwarderPagesV1.patch
>
> Pages that do a serverside forward are not written back into the crawlDb with a status change. Also, the nextFetchTime is not changed.
> This causes a refetch of the same page again and again. The result is that nutch is not polite and refetches the forwarding and target page in each segment iteration. It also affects the scoring, since the forward page contributes its score to all outlinks.
[jira] Closed: (NUTCH-459) Upgrade Nutch to Hadoop 0.12.1
[ https://issues.apache.org/jira/browse/NUTCH-459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki closed NUTCH-459.
---------------------------------

Resolution: Fixed

Upgraded to 0.12.1 release.

> Upgrade Nutch to Hadoop 0.12.1
> ------------------------------
>
> Key: NUTCH-459
> URL: https://issues.apache.org/jira/browse/NUTCH-459
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 0.9.0
> Environment: All platforms
> Reporter: Dennis Kubes
> Assigned To: Dennis Kubes
> Fix For: 0.9.0
>
> Attachments: hadoop-0.12.1-dev-core.jar
>
> This JIRA contains the new hadoop-0.12.1-dev-core.jar as of revision 518636. As far as I can tell, this jar doesn't break any of the current Nutch trunk code as of revision 517382.
[jira] Closed: (NUTCH-277) Fetcher dies because of "max. redirects" (avoiding infinite loop)
[ https://issues.apache.org/jira/browse/NUTCH-277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki closed NUTCH-277.
---------------------------------

Resolution: Cannot Reproduce
Fix Version/s: 0.9.0
Assignee: Andrzej Bialecki

Cannot reproduce this. If the problem reappears, please create a new issue.

> Fetcher dies because of "max. redirects" (avoiding infinite loop)
> -----------------------------------------------------------------
>
> Key: NUTCH-277
> URL: https://issues.apache.org/jira/browse/NUTCH-277
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 0.8
> Environment: nightly-2006-05-20
> Reporter: Stefan Neufeind
> Assigned To: Andrzej Bialecki
> Priority: Critical
> Fix For: 0.9.0
>
> Error in the logs is:
> 060521 213401 SEVERE Narrowly avoided an infinite loop in execute
> org.apache.commons.httpclient.RedirectException: Maximum redirects (100) exceeded
>     at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:183)
>     at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
>     at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
>     at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:87)
>     at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:97)
>     at org.apache.nutch.protocol.http.api.RobotRulesParser.isAllowed(RobotRulesParser.java:394)
>     at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:173)
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:135)
>
> This happens during normal crawling. Unfortunately I don't know how to track this down further. But it's problematic, since it actually makes the fetcher die.
> Workaround (for the symptom) is in NUTCH-258 (avoid dying on SEVERE log entry). That works for me, crawling works fine and it does not hang/crash. However, this is working around the problems, not solving them - I know. But it helps for the moment ...
> Hope somebody can help - this looks quite important to track down to me.
[jira] Closed: (NUTCH-381) Ignore external link not work as expected
[ https://issues.apache.org/jira/browse/NUTCH-381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki closed NUTCH-381.
---------------------------------

Resolution: Won't Fix
Fix Version/s: 0.9.0
Assignee: Andrzej Bialecki

This was caused by following redirected pages immediately in Fetcher. Set http.redirect.max to 0 to avoid this problem.

> Ignore external link not work as expected
> -----------------------------------------
>
> Key: NUTCH-381
> URL: https://issues.apache.org/jira/browse/NUTCH-381
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 0.8.1
> Reporter: Uros Gruber
> Assigned To: Andrzej Bialecki
> Priority: Critical
> Fix For: 0.9.0
>
> Currently there is no way to properly limit the fetcher without regexp rules. We use the ignore.external.link option, but it seems that it doesn't work in all cases.
> Here are example urls I'm seeing, but
> cat urls1 urls2 urls3 urls/urls |grep yahoo.com
> doesn't return any hit.
> fetching http://help.yahoo.com/help/sports
> fetching http://www.turkish-xxx.com/adult-traffic-trade.php
> fetching http://help.yahoo.com/help/us/astr/
> fetching http://www.polish-xxx.com/de-index.html
> fetching http://www.driversplanet.com/Articles/Software/SpareBackup2.4.aspx
> fetching http://help.yahoo.com/help/groups
> fetching http://help.yahoo.com/help/fin/
> fetching http://www.driversplanet.com/Articles/Software/WindowsStorageServer2003R2.aspx
> fetching http://help.yahoo.com/help/us/edit/
> fetching http://www.polish-xxx.com/es-index.html
> Has anyone noticed this?
> I assume it has something to do with expired domains where pages are generated randomly. But still, why were urls from other domains added? Maybe urlregexp filter +* exclude.
[jira] Commented: (NUTCH-381) Ignore external link not work as expected
[ https://issues.apache.org/jira/browse/NUTCH-381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12482266 ]

Andrzej Bialecki commented on NUTCH-381:
----------------------------------------

Your last comment confirms my suspicions. After analysis of the code in Fetcher I can confirm that this indeed is the effect of handling redirects immediately - Fetcher doesn't check if the URLs we redirect to belong to the same host.

The solution is to disable immediate redirects (set http.redirect.max to 0 in your configuration).
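The check that the comment says is missing, namely whether a redirect target belongs to the same host as the page that redirected to it, might look like the sketch below. It is an illustration only and not the Fetcher code: the function names and the ignore_external_links parameter are hypothetical, and hosts are compared by exact (case-insensitive) name.

```python
from urllib.parse import urlsplit


def same_host(url_a, url_b):
    """Return True when both URLs name the same host (hostname is lowercased by urlsplit)."""
    host_a = urlsplit(url_a).hostname
    host_b = urlsplit(url_b).hostname
    return host_a is not None and host_a == host_b


def should_follow_redirect(source_url, target_url, ignore_external_links=True):
    """Decide whether an immediate redirect may be followed.

    ignore_external_links is a hypothetical stand-in for the option named in the issue.
    """
    if not ignore_external_links:
        return True  # no restriction configured, follow any redirect
    return same_host(source_url, target_url)


print(should_follow_redirect("http://www.example.com/a", "http://help.yahoo.com/help/sports"))
print(should_follow_redirect("http://www.example.com/a", "http://www.example.com/b"))
```

Until such a check exists, setting http.redirect.max to 0 as described above avoids the problem by not following redirects immediately at all.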
[jira] Created: (NUTCH-461) microformats-reltag plugin and relative links
microformats-reltag plugin and relative links
---------------------------------------------

Key: NUTCH-461
URL: https://issues.apache.org/jira/browse/NUTCH-461
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8.1, 0.7.2, 0.8.2, 0.9.0
Reporter: Jerome Charron

The microformats-reltag plugin doesn't extract tags from relative URLs. The code tries to construct a valid URL from the href. If the href is relative, the URL construction fails and the tag is not extracted.

Solution: Simply use a fake base for URL construction.
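The proposed solution can be illustrated with a short sketch: resolve the href against a placeholder base so that relative links parse like absolute ones, then take the tag from the path. This is not the plugin's code. The FAKE_BASE value and the function name extract_rel_tag are hypothetical, and the sketch assumes the usual rel-tag convention that the tag is the last segment of the URL path.

```python
from urllib.parse import unquote, urljoin, urlsplit

# Hypothetical placeholder base, used only so that relative hrefs can be parsed.
FAKE_BASE = "http://fake.base.invalid/"


def extract_rel_tag(href):
    """Return the tag named by a rel="tag" link, or None if the path has no segment."""
    absolute = urljoin(FAKE_BASE, href)  # absolute hrefs pass through unchanged
    path = urlsplit(absolute).path.rstrip("/")
    tag = unquote(path.rsplit("/", 1)[-1])  # last path segment, percent-decoded
    return tag or None


print(extract_rel_tag("/tags/nutch"))
print(extract_rel_tag("http://technorati.com/tag/lucene/"))
```

Because only the path is inspected, the host of the fake base never appears in the extracted tag.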
RE: Launching custom classes
> -----Original Message-----
> From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
> Sent: Monday, March 19, 2007 10:18 AM
> To: nutch-dev@lucene.apache.org
> Subject: Re: Launching custom classes
>
> Steve Severance wrote:
> > Hi all,
> > I have a custom class in the nutch jar. Everything works fine in eclipse but when I try to run it from the command line using bin/nutch it throws the java.lang.NoClassDefFoundError. All the pages on the internet helpfully suggested that I make sure that the jar is in the classpath. I think that
> >
>
> What needs to be on your classpath is the *.job jar. The bin/nutch script takes care of that if you built your Nutch using the command-line version of ant.

Ok. Thanks. 2 more things.

I have 2 directories for nutch: 1 is synchronized with SVN and the other is my working directory. If I run the ant package command in my working directory, ant says:

BUILD FAILED
g:\NutchInstance\build.xml:61: Specify at least one source--a file or resource collection.
Total time: 0 seconds

If I copy my source folder into the trunk dir of the directory that is synced with SVN, my class does not get added. I have been studying the build.xml file and I see the plugin generation jobs, but my reasoning is that, since my package name is org.apache.nutch., it should be compiled into the core. Is this correct? Do I need to make a separate build job for my class or something like that?

Second, how do people generally set up their development machines? Do you use Eclipse? If so, do you just work off of the trunk or what? What is the recommendation for source control in this situation? Is there a way to make a subversion repository for me so that I can add my own code but also receive updates from the trunk? Using an open source project like this seems to add some complexity to the source control process. But I am sure this problem has already been worked out.

Regards,
Steve
Re: Launching custom classes
Steve Severance wrote:
> Hi all,
> I have a custom class in the nutch jar. Everything works fine in eclipse but when I try to run it from the command line using bin/nutch it throws the java.lang.NoClassDefFoundError. All the pages on the internet helpfully suggested that I make sure that the jar is in the classpath. I think that

What needs to be on your classpath is the *.job jar. The bin/nutch script takes care of that if you built your Nutch using the command-line version of ant.

--
Best regards,
Andrzej Bialecki
Launching custom classes
Hi all,

I have a custom class in the nutch jar. Everything works fine in eclipse but when I try to run it from the command line using bin/nutch it throws the java.lang.NoClassDefFoundError. All the pages on the internet helpfully suggested that I make sure that the jar is in the classpath. I think that everything is correct since I can invoke any of the nutch classes via its class name, e.g. bin/nutch org.apache.nutch.crawl.Crawl.

This may be a simple Java problem but I have been banging my head against this all weekend.

Thanks,
Steve