[jira] Created: (NUTCH-462) Noarchive urls are available via the cache link

2007-03-19 Thread Steve Severance (JIRA)
Noarchive urls are available via the cache link
---

 Key: NUTCH-462
 URL: https://issues.apache.org/jira/browse/NUTCH-462
 Project: Nutch
  Issue Type: Bug
  Components: web gui
Reporter: Steve Severance
 Fix For: 0.8.1


If a robots.txt file specifies a Noarchive statement, then URLs that are 
contained within that path should not be available via the cached link.

For example, Noarchive:/ means that no pages should be available via the cached 
link.
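
The intended behaviour can be sketched as a simple path-prefix check. This is only an illustration in Python, not Nutch code: the function names are hypothetical, Noarchive is treated as a non-standard robots.txt line of the form "Noarchive: <path prefix>", and User-agent grouping is ignored for brevity.

```python
from typing import List
from urllib.parse import urlsplit


def parse_noarchive_prefixes(robots_txt: str) -> List[str]:
    """Collect the path prefixes named on 'Noarchive:' lines.

    Assumption: every Noarchive line applies to all agents.
    """
    prefixes = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "noarchive" and value.strip():
            prefixes.append(value.strip())
    return prefixes


def cache_link_allowed(url: str, prefixes: List[str]) -> bool:
    """Return False when the URL path falls under any Noarchive prefix."""
    path = urlsplit(url).path or "/"  # treat an empty path as the root
    return not any(path.startswith(prefix) for prefix in prefixes)
```

With a check like this, the web GUI would hide the cached link whenever cache_link_allowed returns False for a hit's URL.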

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Issues pending before 0.9 release

2007-03-19 Thread Andrzej Bialecki

Hi all,

I just committed Hadoop 0.12.1. Let's double-check that it works ok. 
Here's the list of Critical/Blocker issues I mentioned before, and their 
current status:


NUTCH-400   Fixed.
NUTCH-353   Moved to Major, fix after release.
NUTCH-233   Fixed.
NUTCH-436   Fixed.
NUTCH-427   Moved to Major, fix after release.
NUTCH-381   Won't fix - this is a configuration issue.
NUTCH-277   Cannot reproduce.
NUTCH-167   Fixed.

Any other stuff we need to fix before the release?

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Closed: (NUTCH-450) How to set up nutch

2007-03-19 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-450.
---

Resolution: Invalid
  Assignee: Andrzej Bialecki 

This belongs on the nutch-user mailing list; please seek help there.

> How to set up nutch
> ---
>
> Key: NUTCH-450
> URL: https://issues.apache.org/jira/browse/NUTCH-450
> Project: Nutch
>  Issue Type: Task
>  Components: administration gui
> Environment: Windows XP
>Reporter: Sandya S Murthy
> Assigned To: Andrzej Bialecki 
>
> I have followed the instructions given in the Nutch tutorial to set up Nutch.
> I installed J2SDK, Tomcat 4.x, and Cygwin, and then downloaded Nutch 
> version 0.7.2, but I didn't understand how to connect Nutch to the root 
> directory or how to set up the downloaded Nutch folder.
> Please help.




[jira] Updated: (NUTCH-451) Tool to recover partial fetcher output

2007-03-19 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-451:


Priority: Minor  (was: Major)

> Tool to recover partial fetcher output
> --
>
> Key: NUTCH-451
> URL: https://issues.apache.org/jira/browse/NUTCH-451
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 0.9.0
>Reporter: Andrzej Bialecki 
> Assigned To: Andrzej Bialecki 
>Priority: Minor
> Fix For: 0.9.0
>
> Attachments: LocalFetchRecover-0.8.1.java, LocalFetchRecover.java
>
>
> This class may help you to recover partial data from a failed Fetcher run. 
> NOTE 1: this works ONLY if you ran Fetcher using "local" file system, i.e. 
> you didn't use DFS - partial output to DFS is permanently lost if a process 
> fails to properly close the output streams.
> NOTE 2: if Fetcher was stopped abruptly (killed or crashed), then partial 
> SequenceFile-s will be corrupted at the end. This means that it won't be 
> possible to recover all data from them - most likely only the data up to the 
> last sync marker can be recovered.
> The recovery process requires some preparation: 
> * determine the map directories corresponding to the map task outputs of the 
> failed job. These map directories contain SequenceFile-s consisting of pairs 
> of , named e.g. part-0.out, or file.out, or spill0.out.
> * create the new input directory, let's say input/. Copy all SequenceFile-s 
> into this directory, renaming them sequentially like this: 
>   input/part-0
>   input/part-1
>   input/part-2
>   input/part-3
>   ...
>   
> * specify the "input" directory as the input to this tool. 
> If all goes well, a new segment will be created as a subdirectory of the 
> output dir.
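
The preparation steps above (collect the SequenceFile-s, copy them into a fresh input directory, rename them sequentially) can be sketched as a small helper. This is a hedged illustration in Python, not part of the attached tool: the function name stage_inputs is hypothetical, and the caller is assumed to have already located the partial map output files by hand.

```python
import shutil
from pathlib import Path


def stage_inputs(sequence_files, input_dir):
    """Copy each file into input_dir as part-0, part-1, ... in the given order.

    sequence_files: paths to the partial map outputs (located manually).
    Returns the list of destination paths.
    """
    dest_root = Path(input_dir)
    dest_root.mkdir(parents=True, exist_ok=True)
    staged = []
    for index, source in enumerate(sequence_files):
        target = dest_root / f"part-{index}"
        shutil.copyfile(source, target)  # copy, so the originals stay untouched
        staged.append(target)
    return staged
```

Copying rather than moving keeps the original map directories intact in case the recovery has to be repeated.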




[jira] Updated: (NUTCH-353) pages that serverside forwards will be refetched every time

2007-03-19 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-353:


Priority: Major  (was: Blocker)

This is partially fixed so that page status is consistent. LinkDb related 
changes will be implemented later.

> pages that serverside forwards will be refetched every time
> ---
>
> Key: NUTCH-353
> URL: https://issues.apache.org/jira/browse/NUTCH-353
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 0.8.1, 0.9.0
>Reporter: Stefan Groschupf
> Assigned To: Andrzej Bialecki 
> Fix For: 0.9.0
>
> Attachments: doNotRefecthForwarderPagesV1.patch
>
>
> Pages that do a serverside forward are not written with a status change back 
> into the crawlDb. Also the nextFetchTime is not changed. 
> This causes the same page to be refetched again and again. As a result, Nutch 
> is not polite: it refetches the forwarding and target pages in each segment 
> iteration. It also affects the scoring, since the forwarding page contributes 
> its score to all outlinks.




[jira] Closed: (NUTCH-459) Upgrade Nutch to Hadoop 0.12.1

2007-03-19 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-459.
---

Resolution: Fixed

Upgraded to 0.12.1 release.

> Upgrade Nutch to Hadoop 0.12.1
> --
>
> Key: NUTCH-459
> URL: https://issues.apache.org/jira/browse/NUTCH-459
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 0.9.0
> Environment: All platforms
>Reporter: Dennis Kubes
> Assigned To: Dennis Kubes
> Fix For: 0.9.0
>
> Attachments: hadoop-0.12.1-dev-core.jar
>
>
> This JIRA contains the new hadoop-0.12.1-dev-core.jar as of revision 518636.  
> As far as I can tell this jar doesn't break any of the current Nutch trunk 
> code as of revision 517382.




[jira] Closed: (NUTCH-277) Fetcher dies because of "max. redirects" (avoiding infinite loop)

2007-03-19 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-277.
---

   Resolution: Cannot Reproduce
Fix Version/s: 0.9.0
 Assignee: Andrzej Bialecki 

Cannot reproduce this. If the problem reappears please create a new issue.

> Fetcher dies because of "max. redirects" (avoiding infinite loop)
> -
>
> Key: NUTCH-277
> URL: https://issues.apache.org/jira/browse/NUTCH-277
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 0.8
> Environment: nightly-2006-05-20
>Reporter: Stefan Neufeind
> Assigned To: Andrzej Bialecki 
>Priority: Critical
> Fix For: 0.9.0
>
>
> Error in the logs is:
> 060521 213401 SEVERE Narrowly avoided an infinite loop in execute
> org.apache.commons.httpclient.RedirectException: Maximum redirects (100) 
> exceeded
> at 
> org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:183)
> at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
> at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
> at 
> org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:87)
> at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:97)
> at 
> org.apache.nutch.protocol.http.api.RobotRulesParser.isAllowed(RobotRulesParser.java:394)
> at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:173)
> at 
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:135)
> This happens during normal crawling. Unfortunately I don't know how to 
> further track this down. But it's problematic, since it actually makes the 
> fetcher die.
> Workaround (for the symptom) is in NUTCH-258 (avoid dying on SEVERE 
> logentry). That works for me, crawling works fine and it does not hang/crash. 
>  However this is working around the problems not solving them - I know. But 
> it helps for the moment ...
> Hope somebody can help - this looks quite important to track down to me.




[jira] Closed: (NUTCH-381) Ignore external link not work as expected

2007-03-19 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-381.
---

   Resolution: Won't Fix
Fix Version/s: 0.9.0
 Assignee: Andrzej Bialecki 

This was caused by following redirected pages immediately in Fetcher. Set 
http.redirect.max to 0 to avoid this problem.

> Ignore external link not work as expected
> -
>
> Key: NUTCH-381
> URL: https://issues.apache.org/jira/browse/NUTCH-381
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 0.8.1
>Reporter: Uros Gruber
> Assigned To: Andrzej Bialecki 
>Priority: Critical
> Fix For: 0.9.0
>
>
> Currently there is no way to properly limit the fetcher without regexp rules. We 
> use the ignore.external.link option, but it seems that it doesn't work in all 
> cases.
> Here are example URLs I'm seeing, although
> cat urls1 urls2 urls3 urls/urls |grep yahoo.com doesn't return any hit. 
> fetching http://help.yahoo.com/help/sports
> fetching http://www.turkish-xxx.com/adult-traffic-trade.php
> fetching http://help.yahoo.com/help/us/astr/
> fetching http://www.polish-xxx.com/de-index.html
> fetching http://www.driversplanet.com/Articles/Software/SpareBackup2.4.aspx
> fetching http://help.yahoo.com/help/groups
> fetching http://help.yahoo.com/help/fin/
> fetching 
> http://www.driversplanet.com/Articles/Software/WindowsStorageServer2003R2.aspx
> fetching http://help.yahoo.com/help/us/edit/
> fetching http://www.polish-xxx.com/es-index.html
> Anyone notice this?
> I assume this has something to do with expired domains where pages are 
> generated randomly. But still, why were URLs from other domains added? Maybe 
> urlregexp filter +* exclude.




[jira] Commented: (NUTCH-381) Ignore external link not work as expected

2007-03-19 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12482266
 ] 

Andrzej Bialecki  commented on NUTCH-381:
-

Your last comment confirms my suspicions. After analysis of the code in Fetcher 
I can confirm that this indeed is the effect of handling redirects immediately 
- Fetcher doesn't check if the URLs we redirect to belong to the same host.

The solution is to disable immediate redirects (set http.redirect.max to 0 in 
your configuration).
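
For illustration, the kind of same-host check that is missing when redirects are followed immediately could look like the sketch below. It is written in Python rather than taken from Fetcher, and the function name same_host is hypothetical; the recommended fix remains setting http.redirect.max to 0.

```python
from urllib.parse import urlsplit


def same_host(original_url: str, redirect_url: str) -> bool:
    """Return True only when both URLs name the same host.

    urlsplit(...).hostname is lower-cased, so the comparison is
    case-insensitive; a URL without a host never matches.
    """
    original_host = urlsplit(original_url).hostname
    redirect_host = urlsplit(redirect_url).hostname
    return original_host is not None and original_host == redirect_host
```

A fetcher that honours an "ignore external links" setting would skip, or defer to the normal filtering path, any redirect target for which this returns False.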





[jira] Created: (NUTCH-461) microformats-reltag plugin and relative links

2007-03-19 Thread Jerome Charron (JIRA)
microformats-reltag plugin and relative links
-

 Key: NUTCH-461
 URL: https://issues.apache.org/jira/browse/NUTCH-461
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8.1, 0.7.2, 0.8.2, 0.9.0
Reporter: Jerome Charron


The microformats-reltag plugin doesn't extract tags from relative URLs.
In fact, the code tries to construct a valid URL from the href. If the href is 
relative, the URL construction fails and the tag is not extracted.

Solution: Simply use a fake base for URL construction.
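
The "fake base" idea can be sketched as follows. This is a Python illustration, not the plugin's Java code: the names FAKE_BASE and extract_rel_tag are hypothetical, and the tag is assumed to be the last path segment of the href, as in the rel-tag microformat. In Java the equivalent failure is a MalformedURLException when a relative string is passed to a URL constructor without a context URL.

```python
from urllib.parse import unquote, urljoin, urlsplit

# Placeholder base used only to resolve relative hrefs; it is never fetched.
FAKE_BASE = "http://fake.base.invalid/"


def extract_rel_tag(href: str):
    """Return the tag (last path segment) of a rel-tag href, or None."""
    absolute = urljoin(FAKE_BASE, href)  # absolute hrefs pass through unchanged
    path = urlsplit(absolute).path.rstrip("/")
    if not path:
        return None
    return unquote(path.rsplit("/", 1)[-1]) or None
```

Because only the path is inspected, the choice of base does not affect the extracted tag.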





RE: Launching custom classes

2007-03-19 Thread Steve Severance
> -Original Message-
> From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
> Sent: Monday, March 19, 2007 10:18 AM
> To: nutch-dev@lucene.apache.org
> Subject: Re: Launching custom classes
> 
> Steve Severance wrote:
> > Hi all,
> > I have a custom class in the nutch jar. Everything works fine in
> eclipse but
> > when I try to run it from the command line using bin/nutch it throws
> the
> > java.lang.NoClassDefFoundError. All the pages on the internet
> helpfully
> > suggested that I make sure that the jar is in the classpath. I think
> that
> >
> 
> What needs to be on your classpath is the *.job jar. The bin/nutch
> script takes care of that if you built your Nutch using the command-
> line
> version of ant.

Ok. Thanks. 2 more things. I have 2 directories for nutch, 1 is synchronized 
with SVN and the other is my working directory. If I run the ant package 
command in my working directory ant says 
BUILD FAILED
g:\NutchInstance\build.xml:61: Specify at least one source--a file or resource 
collection.

Total time: 0 seconds

If I copy my source folder into the trunk dir of my directory that is synced 
with SVN, my class does not get added. I have been studying the build.xml file 
and I see the plugin generation jobs, but my reasoning is that since my package 
name is org.apache.nutch. it should be compiled into the core. Is this 
correct? Do I need to make a separate build job for my class or something like 
that?

Second, how do people generally set up their development machines? Do you use 
Eclipse, and if so, do you just work off of the trunk? What is the 
recommendation for source control in this situation? Is there a way to make a 
Subversion repository for myself so that I can add my own code but also receive 
updates from the trunk? Using an open source project like this seems to add 
some complexity to the source control process, but I am sure this problem has 
already been worked out.

Regards,

Steve





Re: Launching custom classes

2007-03-19 Thread Andrzej Bialecki

Steve Severance wrote:

> Hi all,
> I have a custom class in the nutch jar. Everything works fine in eclipse but
> when I try to run it from the command line using bin/nutch it throws the
> java.lang.NoClassDefFoundError. All the pages on the internet helpfully
> suggested that I make sure that the jar is in the classpath. I think that
>


What needs to be on your classpath is the *.job jar. The bin/nutch 
script takes care of that if you built your Nutch using the command-line 
version of ant.






Launching custom classes

2007-03-19 Thread Steve Severance
Hi all,
I have a custom class in the nutch jar. Everything works fine in eclipse but
when I try to run it from the command line using bin/nutch it throws the
java.lang.NoClassDefFoundError. All the pages on the internet helpfully
suggested that I make sure that the jar is in the classpath. I think that
everything is correct since I can invoke any of the nutch classes via its
class name e.g. bin/nutch org.apache.nutch.crawl.Crawl. This may be a simple
Java problem but I have been banging my head against this all weekend.

Thanks,

Steve