RE: Re-direct in Nutch does not seem to work : solution

Lukas, Ray Mon, 04 May 2009 13:35:53 -0700

 I thought I would post the solution I am going to use to fix my
problem. Maybe it will help someone else out as well. You guys have
helped me a lot.


I did not see a way to do such an "odd thing" in Nutch, so I decided to
write a chunk of code that would "translate" each url from our
dataset to the final url, which is the one I need to crawl and build
indexes from.



package Driver;

import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;


public class Driver {
        public static void main(String[] args) {
                Driver driver = new Driver();
        
                System.out.println(driver.getFinalURL("http://www.3menandaladyhandyman.smrated.com"));
        }
        private String getFinalURL(String urlString){
                String redirectLocation = null;
                HttpClient client = new HttpClient();
                GetMethod method = new GetMethod(urlString);    
        
                // does not seem to work as I thought??
                //Header locationHeader = method.getRequestHeader("Host");
                
                try {
                        client.executeMethod(method);
                        // Printed on every run, so label it as the status rather than a failure.
                        System.out.println("Status: " + method.getStatusLine());
                } catch (Exception e) {
                        e.printStackTrace();
                }
                // Look for the Host header on the request; its value is used as the translated host.
                Header[] headers = method.getRequestHeaders();
                for (int index = 0; index < headers.length; index++) {
                        if (headers[index].getName().equalsIgnoreCase("Host")) {
                                System.out.println("name=>" + headers[index].getName());
                                redirectLocation = headers[index].getValue();
                        }
                }

                return redirectLocation;
        }
}
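
For anyone without Commons HttpClient on the classpath, the same kind of
translation can probably be sketched with only the JDK. This is a rough
sketch, not a tested replacement for the Driver class above: the names
RedirectResolver, resolveLocation, getFinalUrl and MAX_HOPS are made up for
illustration, and the hop limit of 5 is an arbitrary assumption. Unlike the
Driver class, it returns the full final url rather than only the host name.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class RedirectResolver {

    // Hypothetical cap on redirect hops; not taken from Nutch or from the post.
    private static final int MAX_HOPS = 5;

    /** Resolves a Location header value (absolute or relative) against the url that returned it. */
    public static String resolveLocation(String baseUrl, String location) {
        return URI.create(baseUrl).resolve(location).toString();
    }

    /** Follows redirects one hop at a time and returns the last url reached. */
    public static String getFinalUrl(String urlString) throws IOException {
        String current = urlString;
        for (int hop = 0; hop < MAX_HOPS; hop++) {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(current).toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // handle each hop ourselves
            try {
                int status = conn.getResponseCode();
                String location = conn.getHeaderField("Location");
                boolean redirect = status >= 300 && status < 400 && location != null;
                if (!redirect) {
                    return current; // no further redirect: this is the final url
                }
                current = resolveLocation(current, location);
            } finally {
                conn.disconnect();
            }
        }
        return current; // stopped after MAX_HOPS; the caller may want to treat this as an error
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java RedirectResolver <url>");
            return;
        }
        System.out.println(getFinalUrl(args[0]));
    }
}
```

The translated urls could then be used as the seed list, so that the crawl
starts on the host the site actually lives on.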

-----Original Message-----
From: Lukas, Ray [mailto:[email protected]] 
Sent: Monday, May 04, 2009 2:14 PM
To: [email protected]
Subject: RE: Re-direct in Nutch does not seem to work

I think that my nutch-site.xml setting will kill re-directs.
I just remembered this:

        <property>
        <name>db.ignore.external.links</name>
        <value>true</value>
        <description>
        Don't go to External Links, just stay in the domain 
        that I passed into you
        </description>
        </property> 

I only want to scan within the domain I requested, unless that url
instantly re-directs me to a different URL, in which case I want to use
only that one. Any thoughts?
Am I understanding this correctly?

Ray
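
To make the concern concrete: if the setting treats a redirect target like
any other outlink (an assumption; nothing in this thread confirms how Nutch
applies it), then a redirect to a different host would fail a same-host
comparison and be dropped. The sketch below is not Nutch code. SameHostCheck
and sameHost are hypothetical names, and the example.com / example.org urls
are placeholders.

```java
import java.net.URI;

public class SameHostCheck {

    /** Returns true when both urls name the same host, ignoring case. */
    public static boolean sameHost(String fromUrl, String toUrl) {
        String fromHost = URI.create(fromUrl).getHost();
        String toHost = URI.create(toUrl).getHost();
        // A missing host on either side counts as "not the same host".
        return fromHost != null && fromHost.equalsIgnoreCase(toHost);
    }

    public static void main(String[] args) {
        // Link that stays on the requested host: would be kept.
        System.out.println(sameHost("http://example.com/", "http://example.com/about"));
        // Redirect to another host: would look like an external link.
        System.out.println(sameHost("http://example.com/", "http://www.example.org/"));
    }
}
```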

-----Original Message-----
From: Lukas, Ray [mailto:[email protected]] 
Sent: Monday, May 04, 2009 1:56 PM
To: [email protected]
Subject: Re-direct in Nutch does not seem to work


 Re-direct in Nutch 1.0 does not seem to work.
If I point to a url that is "re-directed to" (the result of a
re-direction), everything works great. If I point to the page that is
re-directing me to the working one, I get a corrupted index.
Can Nutch handle re-direction, and if so, what magic is required?

ray
