I thought I would post the solution I am going to use to fix my
problem. Maybe it will help someone else out as well. You guys have
helped me a lot, so here it is.
I did not see a way to do such an "odd thing" in Nutch, so I decided to
write a chunk of code that would "translate" the url from our
dataset to the final url, which is the one I need to crawl and build
indexes from.
package Driver;

import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

public class Driver {

    public static void main(String[] args) {
        Driver driver = new Driver();
        System.out.println(driver.getFinalURL("http://www.3menandaladyhandyman.smrated.com"));
    }

    // Executes a GET and returns the value of the "Host" request header
    // afterwards, or null if it is not found. After a redirect this appears
    // to hold the host HttpClient ended up on. Note that it is a host name,
    // not a full URL.
    private String getFinalURL(String urlString) {
        String redirectLocation = null;
        HttpClient client = new HttpClient();
        GetMethod method = new GetMethod(urlString);
        // does not seem to work as I thought??
        // Header locationHeader = method.getRequestHeader("Host");
        try {
            client.executeMethod(method);
            // Printed for every response, not only for failures.
            System.out.println("Status: " + method.getStatusLine());
            // Read the request headers before the connection is released.
            Header[] headers = method.getRequestHeaders();
            for (int index = 0; index < headers.length; index++) {
                if (headers[index].getName().equalsIgnoreCase("Host")) {
                    System.out.println("name=>" + headers[index].getName());
                    redirectLocation = headers[index].getValue();
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Always hand the connection back to HttpClient.
            method.releaseConnection();
        }
        return redirectLocation;
    }
}
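For anyone who does not have Commons HttpClient on the classpath, the same "translate the url first" idea can probably be sketched with just the JDK. This is a different technique from the code above: it turns off automatic redirect handling and follows the Location header by hand, so the result is a full URL rather than a host name. The class name RedirectResolver, both method names, the hop limit of 5 and the example.com seed are all made up for illustration; none of this is Nutch API, and I have not run it against our dataset.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.function.Function;

public class RedirectResolver {

    // Follows redirects by hand, starting at startUrl, for at most maxHops hops.
    // locationOf returns the raw Location header for a URL, or null when the
    // URL does not redirect. (Hypothetical helper, not part of Nutch.)
    public static String resolveFinalUrl(String startUrl,
                                         Function<String, String> locationOf,
                                         int maxHops) {
        String current = startUrl;
        for (int hop = 0; hop < maxHops; hop++) {
            String location = locationOf.apply(current);
            if (location == null) {
                return current; // no further redirect
            }
            // Location may be relative, so resolve it against the current URL.
            current = URI.create(current).resolve(location).toString();
        }
        return current; // hop limit reached; return the last URL seen
    }

    // Requests url without following redirects and returns the Location
    // header of a 3xx response, or null otherwise (including on errors).
    public static String httpLocationOf(String url) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setInstanceFollowRedirects(false);
            int status = conn.getResponseCode();
            if (status >= 300 && status < 400) {
                return conn.getHeaderField("Location");
            }
            return null;
        } catch (IOException | IllegalArgumentException e) {
            return null; // treat an unreachable or malformed URL as "no redirect"
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    public static void main(String[] args) {
        // The seed URL is a placeholder; pass a real one as the first argument.
        String seed = args.length > 0 ? args[0] : "http://example.com/";
        System.out.println(resolveFinalUrl(seed, RedirectResolver::httpLocationOf, 5));
    }
}
```

Passing the header lookup in as a function keeps the hop-following loop separate from the network call, so the loop can be tried out with a plain Map of url to Location before pointing it at real sites.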
-----Original Message-----
From: Lukas, Ray [mailto:[email protected]]
Sent: Monday, May 04, 2009 2:14 PM
To: [email protected]
Subject: RE: Re-direct in Nutch does not seem to work
I think that my nutch-site.xml setting will kill re-directs.
I just remembered this:
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>
    Don't go to External Links, just stay in the domain
    that I passed into you
  </description>
</property>
I only want to scan within the domain I requested, unless that url
instantly re-directs me to a different URL, in which case I want to use
only that one. Any thoughts?
Am I understanding this correctly?
Ray
-----Original Message-----
From: Lukas, Ray [mailto:[email protected]]
Sent: Monday, May 04, 2009 1:56 PM
To: [email protected]
Subject: Re-direct in Nutch does not seem to work
Re-direct in Nutch 1.0 does not seem to work.
If I point to a url that is "re-directed to" (the result of a
re-direction), everything works great. If I point to the page that is
re-directing me to the working one, I get a corrupted index.
Can nutch handle re-direction, and if so, what magic is required?
ray