Hello, I've developed an extension to Heritrix (the Internet Archive's open source crawler) that allows it to write directly into HDFS. It looks like the developers over there are interested in including it in their project. I've designed it to write SequenceFiles, using the URL as the key and the HTTP response as the value. I've got a couple of questions that I could use a little help with:

1. I can't seem to set the replication factor on a SequenceFile. There's no way to pass it in, and when I call the createWriter factory method and then call FileSystem.setReplication, the file still seems to use the default value. Is there any way to do this, or should I file an enhancement request?

2. It appears that the Configuration class looks for the conf/ directory in the CLASSPATH. This makes it difficult to integrate with Heritrix. For now, I've modified the Heritrix launch script by hardcoding the Hadoop configuration directory into the CLASSPATH. It seems like a better way to go would be to provide a text box on the Heritrix settings page that allows the user to enter the path to the Hadoop configuration directory.

- Doug Judd
[EMAIL PROTECTED]
http://www.zvents.com/
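
To make the idea in question 2 concrete, here is a minimal sketch using only the standard Java library. It assumes a user-supplied directory path (for example, one typed into a settings page) and builds a class loader that searches that directory for resources, so the launch script's CLASSPATH would not need to be edited. The class name ConfDirLoader, the method forConfDir, and the file name hadoop-site.xml are illustrative assumptions, and the sketch does not show how Hadoop would be pointed at the resulting class loader; that part would have to be checked against the Hadoop version in use.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: not part of Heritrix or Hadoop.
final class ConfDirLoader {

    private ConfDirLoader() {}

    /**
     * Returns a class loader that asks the parent first and then searches
     * confDir when resolving resources by name.
     */
    static ClassLoader forConfDir(String confDir, ClassLoader parent)
            throws MalformedURLException {
        Path dir = Paths.get(confDir).toAbsolutePath();
        if (!Files.isDirectory(dir)) {
            throw new IllegalArgumentException("Not a directory: " + dir);
        }
        // For an existing directory, toUri() yields a URI ending in '/',
        // which URLClassLoader requires in order to treat the URL as a
        // directory rather than a JAR file.
        URL url = dir.toUri().toURL();
        return new URLClassLoader(new URL[] { url }, parent);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: ConfDirLoader <conf-dir>");
            return;
        }
        ClassLoader loader =
                forConfDir(args[0], ClassLoader.getSystemClassLoader());
        // Assumed file name; substitute whatever the configuration
        // directory actually contains.
        URL found = loader.getResource("hadoop-site.xml");
        System.out.println(found != null ? "found: " + found : "not found");
    }
}
```

Because the directory is resolved at run time, the same launch script would work for any Hadoop installation; only the path entered by the user changes.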
