Hello,
 
I have some good news to bring back to the community, specifically for anyone 
using FreeBSD and Nutch trunk (Hadoop 0.9.x). I have hacked through the 
makefiles of the Hadoop native compression libs and gotten them to compile and 
work on my FreeBSD box.
 
Now, it's just a start, and it's certainly a dirty hack job even by my 
standards, but it works, and I think that's the most important fact of all. 
Most of it can probably be streamlined, but I didn't know many of the 
compile-time settings before I actually started compiling; I just dealt with 
each error as it came up.
 
I will either need to modify the configure script, have a set of environment 
variables the user must set before compiling (e.g. JAVA_HOME, JVM_DATA_MODEL), 
or think of something else really smart to make this easier for us all.
 
But let's first go over what you will definitely need to compile this from source. 
The following ports (or binaries) will need to be installed:
 
gmake-3.81_1 (FreeBSD core "make" will not work)
autoconf-2.59_2
diablo-jdk-1.5.0.07.01_1
libtool-1.5.22_2
m4-1.4.4
 
The exact version numbers may vary, since ports are constantly updated, but as 
long as yours are the same or newer than those, you should be good.
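 
If you still need to install these, here is a rough sketch of one way to do it. 
The package names below (gmake, autoconf259, libtool, m4) are my best guess at 
the current port names and may differ in your tree:
 
pkg_add -r gmake autoconf259 libtool m4
 
The Diablo JDK usually can't be fetched with pkg_add because of its license, so 
expect to build it from the java/diablo-jdk15 port, possibly downloading the 
distfile by hand.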
 
The following environment variables will need to be set:
 
JAVA_HOME - Where your JDK is, most likely "/usr/local/diablo-jdk1.5.0/"
JVM_DATA_MODEL - The bitness of your JDK. You're using either the 32-bit or 
64-bit version, so set this to "32" or "64" accordingly.
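 
For example, to set these from your shell before running configure (the JDK 
path below is just the likely Diablo default mentioned above; adjust it to 
match your install):
 
# csh/tcsh (the FreeBSD default shell):
setenv JAVA_HOME /usr/local/diablo-jdk1.5.0
setenv JVM_DATA_MODEL 64
 
# or, in sh-compatible shells:
export JAVA_HOME=/usr/local/diablo-jdk1.5.0
export JVM_DATA_MODEL=64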
 
The build also requires the following, but these should be detected 
automatically by the configure script:
 
HADOOP_NATIVE_SRCDIR
OS_NAME
OS_ARCH
 
You should now be able to run "./configure".
 
Okay, the fun begins. The makefile in the "lib/" folder has something we don't 
want (need?) that will make any compile attempt fail. The "libhadoop_la_LIBADD =" 
variable is set to "$(HADOOP_OBJS) -ldl -ljvm". Now, I realize the first part 
should probably be something else, presumably substituted in by the configure 
script, but it wasn't, and it seems to make no difference. The real problem is 
the two linker flags (-ldl -ljvm); we need to remove those.
 
So that line should now read "libhadoop_la_LIBADD = $(HADOOP_OBJS)".
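 
If you'd rather script that edit than open the makefile by hand, something like 
the following should do it (this assumes the file is lib/Makefile and the line 
appears exactly as above; the empty '' after -i is how FreeBSD sed says "no 
backup file"):
 
sed -i '' -e 's|^libhadoop_la_LIBADD = .*|libhadoop_la_LIBADD = $(HADOOP_OBJS)|' lib/Makefile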
 
Now, the next part I'm pretty sure is happening because I'm not doing it the 
"official" way and the paths are getting all screwed up, but it's fairly easy 
to fix and won't cause any harm. Run the following commands; I've added a 
comment above each explaining why:
 
# it looks for this header here instead
cp src/org_apache_hadoop.h src/org/apache/hadoop/io/compress/zlib/
 
# it wants the header file under a different name for some reason, so we copy/rename it
cp src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib.h \
   src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib_ZlibCompressor.h
 
# same idea as above, copy/rename
cp src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib.h \
   src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib_ZlibDecompressor.h
 
Okay, so now you should be able to run "gmake" and everything should build. The 
binaries end up in "lib/.libs/" before you ever give the install command. 
Personally, I don't want them installed into my regular lib directory 
(/usr/local/lib/), so I dump them in some other folder and remember the path.
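 
For example (the destination directory here is purely hypothetical; use 
whatever path you like, and remember it for the Nutch setup below):
 
# stash the built libraries somewhere outside /usr/local/lib/
mkdir -p ~/hadoop-native-libs
cp lib/.libs/libhadoop.* ~/hadoop-native-libs/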
 
The next steps to actually get it running with Nutch can be found on Sami 
Siren's blog at http://blog.foofactory.fi/ (hope you don't mind the plug). I 
have tested it so far on my crawldb containing almost 50M URLs; the results are 
below:
 
Before:
 
link# du -m crawl/crawldb/
4092    crawl/crawldb/current/part-00000
4092    crawl/crawldb/current
4092    crawl/crawldb/
 
After a merge with no updates:
 
link# du -m crawl/crawldb/
688     crawl/crawldb/current/part-00000
688     crawl/crawldb/current
688     crawl/crawldb/
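 
That works out to roughly a 6x reduction (4092 MB down to 688 MB).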

Now, I'm sure I've missed some information, and I'd welcome feedback and 
suggestions. If you want the binaries I created, for whatever reason, email me; 
just keep in mind that your system needs to be FreeBSD/amd64 6.x using a 64-bit 
JVM for them to work.
 
Enjoy!
 
Sean Dean