Hi,
I'm very happy to report that I was able to run Nutch using GCJ - both
for out-of-the-box compilation and as a runtime VM.
The OS is Fedora 5b2, gcj -v reports "gcc version 4.1.0 20060106 (Red
Hat 4.1.0-0.14)".
I encountered only minor problems, with simple workarounds:
* JAVA_HOME is not set by default. I set it to /usr, where bin/java ->
bin/gcj resides, and it worked. You should set it to wherever you have
the bin/java binary.
* Lots and lots of warnings are emitted during compilation. Annoying as
that is (Sun's javac either ignores them or emits a single warning
message per compilation unit), it's certainly useful - we should look at
these places and see if we can fix anything.
* protocol-httpclient wouldn't compile, because it uses private Sun SSL
classes. This can be fixed simply by replacing "com.sun" with "javax",
and implementing 2 empty methods in DummyX509TrustManager. We should do
it anyway; depending on private classes is bad coding (mea culpa :).
* Hadoop Configuration.java:428 makes an explicit cast to
org.apache.xerces.dom.DocumentImpl, but gcj uses its own implementation
by default, so the cast throws a ClassCastException. I fixed this by
adding two JARs from the Xalan distribution (xalan.jar and
serializer.jar), which apparently take precedence over the built-in XSL
processor (theoretically, you should then also specify
-Djavax.xml.transform.TransformerFactory=org.apache.xalan.processor.TransformerFactoryImpl
but I didn't need this, not sure why).
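
To make the protocol-httpclient item above a bit more concrete, here is
a minimal sketch of a trust manager written against the public
javax.net.ssl API instead of the private com.sun classes. The class name
AcceptAllTrustManager and the newContext() helper are hypothetical; this
is not the actual DummyX509TrustManager code, only an illustration of
what "2 empty methods" could look like. Note that it disables
certificate verification entirely, so it only makes sense where that is
the intent.

```java
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical stand-in for illustration; not the real Nutch class.
class AcceptAllTrustManager implements X509TrustManager {

    // Empty body: never throws, so any client certificate chain is accepted.
    public void checkClientTrusted(X509Certificate[] chain, String authType)
            throws CertificateException {
    }

    // Empty body: never throws, so any server certificate chain is accepted.
    public void checkServerTrusted(X509Certificate[] chain, String authType)
            throws CertificateException {
    }

    // No particular issuers are advertised as trusted.
    public X509Certificate[] getAcceptedIssuers() {
        return new X509Certificate[0];
    }

    // Assumed helper: builds an SSLContext that uses only this trust manager.
    static SSLContext newContext() throws Exception {
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, new TrustManager[] { new AcceptAllTrustManager() }, null);
        return ctx;
    }
}
```

The two check* methods are the ones that javax.net.ssl.X509TrustManager
requires; leaving them empty means "trust everything".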
After applying these fixes I was able to run the whole Nutch workflow. 8-D
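
Regarding the Configuration.java item above: a small probe like the
following might help to see which XML implementations a given VM and
classpath actually pick up, and so whether a cast to a Xerces class can
succeed. This is only a diagnostic sketch using the standard JAXP
factories; the class name JaxpProbe is made up, and it does not
reproduce the Hadoop code in question.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import org.w3c.dom.Document;

// Hypothetical diagnostic class; prints what JAXP resolves at runtime.
class JaxpProbe {

    // Name of the TransformerFactory implementation chosen by JAXP.
    static String transformerFactoryClass() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    // Name of the concrete Document class produced by the default parser.
    static String documentClass() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        return doc.getClass().getName();
    }

    public static void main(String[] args) throws Exception {
        // null here means the -D option was not given on the command line.
        System.out.println("TransformerFactory property: "
                + System.getProperty("javax.xml.transform.TransformerFactory"));
        System.out.println("TransformerFactory class:    "
                + transformerFactoryClass());
        System.out.println("Document class:              "
                + documentClass());
    }
}
```

Running it once with and once without the extra JARs on the classpath
should show whether they change the chosen implementations.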
No performance numbers yet, as I don't have an appropriate test setup at
the moment. However, when crawling the same segments, GCJ seems to
quickly allocate and "pin down" all the necessary heap space from the OS
(the resident mem size of the process was > 90% of my real RAM). I
quickly ran out of real memory and the OS had to start swapping, which
of course affected the performance. Sun's Java, on the other hand, seems
to allocate piece-wise, and overall it consumed much less memory than
GCJ in this limited test (the resident mem size was very low, ca. 30MB).
The virtual mem size was nearly identical in both cases, ~1150MB.
I also saw a message from gij which may indicate some further lurking
memory management problems:
GC Warning: Repeated allocation of very large block (appr. size 6578176):
May lead to memory leak and poor performance
So, we'll see. But to be fair, if it has anything to do with the
message, I ran it on a machine with relatively little RAM (~512M), and
the gij process used all of it + a sizable chunk of swap (I left the
default setting of -Xmx1000m, and as I mentioned above gij happily
allocated all of it). If there is some magic option I should have used
with gij, I'd love to know it...
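
One way to compare how the two VMs commit heap from inside the process
would be a tiny probe like the one below. It uses only the standard
java.lang.Runtime methods; the class name HeapProbe is made up, and I
have not verified that gij reports these values the same way Sun's VM
does, so treat it as a sketch rather than a measurement tool.

```java
// Hypothetical helper; reports heap figures as seen by the running VM.
class HeapProbe {

    static String report() {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024L * 1024L;
        long max = rt.maxMemory() / mb;          // upper limit, e.g. from -Xmx
        long committed = rt.totalMemory() / mb;  // heap currently reserved by the VM
        long used = (rt.totalMemory() - rt.freeMemory()) / mb; // in use within it
        return "max=" + max + "MB committed=" + committed
                + "MB used=" + used + "MB";
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

Calling report() periodically during a fetch would show whether the
committed heap jumps straight to the -Xmx limit or grows gradually.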
Nonetheless, I must say I'm impressed - even if there were some memory
management problems, at the end of the day the whole process was stable,
and the overall fetching speed was in the same ballpark in both cases
(63 kb/s with gij, 75 kb/s with Sun; I used the default settings with 10
threads).
My hat's off to GCJ folks - it's amazing how far it's progressed ... if
only the GUI and JNI apps were similarly advanced ;-)
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com