Hi,

I'm very happy to report that I was able to run Nutch using GCJ - both for out-of-the-box compilation and as a runtime VM.

The OS is Fedora 5b2; gcj -v reports "gcc version 4.1.0 20060106 (Red Hat 4.1.0-0.14)".

I encountered only minor problems, with simple workarounds:

* JAVA_HOME is not set by default. I set it to /usr, where bin/java -> bin/gcj resides, and it worked. You should set it to wherever you have the bin/java binary.

* Lots and lots of warnings are emitted during compilation. Annoying as that is (Sun javac either ignores them or emits a single warning message per compilation unit), it's certainly useful - we should look at these places and see if we can fix anything.

* protocol-httpclient wouldn't compile, because it uses private Sun SSL classes. This can be fixed simply by replacing "com.sun" with "javax", and implementing 2 empty methods in DummyX509TrustManager. We should do it anyway; depending on private classes is bad coding (mea culpa :).

* Hadoop Configuration.java:428 makes an explicit cast to org.apache.xerces.dom.DocumentImpl, but gcj uses its own implementation by default, so the cast throws a ClassCastException. I fixed this by adding two JARs from the Xalan distribution (xalan.jar and serializer.jar), which apparently take precedence over the built-in XSL processor. (Theoretically, you should then specify -Djavax.xml.transform.TransformerFactory=org.apache.xalan.processor.TransformerFactoryImpl, but I didn't need this; not sure why.)
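
To illustrate the protocol-httpclient fix, here is a minimal sketch of what a trust manager written against the public javax.net.ssl API could look like. It is not the actual Nutch source: the class body is an assumption, and the "2 empty methods" are presumably checkClientTrusted and checkServerTrusted, which the javax.net.ssl.X509TrustManager interface requires.

```java
import java.security.cert.X509Certificate;
import javax.net.ssl.X509TrustManager;

// Sketch only (assumed shape, not the real Nutch class): a trust manager
// that depends solely on the public javax.net.ssl API, so it does not
// need any com.sun.* classes at compile time.
class DummyX509TrustManager implements X509TrustManager {

    // Intentionally empty: never throws, so every client chain is accepted.
    public void checkClientTrusted(X509Certificate[] chain, String authType) {
    }

    // Intentionally empty: never throws, so every server chain is accepted.
    public void checkServerTrusted(X509Certificate[] chain, String authType) {
    }

    // No issuers are advertised as trusted.
    public X509Certificate[] getAcceptedIssuers() {
        return new X509Certificate[0];
    }
}
```

Note that empty check methods mean every certificate is accepted without validation, so a class like this gives no protection against a spoofed server.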

After applying these fixes I was able to run the whole Nutch workflow. 8-D

No performance numbers yet, as I don't have an appropriate test setup at the moment. However, when crawling the same segments, GCJ seems to quickly allocate and "pin down" all the heap space it needs from the OS (the resident memory size of the process was > 90% of my real RAM). I quickly ran out of real memory and the OS had to start swapping, which of course affected performance. Sun Java, on the other hand, seems to allocate piece-wise, and overall it consumed much less memory than GCJ in this limited test (the resident memory size was very low, ca. 30MB). The virtual memory size was nearly identical for both, ~1150MB.

I also saw a message from gij which may indicate some further lurking memory management problems:

GC Warning: Repeated allocation of very large block (appr. size 6578176):
      May lead to memory leak and poor performance

So, we'll see. But to be fair, if it has anything to do with the message, I ran it on a machine with relatively little RAM (~512M), and the gij process used all of it + a sizable chunk of swap (I left the default setting of -Xmx1000m, and as I mentioned above gij happily allocated all of it). If there is some magic option I should have used with gij, I'd love to know it...
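
One rough way to compare the two VMs from the inside would be to log the heap figures each runtime reports about itself. The following is only a sketch: HeapReport is a made-up helper, not part of Nutch or Hadoop, and the numbers it prints are the VM's own view of its heap, which need not match the resident size the OS shows.

```java
// Hypothetical helper (not part of Nutch): prints the heap figures the
// running VM reports, using only java.lang.Runtime.
class HeapReport {

    static String describe() {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024L * 1024L;
        long max = rt.maxMemory() / mb;          // upper limit, e.g. from -Xmx
        long committed = rt.totalMemory() / mb;  // heap currently reserved by the VM
        long used = (rt.totalMemory() - rt.freeMemory()) / mb; // part of it in use
        return "max=" + max + "MB committed=" + committed + "MB used=" + used + "MB";
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Calling HeapReport.describe() at a few points during a fetch, under gij and under Sun's VM, would show whether the committed heap really jumps to the -Xmx limit right away in one and grows gradually in the other.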

Nonetheless, I must say I'm impressed - even if there were some memory management problems, at the end of the day the whole process was stable, and the overall fetching speed in each case was very similar (63 kb/s with gij, 75 kb/s with Sun; I used the default settings with 10 threads).

My hat's off to GCJ folks - it's amazing how far it's progressed ... if only the GUI and JNI apps were similarly advanced ;-)

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

