As we talked about in the earlier discussion (A), I'm considering two ways to implement a distributed map-reduce forest builder.
Given the reference implementation, the easiest implementation is the following:

* the data is distributed to the slave nodes using the DistributedCache
* each mapper loads the data in memory in JobConfigurable.configure()
* each tree is built by one mapper
* the job doesn't really need any input data; it should be possible to implement our own InputFormat that generates the InputSplits from the configuration parameters (the number of trees) -- see the sketches at the end of this message
* the mapper uses DecisionTree.toString() to output the tree as a String
* the main program rebuilds the forest by calling DecisionTree.parse(String) on each tree

Pros:

* it is the easiest implementation, because the actual ref. code can be used as is, and the distributed implementation will benefit from future optimizations of the ref. code

Cons:

* because it's based on the ref. implementation, it will be very slow when dealing with large datasets. For example, with half of the KDD 99 dataset (a file of about 350 MB), building one single tree took more than 12 hours (in fact I stopped the program after 12 hours) on a 2 GHz Core 2 laptop with 3 GB of RAM!
* if the slave nodes have many computing cores, it would be interesting to launch parallel mappers on every slave node to exploit multi-threading. But as I didn't find a way to share an in-memory variable between the mappers, each mapper must load its own copy of the data in memory!

Important:

* the memory usage of the ref. implementation will probably change when the Information Gain computation is optimized; in that case an out-of-core approach could become viable

So I'm asking, what should I do next?

* go on with this implementation
* optimize the IG computation
* consider the "Distributed Implementation B" (see (A)), which deals with BIG datasets and should not need any IG optimization, but whose code has nearly nothing to do with the ref. implementation

(A) http://markmail.org/message/mgap2nuhnl4kokeu
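To make the "no input data" idea concrete, here is a minimal sketch of what such an InputFormat could look like with the mapred API: it reads the number of trees from the job configuration (the parameter name "mahout.rf.nbtrees" is just an assumption) and generates one empty split per tree, each carrying only the id of the tree to build.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class TreeInputFormat implements InputFormat<LongWritable, NullWritable> {

  public InputSplit[] getSplits(JobConf conf, int numSplits) {
    // the parameter name is an assumption, not part of the ref. code
    int nbTrees = conf.getInt("mahout.rf.nbtrees", 10);
    InputSplit[] splits = new InputSplit[nbTrees];
    for (int i = 0; i < nbTrees; i++) {
      splits[i] = new TreeSplit(i);
    }
    return splits;
  }

  public RecordReader<LongWritable, NullWritable> getRecordReader(
      InputSplit split, JobConf conf, Reporter reporter) {
    // a reader that emits exactly one record: (treeId, null)
    final long treeId = ((TreeSplit) split).treeId;
    return new RecordReader<LongWritable, NullWritable>() {
      private boolean done = false;

      public boolean next(LongWritable key, NullWritable value) {
        if (done) return false;
        key.set(treeId);
        done = true;
        return true;
      }

      public LongWritable createKey() { return new LongWritable(); }
      public NullWritable createValue() { return NullWritable.get(); }
      public long getPos() { return done ? 1 : 0; }
      public float getProgress() { return done ? 1.0f : 0.0f; }
      public void close() {}
    };
  }

  /** An empty split that only carries the id of the tree to build. */
  public static class TreeSplit implements InputSplit {
    long treeId;

    public TreeSplit() {} // needed for deserialization
    TreeSplit(long treeId) { this.treeId = treeId; }

    public long getLength() { return 0; }
    public String[] getLocations() { return new String[0]; }
    public void write(DataOutput out) throws IOException { out.writeLong(treeId); }
    public void readFields(DataInput in) throws IOException { treeId = in.readLong(); }
  }
}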
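And a matching sketch of the mapper: it loads the whole dataset from the DistributedCache in configure() and builds one tree per map() call, each map() input key being the id of the tree to build. Data, Data.load() and DecisionTree.buildTree() are hypothetical stand-ins for the ref. code; only toString() and parse(String) come from the actual DecisionTree class.

import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TreeBuildingMapper extends MapReduceBase
    implements Mapper<LongWritable, NullWritable, LongWritable, Text> {

  private Data data; // full training set, one copy per mapper (see Cons)

  public void configure(JobConf conf) {
    try {
      // the dataset was pushed to every slave via the DistributedCache
      Path[] cached = DistributedCache.getLocalCacheFiles(conf);
      data = Data.load(cached[0], conf); // hypothetical loader from the ref. code
    } catch (IOException e) {
      throw new RuntimeException("cannot load the cached dataset", e);
    }
  }

  public void map(LongWritable treeId, NullWritable value,
      OutputCollector<LongWritable, Text> output, Reporter reporter)
      throws IOException {
    // one map() call == one tree; the split only carries the tree id
    DecisionTree tree = DecisionTree.buildTree(data); // hypothetical builder
    output.collect(treeId, new Text(tree.toString()));
  }
}

The main program would then simply collect the job's (treeId, String) outputs and call DecisionTree.parse(String) on each one to rebuild the forest.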