Our software doesn't use MapReduce. It is a pure YARN application that is essentially a peer to MapReduce. There are many reasons for this decision, but the main one is that we have a large code base that already executes data transformations in a single-server environment, and we wanted to build a product without rewriting huge swaths of code. As a result, our software takes care of many things usually delegated to MapReduce, including the distributed sort/partition (i.e., "the shuffle").

However, MapReduce has a special place in the ecosystem, in that it runs an auxiliary service in the NodeManager to serve shuffle data to reducers. Third-party applications do not appear to have an easy way to install aux services: the JARs for any such service must be on Hadoop's classpath on all nodes at startup, which creates both a management issue and a trust/security issue.

Currently our software writes temporary data to HDFS for this purpose, but we've found that HDFS carries a large overhead in performance and file handles, even at low replication. We would like to replace HDFS with a lighter-weight service that manages temp files and distributes their data.
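For context on the aux-service point, this is roughly how the stock MapReduce shuffle is registered with each NodeManager in yarn-site.xml on Hadoop 2.x, as far as I understand it. A third-party service would need an analogous entry under its own name, plus its JAR on the NodeManager classpath; the `my_shuffle` name mentioned in the comment is purely hypothetical.

```xml
<!-- yarn-site.xml on every NodeManager host -->
<configuration>
  <!-- Comma-separated list of aux services the NodeManager loads at startup.
       A third-party service would be appended here,
       e.g. "mapreduce_shuffle,my_shuffle" (my_shuffle is hypothetical). -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>

  <!-- Implementation class for the service named above. It must already be
       on the NodeManager classpath when the daemon starts. -->
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
```

Because this is read at NodeManager startup, adding or changing a service means touching the configuration and classpath on every node and restarting the daemons, which is the management burden described above.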
Is the Slider project something that can address our needs?

John Lilley
