On Fri, Aug 14, 2009 at 3:10 PM, bradford cross<[email protected]> wrote: > We have just released flightcaster.com which uses statistical inference and > machine learning to predict flight delays in advance of airlines (initial > results appear to do so with 85 - 90 % accuracy.) > > The webserver and webapp are all rails running on the Heroku platform; which > also serves our blackberry and iphone apps. > > The research and data heavy lifting is all in Clojure. > > Distributed data mining is done via a custom layer on top of cascading > (which is a layer on top of hadoop.) All run on EC2 and S3 using the very > nice cloudera AMIs and deployment scripts. > > In addition to the machine learning, the layer atop cascading performs all > the complex data data filtering and transformation operations; including > distributed joins from heterogeneous data sources and transformations into a > time series view that is fed to the machine learning computations that are > rolled into mappers and reducers. Remember, this is data from airlines and > the FAA, it is not pretty. Web data is messy but we have lots of good > frameworks, libs and sanitizers for web data. > > We wrapped cascading in a thin layer that we use to wrap clojure functions > in the cascading function objects and inject those into individual steps in > the workflows. This gets us very close to normal function composition for > the client code. Ultimately, we want to be able to do normal function > composition to compose cascading workflows in the same way as we would would > do vanilla function composition for small test runs on our local machines. > This is an execution agnostic programming model; client code doesn't bear > the signs of distributed execution. > > As a beneficial side effect, we found that this model forces us to have more > fine grained abstractions - because each operation must be ultimately be > injectable into a map-reduce phase, otherwise your paralleizm will be > unnecessarily course grained. This steers us clear of monolithic > uber-expressions. > > Another aspect of the design that allows us to do this is that the data > transformations write out clojure data structure literals, so we are > entirely insulated from the normal hadoop input/output formats...the wrapper > layer just uses the normal clojure reader to read in the strings from hadoop > and apply the vanilla clojure functions to the data structures. But we are > not limited to only clojure data structure literals. We also inject other > readers that can read other strings to clojure data structures, for example. > we use Dan Larkin's wonderful json lib for the initial reads of the raw json > data we store. > > All the analytical code is custom, so we don't use many 3rd party libs > outside of cascading, hadoop, the invaluable jets3t for working with s3. > Oh, and of course, - since we do so much with temporal analysis - joda-time > is the only way to work with dates in a sane way on the jvm. :-) > > If you travel a lot, check us out: flightcaster.com ... we have iphone and > blackberry apps. Unfortunately this is domestic US air travel only at the > moment due to the difficulty of of obtaining data for international carriers > and aviation agencies. >
Very cool - congrats! Rich --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to [email protected] Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to [email protected] For more options, visit this group at http://groups.google.com/group/clojure?hl=en -~----------~----~----~----~------~----~------~--~---
