Hi Jason,

I work for an international organization involved in the mobilization of biodiversity data (specifically, we deal with a lot of species observations), so think of it as a large volume of point-based records with metadata tags. We have built an Oozie workflow that uses Sqoop to pull in a few databases and then performs a large transformation and quality-control pass, implemented in Hive with some custom UDFs. There is a blog post introducing this at http://www.cloudera.com/blog/2011/06/biodiversity-indexing-migration-from-mysql-to-hadoop/
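To give a flavor of the custom-UDF step mentioned above: the sketch below shows the kind of core logic such a UDF might carry (splitting a scientific name into its parts). It is a minimal, hypothetical illustration, not our production code; the real UDF would wrap logic like this in Hive's UDF/GenericUDF classes, which are omitted here.

```java
// Hypothetical sketch of name-parsing logic for a Hive UDF.
// A real UDF would extend org.apache.hadoop.hive.ql.exec.UDF and
// expose an evaluate() method delegating to something like parse().
public class ScientificNameParser {

    public static final class ParsedName {
        public final String genus;
        public final String specificEpithet;

        ParsedName(String genus, String specificEpithet) {
            this.genus = genus;
            this.specificEpithet = specificEpithet;
        }
    }

    // Splits a binomial such as "Puma concolor Linnaeus, 1771" into
    // genus and specific epithet, ignoring any trailing authorship.
    // Returns null for input it cannot interpret, so the Hive column
    // simply comes back NULL for unparsable names.
    public static ParsedName parse(String raw) {
        if (raw == null) {
            return null;
        }
        String[] parts = raw.trim().split("\\s+");
        if (parts.length < 2) {
            return null;
        }
        return new ParsedName(parts[0], parts[1].toLowerCase());
    }
}
```

In a query this would then be used like `SELECT parse_name(scientific_name) FROM occurrences`, keeping the parsing logic in one tested place rather than scattered across HiveQL.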
All our work and data are open, so I can freely write about any of it and can link to real production code in Google Code svn. If it would be of interest to you, I am happy to discuss what would be most useful to write up for your book. Some possible angles you might consider:

- real UDFs in action (e.g. parsing species scientific names)
- UDTFs to generate a Google map tile cache
- Hive in an ETL workflow to take load off the databases
- the pros and cons of calling web services from a UDF (we do it, as it keeps concerns cleanly separated, and we accept the risk of a self-inflicted DDoS that we can control)
- Sqoop and Hive together
- Hive on HBase: we are getting into this and have found UDFs can help with type safety, since we aren't running HIVE-1634 (though with the advancements in Hive 0.9, I suspect our workarounds are no longer worth documenting)
- metrics illustrating the importance of join order, and of knowing your data's cardinality, for decent performance

Hope this is of interest,
Tim

On Wed, Apr 11, 2012 at 7:48 PM, Jason Rutherglen < jason.rutherg...@gmail.com> wrote:
> Dear Hive User,
>
> We want your interesting case study for our upcoming book titled
> 'Programming Hive' from O'Reilly.
>
> How you use Hive, either high level or low level code details are both
> encouraged!
>
> Feel free to reach out with a brief abstract.
>
> Regards,
>
> Jason Rutherglen