Hi Jason,

I work for an international organization involved in the mobilization of
biodiversity data (specifically, we deal a lot with observations of
species), so think of it as a lot of point-based information with metadata
tags.  We have built an Oozie workflow that uses Sqoop to pull in a few
databases and then runs a big transformation and a set of quality-control
steps, which we implemented in Hive with some custom UDFs.  There is a
blog post introducing this at
http://www.cloudera.com/blog/2011/06/biodiversity-indexing-migration-from-mysql-to-hadoop/

All our work and data are open, so I can freely write about any of it, and
can link to real production code in Google svn.

If it would be of interest to you, I am happy to discuss what would be most
useful to write up for your book.  Some possible angles you might
consider:
- real UDFs in action (e.g. parsing species scientific names; a minimal
sketch follows this list)
- UDTFs to generate a Google map tile cache (also sketched below)
- Hive in an ETL workflow to take load off the source databases
- The pros and cons of calling web services from a UDF (we do it: it keeps
concerns cleanly separated, and we accept the risk of a self-inflicted
DDoS that we can control; see the third sketch below)
- Sqoop and Hive together
- We are getting into Hive on HBase and have found that UDFs can help with
type safety, since we aren't running HIVE-1634
  [with the advancements in Hive 0.9, I suspect our workarounds are not
worth documenting]
- Metrics illustrating the importance of join order, and of knowing data
cardinality, to ensure decent performance.
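
To make the first angle concrete, here is a minimal sketch of the kind of
UDF I mean. The class and the parsing rule are hypothetical
simplifications; our real name parser is considerably more involved and is
in the production svn linked above.

package org.example.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Registered in Hive with (hypothetical class name):
//   CREATE TEMPORARY FUNCTION extract_genus AS 'org.example.udf.ExtractGenus';
public class ExtractGenus extends UDF {
  private final Text result = new Text();

  // Returns the genus (the first token) of a scientific name, e.g.
  // "Puma concolor (Linnaeus, 1771)" -> "Puma"; null for empty input.
  public Text evaluate(Text scientificName) {
    if (scientificName == null) {
      return null;
    }
    String name = scientificName.toString().trim();
    if (name.isEmpty()) {
      return null;
    }
    result.set(name.split("\\s+")[0]);
    return result;
  }
}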
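
The tile-cache idea, roughly: a UDTF that explodes one occurrence point
into the Google tile addresses containing it, one row per zoom level. The
class name, output column names, and zoom cutoff are hypothetical; the
tile maths is the standard Web Mercator scheme.

package org.example.udtf;

import java.util.Arrays;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;

// Used as e.g.:
//   SELECT t.zoom, t.tile_x, t.tile_y FROM occurrence
//   LATERAL VIEW tile_addresses(lat, lng) t AS zoom, tile_x, tile_y;
public class TileAddresses extends GenericUDTF {
  private static final int MAX_ZOOM = 10; // assumption: cache tiles to zoom 10
  private PrimitiveObjectInspector latOI;
  private PrimitiveObjectInspector lngOI;
  private final Object[] row = new Object[3];

  @Override
  public StructObjectInspector initialize(ObjectInspector[] args)
      throws UDFArgumentException {
    if (args.length != 2) {
      throw new UDFArgumentException("tile_addresses(lat, lng) takes two arguments");
    }
    latOI = (PrimitiveObjectInspector) args[0];
    lngOI = (PrimitiveObjectInspector) args[1];
    return ObjectInspectorFactory.getStandardStructObjectInspector(
        Arrays.asList("zoom", "tile_x", "tile_y"),
        Arrays.<ObjectInspector>asList(
            PrimitiveObjectInspectorFactory.javaIntObjectInspector,
            PrimitiveObjectInspectorFactory.javaIntObjectInspector,
            PrimitiveObjectInspectorFactory.javaIntObjectInspector));
  }

  @Override
  public void process(Object[] args) throws HiveException {
    double lat = PrimitiveObjectInspectorUtils.getDouble(args[0], latOI);
    double lng = PrimitiveObjectInspectorUtils.getDouble(args[1], lngOI);
    double latRad = Math.toRadians(lat);
    for (int z = 0; z <= MAX_ZOOM; z++) {
      int tiles = 1 << z; // 2^z tiles per axis in the Google/OSM scheme
      int x = (int) Math.floor((lng + 180.0) / 360.0 * tiles);
      int y = (int) Math.floor((1.0
          - Math.log(Math.tan(latRad) + 1.0 / Math.cos(latRad)) / Math.PI)
          / 2.0 * tiles);
      row[0] = z;
      row[1] = Math.min(Math.max(x, 0), tiles - 1); // clamp at the antimeridian
      row[2] = Math.min(Math.max(y, 0), tiles - 1); // clamp near the poles
      forward(row);
    }
  }

  @Override
  public void close() throws HiveException {
    // nothing buffered, nothing to flush
  }
}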
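
And the web-service pattern, roughly (the lookup URL and the one-line
response format are made up for illustration). The one-entry cache is the
important part: Hive calls evaluate() once per row, so without some
caching a full table scan becomes exactly the self-inflicted DDoS
mentioned above.

package org.example.udf;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import org.apache.hadoop.hive.ql.exec.UDF;

public class LookupNameKey extends UDF {
  // Remember the last answer; input is often sorted or grouped by name,
  // so this trivial cache saves a surprising number of calls.
  private String lastName;
  private String lastKey;

  public String evaluate(String scientificName) {
    if (scientificName == null) {
      return null;
    }
    if (scientificName.equals(lastName)) {
      return lastKey;
    }
    try {
      URL url = new URL("http://api.example.org/name_lookup?name="
          + URLEncoder.encode(scientificName, "UTF-8"));
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      conn.setConnectTimeout(2000); // fail fast rather than stall the mappers
      conn.setReadTimeout(2000);
      BufferedReader in = new BufferedReader(
          new InputStreamReader(conn.getInputStream(), "UTF-8"));
      try {
        lastKey = in.readLine(); // assume the service returns the key as one line
      } finally {
        in.close();
      }
      lastName = scientificName;
      return lastKey;
    } catch (Exception e) {
      return null; // a missed lookup beats a failed job
    }
  }
}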

Hope this is of interest,
Tim

On Wed, Apr 11, 2012 at 7:48 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:

> Dear Hive User,
>
> We want your interesting case study for our upcoming book titled
> 'Programming Hive' from O'Reilly.
>
> How you use Hive, whether described at a high level or in low-level
> code detail, is encouraged!
>
> Feel free to reach out with a brief abstract.
>
> Regards,
>
> Jason Rutherglen
>
