Hello everyone,

we're happy to announce the 1.0.0-rc2 release of Pydoop (http://crs4.github.io/pydoop), the non-Streaming Python interface to Hadoop. Adding to the simplified installation and new Pythonic API introduced with 1.0.0-rc1, this rc provides built-in Avro support (for now, only with Hadoop 2). By setting a few flags in the submitter and selecting the new AvroContext as your application's context class, you can read and write Avro data, transparently manipulating records as Python dictionaries. For instance, you could count your favorite colors stored in an Avro file like this:

   export STATS_SCHEMA=$(cat stats.avsc)
   pydoop submit \
     -D pydoop.mapreduce.avro.value.output.schema="${STATS_SCHEMA}" \
     --avro-input v --avro-output v \
     --upload-file-to-cache color_count.py --mrv2 \
     color_count input output
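
The command above reads the output schema from stats.avsc, which is not shown here. As a rough sketch only, a schema compatible with the records emitted by the reducer (an office name plus a map of color counts) could be generated as follows. The namespace and record name are assumptions made for illustration; the schema shipped with the Pydoop examples may differ.

```python
import json

# Hypothetical output schema: one record per office, holding a map from
# color name to the number of users who chose that color.
# "example.avro" and "Stats" are illustrative names, not taken from Pydoop.
STATS_SCHEMA = {
    "namespace": "example.avro",
    "type": "record",
    "name": "Stats",
    "fields": [
        {"name": "office", "type": "string"},
        {"name": "counts", "type": {"type": "map", "values": "long"}},
    ],
}

# Serialize to the JSON text that would be saved as stats.avsc.
schema_json = json.dumps(STATS_SCHEMA, indent=2)
print(schema_json)
```

Redirecting the printed JSON to stats.avsc would give the submit command something to load into STATS_SCHEMA.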

And your Pydoop code would be these few lines:

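   from collections import Counter

   import pydoop.mapreduce.api as api

   class Mapper(api.Mapper):
       def map(self, ctx):
           user = ctx.value  # Avro record, exposed as a dict
           color = user['favorite_color']
           if color is not None:
               # key: office; value: a one-item counter for this user's color
               ctx.emit(user['office'], Counter({color: 1}))

   class Reducer(api.Reducer):
       def reduce(self, ctx):
           # add up the per-user counters for this office
           s = sum(ctx.values, Counter())
           ctx.emit('', {'office': ctx.key, 'counts': s})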
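
To see what the mapper and reducer compute without a Hadoop cluster, here is a minimal plain-Python sketch of the same logic. It does not use Pydoop at all: the sample records, the helper names map_user and reduce_office, and the in-memory grouping that stands in for the shuffle phase are all made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical input records; only the "office" and "favorite_color"
# fields are used, matching the Pydoop code above.
users = [
    {"name": "Alice", "office": "office-1", "favorite_color": "blue"},
    {"name": "Bob", "office": "office-1", "favorite_color": "blue"},
    {"name": "Carol", "office": "office-2", "favorite_color": None},
    {"name": "Dave", "office": "office-2", "favorite_color": "red"},
    {"name": "Eve", "office": "office-1", "favorite_color": "green"},
]


def map_user(user):
    """Yield (office, Counter) pairs, skipping users with no favorite color."""
    color = user["favorite_color"]
    if color is not None:
        yield user["office"], Counter({color: 1})


def reduce_office(office, counters):
    """Merge the per-user counters of one office into a single record."""
    total = sum(counters, Counter())
    return {"office": office, "counts": dict(total)}


# Stand-in for the shuffle phase: group mapper output by key.
grouped = defaultdict(list)
for user in users:
    for office, counter in map_user(user):
        grouped[office].append(counter)

stats = [reduce_office(office, grouped[office]) for office in sorted(grouped)]
for record in stats:
    print(record)
```

Summing Counter objects with an empty Counter as the start value adds the counts key by key, which is why the reducer needs no explicit loop.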

Any input/output format that exchanges Avro records is supported, including the Parquet ones. For more detailed information, see the docs at http://crs4.github.io/pydoop/examples/avro.html

Pydoop is a Python API for Hadoop that allows you to write full-fledged MapReduce applications with HDFS access. Pydoop powers several scientific projects at CRS4, including Seal (http://biodoop-seal.sourceforge.net), Biodoop-BLAST (http://biodoop.sourceforge.net/blast) and VISPA (https://github.com/crs4/vispa), as well as successful commercial services such as Slacker Radio (http://www.slacker.com).

Please note that this is a release candidate that has not been used in production yet. Among other things, this means you have to pass the "--pre" flag when installing with pip (for example, "pip install --pre pydoop"). As usual, we're happy to receive your feedback: please open an issue on GitHub if you spot a bug or find something that could be improved.

Links:

  * download: http://pypi.python.org/pypi/pydoop
  * docs: http://crs4.github.io/pydoop
  * git repo: https://github.com/crs4/pydoop
  * paper: http://dx.doi.org/10.1145/1851476.1851594
  * Dr. Dobb's review: http://www.drdobbs.com/database/pydoop-writing-hadoop-programs-in-python/240156473

Happy pydooping!

The Pydoop Team

--
Simone Leo
Data Fusion - Distributed Computing
CRS4
POLARIS - Building #1
Piscina Manna
I-09010 Pula (CA) - Italy
e-mail: simone....@crs4.it
http://www.crs4.it
