Pydoop 1.0.0-rc2

Simone Leo Tue, 10 Mar 2015 09:19:45 -0700

Hello everyone,

we're happy to announce the 1.0.0-rc2 release of Pydoop (http://crs4.github.io/pydoop), the non-Streaming Python interface to Hadoop. Adding to the simplified installation and new Pythonic API introduced with 1.0.0-rc1, this rc provides built-in Avro support (for now, only with Hadoop 2). By setting a few flags in the submitter and selecting the new AvroContext as your application's context class, you can read and write Avro data, transparently manipulating records as Python dictionaries. For instance, you could count your favorite colors stored in an Avro file like this:


   export STATS_SCHEMA=$(cat stats.avsc)
   pydoop submit \
     -D pydoop.mapreduce.avro.value.output.schema="${STATS_SCHEMA}" \
     --avro-input v --avro-output v \
     --upload-file-to-cache color_count.py --mrv2 \
     color_count input output

And your Pydoop code would be these few lines:

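   from collections import Counter

   import pydoop.mapreduce.api as api

   class Mapper(api.Mapper):
       def map(self, ctx):
           # ctx.value is an Avro record, exposed as a Python dictionary
           user = ctx.value
           color = user['favorite_color']
           if color is not None:
               ctx.emit(user['office'], Counter({color: 1}))

   class Reducer(api.Reducer):
       def reduce(self, ctx):
           # merge the per-record counters emitted for this office
           s = sum(ctx.values, Counter())
           ctx.emit('', {'office': ctx.key, 'counts': s})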
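
To see what the mapper and reducer compute without a Hadoop cluster, here is a minimal plain-Python sketch of the same counting logic. It does not use Pydoop at all: the sample records, the map_record and reduce_values helpers and the in-memory grouping step are hypothetical stand-ins for illustration, not part of the Pydoop API.

```python
from collections import Counter, defaultdict

# Hypothetical sample records; field names follow the example above.
users = [
    {"office": "office-1", "favorite_color": "red"},
    {"office": "office-1", "favorite_color": "blue"},
    {"office": "office-1", "favorite_color": "red"},
    {"office": "office-2", "favorite_color": None},
    {"office": "office-2", "favorite_color": "green"},
]


def map_record(user):
    """Yield (office, Counter) pairs, skipping users with no favorite color."""
    color = user["favorite_color"]
    if color is not None:
        yield user["office"], Counter({color: 1})


def reduce_values(office, counters):
    """Merge all the per-record counters collected for one office."""
    return {"office": office, "counts": sum(counters, Counter())}


# Stand-in for the shuffle phase: group the map output by key.
grouped = defaultdict(list)
for user in users:
    for office, counter in map_record(user):
        grouped[office].append(counter)

stats = [
    reduce_values(office, counters)
    for office, counters in sorted(grouped.items())
]

for record in stats:
    print(record["office"], dict(record["counts"]))
```

Starting the sum from an empty Counter is what lets the reducer merge any number of per-record counters with a single expression.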

Any input/output format that exchanges Avro records is supported, including the Parquet ones. For more detailed information, see the docs at http://crs4.github.io/pydoop/examples/avro.html
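
The submit command above reads the output schema from stats.avsc, which is not included in this message. As a rough sketch, assuming the reducer output is a record with an office name and a color-to-count map, a matching Avro schema could be generated as follows. The record name and namespace are made up for illustration; check the docs linked above for the actual file.

```python
import json

# Assumed output schema: the "name" and "namespace" values are hypothetical.
# The field names mirror the dictionary emitted by the reducer above.
stats_schema = {
    "type": "record",
    "name": "Stats",
    "namespace": "example.avro",
    "fields": [
        {"name": "office", "type": "string"},
        # Avro map keys are always strings; values hold the color counts.
        {"name": "counts", "type": {"type": "map", "values": "long"}},
    ],
}

schema_json = json.dumps(stats_schema, indent=2)
print(schema_json)  # save this text as stats.avsc
```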

Pydoop is a Python API for Hadoop that allows you to write full-fledged MapReduce applications with HDFS access. Pydoop powers several scientific projects at CRS4, including Seal (http://biodoop-seal.sourceforge.net), Biodoop-BLAST (http://biodoop.sourceforge.net/blast) and VISPA (https://github.com/crs4/vispa), as well as successful commercial services such as Slacker Radio (http://www.slacker.com).

Please note that this is a release candidate that has not been used in production yet. This means, among other things, that you have to add the "--pre" flag if installing with pip (for instance, "pip install --pre pydoop"). As usual, we're happy to receive your feedback: please open an issue on GitHub if you spot a bug or find something that could be improved.


Links:

  * download: http://pypi.python.org/pypi/pydoop
  * docs: http://crs4.github.io/pydoop
  * git repo: https://github.com/crs4/pydoop
  * paper: http://dx.doi.org/10.1145/1851476.1851594
  * Dr. Dobb's review:
    http://www.drdobbs.com/database/pydoop-writing-hadoop-programs-in-python/240156473

Happy pydooping!

The Pydoop Team

--
Simone Leo
Data Fusion - Distributed Computing
CRS4
POLARIS - Building #1
Piscina Manna
I-09010 Pula (CA) - Italy
e-mail: simone....@crs4.it
http://www.crs4.it
