If you want to try out the Python API for Avro data files, I wrote a short blog post on reading and writing them at http://www.harshj.com/2010/04/25/writing-and-reading-avro-data-files-using-python/ which I think still holds good. Hope this helps.
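For the question about filtering records on a field before writing CSV, here is a minimal sketch, not a definitive implementation. It assumes the avro Python package is installed and uses its DataFileReader and DatumReader classes to iterate over records as dicts. The field names "station", "temp" and "time" and the function names are hypothetical, chosen only for illustration; the real names come from the schema stored in your data file.

```python
import csv
import sys


def iter_avro_records(path):
    """Yield each record of an Avro data file as a dict.

    The avro import is local so that the CSV logic below can be tried
    without the avro package installed.
    """
    from avro.datafile import DataFileReader
    from avro.io import DatumReader

    reader = DataFileReader(open(path, "rb"), DatumReader())
    try:
        for record in reader:
            yield record
    finally:
        reader.close()  # also closes the underlying file


def write_filtered_csv(records, fields, out, keep=lambda record: True):
    """Write the chosen fields of every record accepted by `keep` to `out`.

    Returns the number of data rows written (the header is not counted).
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(fields)
    written = 0
    for record in records:
        if keep(record):
            writer.writerow([record.get(name) for name in fields])
            written += 1
    return written


if __name__ == "__main__":
    # Stand-in records with hypothetical field names. With avro installed,
    # replace `sample` with iter_avro_records("sample_input.avro").
    sample = [
        {"station": "011990-99999", "temp": 0, "time": -619524000000},
        {"station": "012650-99999", "temp": 111, "time": -655531200000},
    ]
    write_filtered_csv(sample, ["station", "temp"], sys.stdout,
                       keep=lambda r: r["temp"] > 50)
```

Keeping the condition as a plain Python function means the grep step is no longer needed: any test on one or more fields can be passed as `keep`.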
On Wed, Jan 25, 2012 at 1:50 AM, selvi k <gridsngat...@gmail.com> wrote:
> I found out what the issue was:
> I first needed to install snappy downloaded from here:
> http://code.google.com/p/snappy/
>
> After a simple ./configure, make and make install, 'easy_install avro'
> completed successfully.
>
> I will try out both the CSV conversion options and update this thread in a
> bit.
>
> -Selvi
>
> On Tue, Jan 24, 2012 at 2:37 PM, selvi k <gridsngat...@gmail.com> wrote:
>>
>> Douglas and Harsh - Thanks a lot for the immediate and detailed replies!
>> Looks like both of these would work well for me.
>>
>> In order to start trying these, I have tried a few things to get started
>> with Avro, but this is where I am stuck:
>>
>> 1. I first downloaded the stable version in the form of
>> "avro-1.6.1.tar.gz". (I am working out all this on an Ubuntu 10.04 machine).
>>
>> I don't find a readme file and am not familiar with installing a python
>> package, so I am not sure if what I am doing is correct. After some basic
>> googling, I did:
>>
>> avro-1.6.1$ ./setup.py build
>>
>> This appears to complete successfully. Then when I do this:
>>
>> ...avro-1.6.1$ sudo ./setup.py install
>>
>> I get an error message. (pasted at the end of this mail [1])
>>
>> 2. I tried the technique suggested by Harsh, but it ends with a similar
>> error as pasted below in [2]
>>
>> /avro$ sudo easy_install avro
>>
>> Then I tried to install snappy by itself:
>>
>> /avro$ sudo easy_install python-snappy
>>
>> I get the same error.
>>
>> Also I read that this might help with this type of error, so I tried:
>>
>> avro$ sudo apt-get install python2.6-dev
>>
>> I ensured I have gcc and installed g++ too (because I wasn't sure what was
>> needed).
>>
>> I did see a similar error message reported here for Avro and OS X:
>> https://issues.apache.org/jira/browse/AVRO-981
>>
>> Before installing g++ and python-dev, the error message I was seeing from
>> easy_install python-snappy was different and shorter (attached below) [3].
>>
>> Sorry if I should just be reading up on general Python development or
>> packages or installs (and/or other things), before I should even be
>> attempting to do this. I'll be doing that now to move further. But in case
>> anyone might have suggestions for the errors I am seeing, that would be
>> great.
>>
>> I did find this Quick Start Guide from the main Avro wiki page, but when I
>> look through the Python example it is once again focused on client/server
>> and RPC communication between them:
>>
>> https://github.com/phunt/avro-rpc-quickstart
>>
>> Also my understanding is that I must 'install' or deploy Avro before I can
>> try out the C bindings suggested by Douglas. I am stating this since I am
>> not exactly clear about what this meant: - "especially since the C bindings
>> don't have any library dependencies to install". I am assuming it means, I
>> don't need anything beyond a basic install of Avro.
>>
>> 3. With regards to the two suggested ways, would either of these
>> techniques allow me to filter my data records using some sort of a
>> condition on a field (or a few fields)? If not it seems like I would have
>> to resort to first grepping the log file with the condition I want, and
>> then using either of these two techniques to convert to CSV file. This
>> would still be much better than what I am doing now, which is through
>> not-so-pretty awk invocations to retrieve the fields I need (after the
>> initial grep). But if the existing API allows me to scan through the log
>> file and specify conditions for fields, it might be much more efficient.
>> I can imagine that I might have to use the low-level API and write a
>> program to do this, but I am not sure at this point how to get started on
>> this.
>>
>> Any pointers would be really helpful!
>>
>> Thank you,
>>
>> Selvi
>>
>> [1]
>>
>> /avro-1.6.1$ sudo ./setup.py install
>> running install
>> Checking .pth file support in /usr/local/lib/python2.6/dist-packages/
>> /usr/bin/python -E -c pass
>> TEST PASSED: /usr/local/lib/python2.6/dist-packages/ appears to support .pth files
>> running bdist_egg
>> running egg_info
>> writing requirements to avro.egg-info/requires.txt
>> writing avro.egg-info/PKG-INFO
>> writing top-level names to avro.egg-info/top_level.txt
>> writing dependency_links to avro.egg-info/dependency_links.txt
>> reading manifest file 'avro.egg-info/SOURCES.txt'
>> writing manifest file 'avro.egg-info/SOURCES.txt'
>> installing library code to build/bdist.linux-x86_64/egg
>> running install_lib
>> running build_py
>> creating build/bdist.linux-x86_64
>> creating build/bdist.linux-x86_64/egg
>> creating build/bdist.linux-x86_64/egg/avro
>> copying build/lib.linux-x86_64-2.6/avro/io.py -> build/bdist.linux-x86_64/egg/avro
>> copying build/lib.linux-x86_64-2.6/avro/datafile.py -> build/bdist.linux-x86_64/egg/avro
>> copying build/lib.linux-x86_64-2.6/avro/tool.py -> build/bdist.linux-x86_64/egg/avro
>> copying build/lib.linux-x86_64-2.6/avro/txipc.py -> build/bdist.linux-x86_64/egg/avro
>> copying build/lib.linux-x86_64-2.6/avro/ipc.py -> build/bdist.linux-x86_64/egg/avro
>> copying build/lib.linux-x86_64-2.6/avro/protocol.py -> build/bdist.linux-x86_64/egg/avro
>> copying build/lib.linux-x86_64-2.6/avro/__init__.py -> build/bdist.linux-x86_64/egg/avro
>> copying build/lib.linux-x86_64-2.6/avro/schema.py -> build/bdist.linux-x86_64/egg/avro
>> byte-compiling build/bdist.linux-x86_64/egg/avro/io.py to io.pyc
>> byte-compiling build/bdist.linux-x86_64/egg/avro/datafile.py to datafile.pyc
>> byte-compiling build/bdist.linux-x86_64/egg/avro/tool.py to tool.pyc
>> byte-compiling build/bdist.linux-x86_64/egg/avro/txipc.py to txipc.pyc
>> byte-compiling build/bdist.linux-x86_64/egg/avro/ipc.py to ipc.pyc
>> byte-compiling build/bdist.linux-x86_64/egg/avro/protocol.py to protocol.pyc
>> byte-compiling build/bdist.linux-x86_64/egg/avro/__init__.py to __init__.pyc
>> byte-compiling build/bdist.linux-x86_64/egg/avro/schema.py to schema.pyc
>> creating build/bdist.linux-x86_64/egg/EGG-INFO
>> installing scripts to build/bdist.linux-x86_64/egg/EGG-INFO/scripts
>> running install_scripts
>> running build_scripts
>> creating build/bdist.linux-x86_64/egg/EGG-INFO/scripts
>> copying build/scripts-2.6/avro -> build/bdist.linux-x86_64/egg/EGG-INFO/scripts
>> changing mode of build/bdist.linux-x86_64/egg/EGG-INFO/scripts/avro to 755
>> copying avro.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
>> copying avro.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
>> copying avro.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
>> copying avro.egg-info/requires.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
>> copying avro.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
>> zip_safe flag not set; analyzing archive contents...
>> creating dist
>> creating 'dist/avro-1.6.1-py2.6.egg' and adding 'build/bdist.linux-x86_64/egg' to it
>> removing 'build/bdist.linux-x86_64/egg' (and everything under it)
>> Processing avro-1.6.1-py2.6.egg
>> Removing /usr/local/lib/python2.6/dist-packages/avro-1.6.1-py2.6.egg
>> Copying avro-1.6.1-py2.6.egg to /usr/local/lib/python2.6/dist-packages
>> avro 1.6.1 is already the active version in easy-install.pth
>> Installing avro script to /usr/local/bin
>>
>> Installed /usr/local/lib/python2.6/dist-packages/avro-1.6.1-py2.6.egg
>> Processing dependencies for avro==1.6.1
>> Searching for python-snappy
>> Reading http://pypi.python.org/simple/python-snappy/
>> Reading http://github.com/andrix/python-snappy
>> Best match: python-snappy 0.3.2
>> Downloading http://pypi.python.org/packages/source/p/python-snappy/python-snappy-0.3.2.tar.gz#md5=94ec3eb54a780fac3b15a6c141af973f
>> Processing python-snappy-0.3.2.tar.gz
>> Running python-snappy-0.3.2/setup.py -q bdist_egg --dist-dir /tmp/easy_install-1J0R1s/python-snappy-0.3.2/egg-dist-tmp-luBG6u
>> cc1plus: warning: command line option "-Wstrict-prototypes" is valid for Ada/C/ObjC but not for C++
>> snappymodule.cc:31:22: error: snappy-c.h: No such file or directory
>> snappymodule.cc: In function ‘PyObject* snappy__compress(PyObject*, PyObject*)’:
>> snappymodule.cc:62: error: ‘snappy_status’ was not declared in this scope
>> snappymodule.cc:62: error: expected ‘;’ before ‘status’
>> snappymodule.cc:75: error: ‘snappy_max_compressed_length’ was not declared in this scope
>> snappymodule.cc:79: error: ‘status’ was not declared in this scope
>> snappymodule.cc:79: error: ‘snappy_compress’ was not declared in this scope
>> snappymodule.cc:81: error: ‘SNAPPY_OK’ was not declared in this scope
>> snappymodule.cc: In function ‘PyObject* snappy__uncompress(PyObject*, PyObject*)’:
>> snappymodule.cc:107: error:
>> ‘snappy_status’ was not declared in this scope
>> snappymodule.cc:107: error: expected ‘;’ before ‘status’
>> snappymodule.cc:120: error: ‘status’ was not declared in this scope
>> snappymodule.cc:120: error: ‘snappy_uncompressed_length’ was not declared in this scope
>> snappymodule.cc:121: error: ‘SNAPPY_OK’ was not declared in this scope
>> snappymodule.cc:128: error: ‘snappy_uncompress’ was not declared in this scope
>> snappymodule.cc:129: error: ‘SNAPPY_OK’ was not declared in this scope
>> snappymodule.cc: In function ‘PyObject* snappy__is_valid_compressed_buffer(PyObject*, PyObject*)’:
>> snappymodule.cc:151: error: ‘snappy_status’ was not declared in this scope
>> snappymodule.cc:151: error: expected ‘;’ before ‘status’
>> snappymodule.cc:156: error: ‘status’ was not declared in this scope
>> snappymodule.cc:156: error: ‘snappy_validate_compressed_buffer’ was not declared in this scope
>> snappymodule.cc:157: error: ‘SNAPPY_OK’ was not declared in this scope
>> snappymodule.cc: At global scope:
>> snappymodule.cc:41: warning: ‘_state’ defined but not used
>> error: Setup script exited with error: command 'gcc' failed with exit status 1
>> ...avro/avro-1.6.1$ avro --help
>>
>> ************************************************************************
>>
>> [2] /avro$ sudo easy_install avro
>> Searching for avro
>> Best match: avro 1.6.1
>> Processing avro-1.6.1-py2.6.egg
>> avro 1.6.1 is already the active version in easy-install.pth
>> Installing avro script to /usr/local/bin
>>
>> Using /usr/local/lib/python2.6/dist-packages/avro-1.6.1-py2.6.egg
>> Processing dependencies for avro
>> Searching for python-snappy
>> Reading http://pypi.python.org/simple/python-snappy/
>> Reading http://github.com/andrix/python-snappy
>> Best match: python-snappy 0.3.2
>> Downloading
>> http://pypi.python.org/packages/source/p/python-snappy/python-snappy-0.3.2.tar.gz#md5=94ec3eb54a780fac3b15a6c141af973f
>> Processing python-snappy-0.3.2.tar.gz
>> Running python-snappy-0.3.2/setup.py -q bdist_egg --dist-dir /tmp/easy_install-c6jLm0/python-snappy-0.3.2/egg-dist-tmp-TTWQBN
>> cc1plus: warning: command line option "-Wstrict-prototypes" is valid for Ada/C/ObjC but not for C++
>> snappymodule.cc:31:22: error: snappy-c.h: No such file or directory
>> snappymodule.cc: In function ‘PyObject* snappy__compress(PyObject*, PyObject*)’:
>> snappymodule.cc:62: error: ‘snappy_status’ was not declared in this scope
>> snappymodule.cc:62: error: expected ‘;’ before ‘status’
>> snappymodule.cc:75: error: ‘snappy_max_compressed_length’ was not declared in this scope
>> snappymodule.cc:79: error: ‘status’ was not declared in this scope
>> snappymodule.cc:79: error: ‘snappy_compress’ was not declared in this scope
>> snappymodule.cc:81: error: ‘SNAPPY_OK’ was not declared in this scope
>> snappymodule.cc: In function ‘PyObject* snappy__uncompress(PyObject*, PyObject*)’:
>> snappymodule.cc:107: error: ‘snappy_status’ was not declared in this scope
>> snappymodule.cc:107: error: expected ‘;’ before ‘status’
>> snappymodule.cc:120: error: ‘status’ was not declared in this scope
>> snappymodule.cc:120: error: ‘snappy_uncompressed_length’ was not declared in this scope
>> snappymodule.cc:121: error: ‘SNAPPY_OK’ was not declared in this scope
>> snappymodule.cc:128: error: ‘snappy_uncompress’ was not declared in this scope
>> snappymodule.cc:129: error: ‘SNAPPY_OK’ was not declared in this scope
>> snappymodule.cc: In function ‘PyObject* snappy__is_valid_compressed_buffer(PyObject*, PyObject*)’:
>> snappymodule.cc:151: error: ‘snappy_status’ was not declared in this scope
>> snappymodule.cc:151: error: expected ‘;’ before ‘status’
>> snappymodule.cc:156: error: ‘status’ was not
>> declared in this scope
>> snappymodule.cc:156: error: ‘snappy_validate_compressed_buffer’ was not declared in this scope
>> snappymodule.cc:157: error: ‘SNAPPY_OK’ was not declared in this scope
>> snappymodule.cc: At global scope:
>> snappymodule.cc:41: warning: ‘_state’ defined but not used
>> error: Setup script exited with error: command 'gcc' failed with exit status 1
>>
>> ************************************************************************
>>
>> [3]
>>
>> python$ sudo easy_install python-snappy
>> Searching for python-snappy
>> Reading http://pypi.python.org/simple/python-snappy/
>> Reading http://github.com/andrix/python-snappy
>> Best match: python-snappy 0.3.2
>> Downloading http://pypi.python.org/packages/source/p/python-snappy/python-snappy-0.3.2.tar.gz#md5=94ec3eb54a780fac3b15a6c141af973f
>> Processing python-snappy-0.3.2.tar.gz
>> Running python-snappy-0.3.2/setup.py -q bdist_egg --dist-dir /tmp/easy_install-Hpzssm/python-snappy-0.3.2/egg-dist-tmp-UStJPW
>> gcc: error trying to exec 'cc1plus': execvp: No such file or directory
>> error: Setup script exited with error: command 'gcc' failed with exit status 1
>>
>> On Tue, Jan 24, 2012 at 11:01 AM, Harsh J <ha...@cloudera.com> wrote:
>>>
>>> Selvi,
>>>
>>> Expanding on Douglas' response, if you have installed Avro's python
>>> libraries (Simplest way to get latest stable is: "easy_install avro",
>>> or install from the distribution -- Post back if you need help on
>>> this), you can simply do, using the now-installed 'avro' executable:
>>>
>>> $ ls
>>> sample_input.avro
>>>
>>> $ avro cat sample_input.avro --format csv
>>> 011990-99999,0,-619524000000
>>> 011990-99999,22,-619506000000
>>> 011990-99999,-11,-619484400000
>>> 012650-99999,111,-655531200000
>>> 012650-99999,78,-655509600000
>>>
>>> Or, write to a resultant file, as you would regularly in a shell:
>>>
>>> $ avro cat sample_input.avro --format csv > sample_input.csv
>>>
>>> For more options on avro's cat and write opts:
>>>
>>> $ avro --help
>>>
>>> On Tue, Jan 24, 2012 at 9:01 PM, selvi k <gridsngat...@gmail.com> wrote:
>>> > Hello All,
>>> >
>>> > I would like some suggestions on where I can start in the Avro project.
>>> >
>>> > I want to be able to read from an Avro formatted log file (specifically
>>> > the History Log file created at the end of a Hadoop job) and create a
>>> > Comma Separated file of certain log entries. I need a csv file because
>>> > this is the format that is accepted by post processing software I am
>>> > working with (eg: Matlab).
>>> >
>>> > Initially I was using a BASH script to grep and awk from this file and
>>> > create my CSV file because I needed a very few values from it, and a
>>> > quick script just worked. I didn't try to get to know what format the
>>> > log file was in and utilize that. (my bad!) Now that I need to be
>>> > scaling up and want to have a reliable way to parse, I would like to try
>>> > and do it the right way.
>>> >
>>> > My question is this: For the above goal, could you please guide me with
>>> > steps I can follow - such as reading material and libraries I could try
>>> > to use. As I go through the Quick Start Guide and FAQ, I see that a lot
>>> > of the information here is geared to someone who wants to use the data
>>> > serialization and RPC functionality provided by Avro. Given that I only
>>> > want to be able to "read", where may I start?
>>> >
>>> > I can comfortably script with BASH and Perl. Given that I only see
>>> > support for Java, Python and Ruby, I think I can take this as an
>>> > opportunity to learn Python and get up to speed.
>>> >
>>> > Thanks a lot.
>>> >
>>> > -Selvi
>>> >
>>>
>>> --
>>> Harsh J
>>> Customer Ops. Engineer, Cloudera
>>
>
--
Harsh J
Customer Ops. Engineer, Cloudera