Are you comparing the read speed on a Hadoop cluster, or locally on a single machine? In a micro-benchmark like this, using Hadoop local mode for Parquet but not for Avro could introduce a lot of overhead. Just curious how you're doing the comparison.
On Thu, May 7, 2015 at 1:06 PM, Robert Synnott <[email protected]> wrote:

> Hi,
> I just started trying out Parquet, and ran into a performance issue. I
> was using the Avro support to try working with a test schema, using
> the 'standalone' approach from here:
>
> http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/
>
> I took an existing Avro schema, consisting of a few columns each
> containing a map, and wrote, then read back, about 40MB of data using
> both Avro's own serialisation, and Parquet's. Parquet's ended up being
> about five times slower. This ratio was maintained when I moved to
> using ~1GB data. I'd expect it to be a little slower, as I was reading
> back all columns, but five times seems high. Is there anything simple
> I might be missing?
> Thanks
> Rob

-- 
Alex Levenson
@THISWILLWORK
