I found out what was going on here in the end. It turns out that Avro's own
decoder and Parquet's Avro support don't behave the same way: Parquet eagerly
decodes strings into Java Strings, while Avro just wraps the raw bytes in its
Utf8 wrapper
(https://github.com/apache/avro/blob/trunk/lang/java/avro/src/main/java/org/apache/avro/util/Utf8.java),
avoiding the cost of UTF-8 decoding until someone actually asks for the
string value.
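To make the difference concrete, here's a minimal sketch; Utf8 is Avro's
real class, but everything else (the class name, the sample bytes) is just
for illustration:

    import java.nio.charset.StandardCharsets;
    import org.apache.avro.util.Utf8;

    public class LazyDecodeDemo {
        public static void main(String[] args) {
            byte[] raw = "some map value".getBytes(StandardCharsets.UTF_8);

            // What Parquet's Avro support effectively does: pay the UTF-8
            // decode cost up front, for every value, used or not.
            String eager = new String(raw, StandardCharsets.UTF_8);

            // What Avro's own decoder does: wrap the bytes, defer decoding.
            Utf8 lazy = new Utf8(raw);

            // The decode only happens here, if anyone asks for the value.
            String decoded = lazy.toString();

            System.out.println(eager.equals(decoded)); // prints: true
        }
    }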
Since my code doesn't need to actually read most of the map values on each
run, the Avro decoder approach worked a lot faster for me. I can get around
this by using 'bytes' rather than 'string' in the schema and doing the
decode myself where necessary (rough sketch at the end of this mail), so
that's fine.

On 7 May 2015 at 22:27, Alex Levenson <[email protected]> wrote:

> Are you comparing the read speed on a hadoop cluster, or locally on a
> single machine? In a micro benchmark like this, using hadoop local mode for
> parquet, but not for avro, could introduce a lot of overhead. Just curious
> how you're doing the comparison.
>
> On Thu, May 7, 2015 at 1:06 PM, Robert Synnott <[email protected]> wrote:
>
>> Hi,
>> I just started trying out Parquet, and ran into a performance issue. I
>> was using the Avro support to try working with a test schema, using
>> the 'standalone' approach from here:
>>
>> http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/
>>
>> I took an existing Avro schema, consisting of a few columns each
>> containing a map, and wrote, then read back, about 40MB of data using
>> both Avro's own serialisation, and Parquet's. Parquet's ended up being
>> about five times slower. This ratio was maintained when I moved to
>> using ~1GB data. I'd expect it to be a little slower, as I was reading
>> back all columns, but five times seems high. Is there anything simple
>> I might be missing?
>> Thanks
>> Rob
>
> --
> Alex Levenson
> @THISWILLWORK

--
Robert Synnott
http://myblog.rsynnott.com
MSN: [email protected]
Jabber: [email protected]
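P.S. For anyone else who hits this, the workaround looks roughly like the
sketch below. It's illustrative rather than my actual code: the schema
fragment and the decodeValue helper are made up, and it assumes the generic
API hands 'bytes' values back as java.nio.ByteBuffer, which is the standard
mapping.

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    public class DecodeOnDemand {
        // Schema change: declare the map's values as 'bytes' instead of
        // 'string', e.g. {"type": "map", "values": "bytes"}.

        // Hypothetical helper: decode a single map value only when needed.
        static String decodeValue(ByteBuffer buf) {
            // Duplicate first so the record's buffer position is untouched.
            ByteBuffer copy = buf.duplicate();
            byte[] raw = new byte[copy.remaining()];
            copy.get(raw);
            return new String(raw, StandardCharsets.UTF_8);
        }
    }

Since most of the map values are never looked at on a given run, the UTF-8
decode drops out of the common path entirely.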
